--- license: apache-2.0 datasets: - HuggingFaceFW/fineweb-2 - loubnabnl/github-code-duplicate - HuggingFaceFW/fineweb-edu - HuggingFaceTB/finemath - PleIAs/common_corpus language: - en - hi - es - la - fr - de - el - pl - it - ar metrics: - accuracy - perplexity pipeline_tag: text2text-generation tags: - custom_code --- # Celestia Mark 1 **Hybrid Multilingual Autoregressive Language Model**(Model file and usage code will be uploaded soon). --- ## Overview Celestia Mark 1 is a leading-edge, mid-sized autoregressive language model built with a novel hybrid architecture that fuses Transformer, Mixture of Experts (MoE), and Chain of Experts (CoE) layers. It is designed for multi-domain and multilingual tasks, supporting code, math, education, and general reasoning. Celestia Mark 1 is currently undergoing incremental training and has already processed over **4 billion tokens** (target: 10B tokens). - **Model Size:** ~360M parameters - **Architecture:** Hybrid (Transformer + MoE + CoE) - **Training Approach:** Autoregressive (completion-ready), with fine-tuning support for classification, code, math, multilingual tasks, and more - **License:** Apache 2.0 --- ## Training Domains and Languages Celestia Mark 1 is trained on a rich and diverse set of datasets, featuring both human and programming languages: **Human Languages Used:** - English - Hindi (Latin script) - Arabic - French - German - Spanish - Italian - Polish - Greek - Latin **Programming Languages Used (13 total):** - Python - JavaScript - TypeScript - Java - C - C++ - C# - Go - Shell - Bash - HTML - CSS - SQL **Other Domains:** - Math (symbolic, numeric, and educational datasets) - Education (FineWeb-Edu, Finemath-4plus) - General web text (Common Corpus, FineWeb-2) --- ## Performance Benchmarks | Model | Params | Tokens Trained | Loss | Perplexity | Accuracy | Architecture | Multilingual | Domains | |----------------------|--------|----------------|-------|------------|----------|--------------------------------------------|--------------|---------------------------| | **Celestia Mark 1** | 360M | 4B (ongoing) | 2.9 | 25 | 47% | Transformer + MoE + CoE (Hybrid) | ✅ Yes | General | | GPT-2 Medium | 345M | 40B | 3.3 | 28–35 | 35–43% | Dense Transformer | ❌ No | English | | GPT-2 Large | 774M | 40B | 3.2 | 27–33 | 38–44% | Dense Transformer | ❌ No | English | | Pythia-410M | 410M | 300B | 2.9 | 30 | ~42% | Dense Transformer | ❌ No | English | | Pythia-1B | 1B | 300B | 2.7 | 27 | ~45% | Dense Transformer | ❌ No | English | | CodeParrot | 110M | 22B | 2.7 | 30–35 | 37% | Dense Transformer (code-focused) | ❌ No | Python code | | Qwen-1B | 1B | ~15B | 2.8 | 27 | 45% | Dense Transformer | ✅ Yes | General | | Jamba-1.1B | 1.1B | 20B | 2.7 | 23 | 48% | Hybrid Transformer-Mamba | ✅ Yes | General | | Phi-2 | 2.7B | 1.4T | 2.5 | 21 | ~52% | Dense Transformer, curated data | ✅ Yes | General | | Llama-2 7B | 7B | 2T | 2.7 | 21 | ~52% | Dense Transformer | ✅ Yes | General | | Mistral 7B | 7B | 1.5T | 2.6 | 19 | ~54% | Dense Transformer | ✅ Yes | General | *Sources: Official model papers, leaderboards, OpenReview, Datawizz, DataCamp, Microsoft Research.* --- ## Why Celestia Mark 1 Is Superior - **Hybrid Architecture:** Celestia Mark 1 alternates Transformer layers with Mixture of Experts (MoE) and Chain of Experts (CoE) blocks, enabling dynamic routing, specialization, and iterative reasoning. This hybrid design delivers better accuracy and generalization for a given model size compared to pure Transformer models. - **Multilingual & Multi-Domain:** Trained on 10 human languages and 13 programming languages, as well as math and educational data, Celestia Mark 1 covers a vastly broader scope than similarly-sized models. - **Efficient Learning:** Achieves competitive or superior loss, perplexity, and accuracy compared to much larger models trained on more data, due to efficient expert routing and architectural innovation. - **Generalization & Adaptability:** Performs robustly on code, math, multilingual, and web text, while remaining easy to fine-tune for classification, translation, and symbolic reasoning. - **Open Weights & License:** Released under Apache 2.0 for free research and commercial use. --- ## Hybrid Architecture Explained Celestia Mark 1’s architecture is designed for maximal flexibility and specialization: - **Transformer Layers:** Provide standard attention-based modeling for generalization. - **Mixture of Experts (MoE):** Multiple expert networks are selectively activated for each token, increasing model capacity and specialization without increasing compute for all tokens. - **Chain of Experts (CoE):** Allows iterative refinement and multi-step reasoning, particularly beneficial for symbolic, mathematical, and code tasks. This hybrid approach enables Celestia Mark 1 to outperform pure Transformers in multilingual, code, and math domains, even with fewer parameters and less data. --- ## Limitations Celestia Mark 1 is still undergoing incremental training. As such: - Some factual outputs may be inaccurate or incomplete. - Performance will continue to improve as training approaches the 100 billion token goal. - For highly factual, up-to-date, or specialized knowledge, verification is recommended. --- ## Usage Celestia Mark 1 can be used for: - **Completions** (default autoregressive) - **Fine-tuning:** Classification, code generation, math, translation, and more - **Multilingual & multi-domain** applications See [`usage.py`](./usage.py) for quick-start instructions. --- ## License Apache 2.0 — free for research and commercial use. --- ## Contact For support or questions, contact: **naqeeb.ajk63@gmail.com**