# Celestia Mark 1

Hybrid Multilingual Autoregressive Language Model. (Model file and usage code will be uploaded soon.)
## Overview
Celestia Mark 1 is a leading-edge, mid-sized autoregressive language model built with a novel hybrid architecture that fuses Transformer, Mixture of Experts (MoE), and Chain of Experts (CoE) layers. It is designed for multi-domain and multilingual tasks, supporting code, math, education, and general reasoning. Celestia Mark 1 is currently undergoing incremental training and has already processed over 4 billion tokens (target: 10B tokens).
- Model Size: ~360M parameters
- Architecture: Hybrid (Transformer + MoE + CoE)
- Training Approach: Autoregressive (completion-ready), with fine-tuning support for classification, code, math, multilingual tasks, and more
- License: Apache 2.0
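For orientation, the sketch below collects these specifications into a single configuration object. Every field name, default value, and the assumed block ordering are illustrative placeholders, not the released Celestia Mark 1 configuration.

```python
# Illustrative configuration sketch -- field names and defaults are assumptions,
# not the actual Celestia Mark 1 configuration file.
from dataclasses import dataclass


@dataclass
class CelestiaConfig:
    n_params_approx: str = "360M"                            # approximate total parameter count
    layer_pattern: tuple = ("transformer", "moe", "coe")      # repeating hybrid block order (assumed)
    objective: str = "autoregressive"                         # next-token prediction, completion-ready
    fine_tune_tasks: tuple = ("classification", "code", "math", "multilingual")
    license: str = "Apache-2.0"


if __name__ == "__main__":
    print(CelestiaConfig())
```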
## Training Domains and Languages
Celestia Mark 1 is trained on a rich and diverse set of datasets, featuring both human and programming languages:
Human Languages Used (10 total):
- English
- Hindi (Latin script)
- Arabic
- French
- German
- Spanish
- Italian
- Polish
- Greek
- Latin
Programming Languages Used (13 total):
- Python
- JavaScript
- TypeScript
- Java
- C
- C++
- C#
- Go
- Shell
- Bash
- HTML
- CSS
- SQL
Other Domains:
- Math (symbolic, numeric, and educational datasets)
- Education (FineWeb-Edu, Finemath-4plus)
- General web text (Common Corpus, FineWeb-2)
## Performance Benchmarks

| Model | Params | Tokens Trained | Loss | Perplexity | Accuracy | Architecture | Multilingual | Domains |
|---|---|---|---|---|---|---|---|---|
| Celestia Mark 1 | 360M | 4B (ongoing) | 2.9 | 25 | 47% | Transformer + MoE + CoE (Hybrid) | Yes | General |
| GPT-2 Medium | 345M | 40B | 3.3 | 28–35 | 35–43% | Dense Transformer | No | English |
| GPT-2 Large | 774M | 40B | 3.2 | 27–33 | 38–44% | Dense Transformer | No | English |
| Pythia-410M | 410M | 300B | 2.9 | 30 | ~42% | Dense Transformer | No | English |
| Pythia-1B | 1B | 300B | 2.7 | 27 | ~45% | Dense Transformer | No | English |
| CodeParrot | 110M | 22B | 2.7 | 30–35 | 37% | Dense Transformer (code-focused) | No | Python code |
| Qwen-1B | 1B | ~15B | 2.8 | 27 | 45% | Dense Transformer | Yes | General |
| Jamba-1.1B | 1.1B | 20B | 2.7 | 23 | 48% | Hybrid Transformer-Mamba | Yes | General |
| Phi-2 | 2.7B | 1.4T | 2.5 | 21 | ~52% | Dense Transformer, curated data | Yes | General |
| Llama-2 7B | 7B | 2T | 2.7 | 21 | ~52% | Dense Transformer | Yes | General |
| Mistral 7B | 7B | 1.5T | 2.6 | 19 | ~54% | Dense Transformer | Yes | General |
Sources: Official model papers, leaderboards, OpenReview, Datawizz, DataCamp, Microsoft Research.
## Why Celestia Mark 1 Is Superior
- Hybrid Architecture: Celestia Mark 1 alternates Transformer layers with Mixture of Experts (MoE) and Chain of Experts (CoE) blocks, enabling dynamic routing, specialization, and iterative reasoning. This hybrid design delivers better accuracy and generalization for a given model size compared to pure Transformer models.
- Multilingual & Multi-Domain: Trained on 10 human languages and 13 programming languages, as well as math and educational data, Celestia Mark 1 covers a vastly broader scope than similarly-sized models.
- Efficient Learning: Achieves competitive or superior loss, perplexity, and accuracy compared to much larger models trained on more data, due to efficient expert routing and architectural innovation.
- Generalization & Adaptability: Performs robustly on code, math, multilingual, and web text, while remaining easy to fine-tune for classification, translation, and symbolic reasoning.
- Open Weights & License: Released under Apache 2.0 for free research and commercial use.
## Hybrid Architecture Explained
Celestia Mark 1βs architecture is designed for maximal flexibility and specialization:
- Transformer Layers: Provide standard attention-based modeling for generalization.
- Mixture of Experts (MoE): Multiple expert networks are selectively activated for each token, increasing model capacity and specialization without increasing compute for all tokens.
- Chain of Experts (CoE): Allows iterative refinement and multi-step reasoning, particularly beneficial for symbolic, mathematical, and code tasks.
This hybrid approach enables Celestia Mark 1 to outperform pure Transformers in multilingual, code, and math domains, even with fewer parameters and less data.
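The sketch below shows, in PyTorch, one way the alternation described above could be wired together: every block keeps standard self-attention, while the feed-forward position cycles between a dense MLP, a top-1-routed MoE layer, and an iterative CoE refinement chain. All dimensions, the routing rule, the refinement depth, and the block ordering are assumptions for illustration, not the actual Celestia Mark 1 implementation.

```python
# Illustrative hybrid-stack sketch -- sizes, routing, and block order are assumptions,
# not the real Celestia Mark 1 architecture.
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    """Mixture of Experts over the feed-forward position, with a softmax router."""

    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        gates = self.router(x).softmax(dim=-1)                              # (batch, seq, n_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)     # (batch, seq, d_model, n_experts)
        return torch.einsum("bse,bsde->bsd", gates, expert_outs)            # gate-weighted expert mixture


class CoELayer(nn.Module):
    """Chain of Experts: a short chain of experts, each refining the previous output."""

    def __init__(self, d_model: int, n_steps: int = 2):
        super().__init__()
        self.steps = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_steps)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for step in self.steps:
            x = x + step(x)  # iterative residual refinement
        return x


class HybridBlock(nn.Module):
    """Self-attention followed by a dense, MoE, or CoE feed-forward sublayer."""

    def __init__(self, d_model: int, n_heads: int, ffn: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = ffn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))


def build_hybrid_stack(d_model: int = 512, n_heads: int = 8, n_layers: int = 6) -> nn.Sequential:
    """Cycle dense, MoE, and CoE feed-forward blocks through the stack."""
    blocks = []
    for i in range(n_layers):
        if i % 3 == 1:
            ffn: nn.Module = MoELayer(d_model)
        elif i % 3 == 2:
            ffn = CoELayer(d_model)
        else:
            ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        blocks.append(HybridBlock(d_model, n_heads, ffn))
    return nn.Sequential(*blocks)


if __name__ == "__main__":
    model = build_hybrid_stack()
    dummy = torch.randn(2, 16, 512)   # (batch, seq, d_model) dummy embeddings
    print(model(dummy).shape)          # torch.Size([2, 16, 512])
```

Note that a production MoE layer would use sparse dispatch (running only the selected experts per token) plus a load-balancing loss; the dense mixture above is kept only for clarity.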
## Limitations
Celestia Mark 1 is still undergoing incremental training. As such:
- Some factual outputs may be inaccurate or incomplete.
- Performance will continue to improve as training progresses toward the full token target.
- For highly factual, up-to-date, or specialized knowledge, verification is recommended.
## Usage
Celestia Mark 1 can be used for:
- Completions (default autoregressive)
- Fine-tuning: Classification, code generation, math, translation, and more
- Multilingual & multi-domain applications
See `usage.py` for quick-start instructions.
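Because the checkpoint and `usage.py` have not been uploaded yet, the snippet below is only a sketch of what completion-style inference might look like for a Hugging Face-format release. The repository id is a placeholder, and loading through the `transformers` Auto classes is an assumption until the official quick-start is published.

```python
# Hypothetical quick-start sketch -- the checkpoint is not yet published, and the
# repository id below is a placeholder assumption; see usage.py once it is uploaded.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/celestia-mark-1"  # placeholder: replace with the actual repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For the fine-tuning use cases listed above, the same checkpoint would be loaded the same way and trained with a standard causal-LM fine-tuning loop.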
## License
Apache 2.0 β free for research and commercial use.
## Contact
For support or questions, contact: naqeeb.ajk63@gmail.com