---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-2
- loubnabnl/github-code-duplicate
- HuggingFaceFW/fineweb-edu
- HuggingFaceTB/finemath
- PleIAs/common_corpus
language:
- en
- hi
- es
- la
- fr
- de
- el
- pl
- it
- ar
metrics:
- accuracy
- perplexity
pipeline_tag: text2text-generation
tags:
- custom_code
---
# Celestia Mark 1

**Hybrid Multilingual Autoregressive Language Model**(Model file and usage code will be uploaded soon).

---

## Overview

Celestia Mark 1 is a leading-edge, mid-sized autoregressive language model built with a novel hybrid architecture that fuses Transformer, Mixture of Experts (MoE), and Chain of Experts (CoE) layers. It is designed for multi-domain and multilingual tasks, supporting code, math, education, and general reasoning. Celestia Mark 1 is currently undergoing incremental training and has already processed over **4 billion tokens** (target: 10B tokens).

- **Model Size:** ~360M parameters
- **Architecture:** Hybrid (Transformer + MoE + CoE)
- **Training Approach:** Autoregressive (completion-ready), with fine-tuning support for classification, code, math, multilingual tasks, and more
- **License:** Apache 2.0

---

## Training Domains and Languages

Celestia Mark 1 is trained on a rich and diverse set of datasets, featuring both human and programming languages:

**Human Languages Used:**
- English
- Hindi (Latin script)
- Arabic
- French
- German
- Spanish
- Italian
- Polish
- Greek
- Latin

**Programming Languages Used (13 total):**
- Python
- JavaScript
- TypeScript
- Java
- C
- C++
- C#
- Go
- Shell
- Bash
- HTML
- CSS
- SQL

**Other Domains:**
- Math (symbolic, numeric, and educational datasets)
- Education (FineWeb-Edu, Finemath-4plus)
- General web text (Common Corpus, FineWeb-2)

---

## Performance Benchmarks

| Model                | Params | Tokens Trained | Loss  | Perplexity | Accuracy | Architecture                              | Multilingual | Domains                   |
|----------------------|--------|----------------|-------|------------|----------|--------------------------------------------|--------------|---------------------------|
| **Celestia Mark 1**  | 360M   | 4B (ongoing)   | 2.9   | 25         | 47%      | Transformer + MoE + CoE (Hybrid)           | ✅ Yes       | General                   |
| GPT-2 Medium         | 345M   | 40B            | 3.3   | 28–35      | 35–43%   | Dense Transformer                          | ❌ No        | English                   |
| GPT-2 Large          | 774M   | 40B            | 3.2   | 27–33      | 38–44%   | Dense Transformer                          | ❌ No        | English                   |
| Pythia-410M          | 410M   | 300B           | 2.9   | 30         | ~42%     | Dense Transformer                          | ❌ No        | English                   |
| Pythia-1B            | 1B     | 300B           | 2.7   | 27         | ~45%     | Dense Transformer                          | ❌ No        | English                   |
| CodeParrot           | 110M   | 22B            | 2.7   | 30–35      | 37%      | Dense Transformer (code-focused)           | ❌ No        | Python code               |
| Qwen-1B              | 1B     | ~15B           | 2.8   | 27         | 45%      | Dense Transformer                          | ✅ Yes       | General                   |
| Jamba-1.1B           | 1.1B   | 20B            | 2.7   | 23         | 48%      | Hybrid Transformer-Mamba                   | ✅ Yes       | General                   |
| Phi-2                | 2.7B   | 1.4T           | 2.5   | 21         | ~52%     | Dense Transformer, curated data            | ✅ Yes       | General                   |
| Llama-2 7B           | 7B     | 2T             | 2.7   | 21         | ~52%     | Dense Transformer                          | ✅ Yes       | General                   |
| Mistral 7B           | 7B     | 1.5T           | 2.6   | 19         | ~54%     | Dense Transformer                          | ✅ Yes       | General                   |

*Sources: Official model papers, leaderboards, OpenReview, Datawizz, DataCamp, Microsoft Research.*

---

## Why Celestia Mark 1 Is Superior

- **Hybrid Architecture:** Celestia Mark 1 alternates Transformer layers with Mixture of Experts (MoE) and Chain of Experts (CoE) blocks, enabling dynamic routing, specialization, and iterative reasoning. This hybrid design delivers better accuracy and generalization for a given model size compared to pure Transformer models.
- **Multilingual & Multi-Domain:** Trained on 10 human languages and 13 programming languages, as well as math and educational data, Celestia Mark 1 covers a vastly broader scope than similarly-sized models.
- **Efficient Learning:** Achieves competitive or superior loss, perplexity, and accuracy compared to much larger models trained on more data, due to efficient expert routing and architectural innovation.
- **Generalization & Adaptability:** Performs robustly on code, math, multilingual, and web text, while remaining easy to fine-tune for classification, translation, and symbolic reasoning.
- **Open Weights & License:** Released under Apache 2.0 for free research and commercial use.

---

## Hybrid Architecture Explained

Celestia Mark 1’s architecture is designed for maximal flexibility and specialization:

- **Transformer Layers:** Provide standard attention-based modeling for generalization.
- **Mixture of Experts (MoE):** Multiple expert networks are selectively activated for each token, increasing model capacity and specialization without increasing compute for all tokens.
- **Chain of Experts (CoE):** Allows iterative refinement and multi-step reasoning, particularly beneficial for symbolic, mathematical, and code tasks.

This hybrid approach enables Celestia Mark 1 to outperform pure Transformers in multilingual, code, and math domains, even with fewer parameters and less data.

---

## Limitations

Celestia Mark 1 is still undergoing incremental training. As such:
- Some factual outputs may be inaccurate or incomplete.
- Performance will continue to improve as training approaches the 100 billion token goal.
- For highly factual, up-to-date, or specialized knowledge, verification is recommended.

---

## Usage

Celestia Mark 1 can be used for:
- **Completions** (default autoregressive)
- **Fine-tuning:** Classification, code generation, math, translation, and more
- **Multilingual & multi-domain** applications

See [`usage.py`](./usage.py) for quick-start instructions.

---

## License

Apache 2.0 — free for research and commercial use.

---

## Contact

For support or questions, contact: **naqeeb.ajk63@gmail.com**