Update pipeline tag and add library name for `ettin-decoder-32m`

#2 opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +45 -42
README.md CHANGED
@@ -1,9 +1,11 @@
  ---
- license: mit
  language:
  - en
- pipeline_tag: fill-mask
+ license: mit
+ pipeline_tag: text-generation
+ library_name: transformers
  ---
+
  # Ettin: an Open Suite of Paired Encoders and Decoders

  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
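With `pipeline_tag: text-generation` and `library_name: transformers`, the Hub serves this repo through the standard `transformers` text-generation flow. A minimal sketch of what that metadata implies (model id taken from the PR title; the prompt and generation length are illustrative):

```python
# Sketch: the updated metadata maps this repo to the text-generation pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="jhu-clsp/ettin-decoder-32m")
print(generator("The capital of France is", max_new_tokens=20)[0]["generated_text"])
```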
@@ -82,11 +84,11 @@ model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

  Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:

- 1. **Identical training data** - Same high-quality mixture across all models
- 2. **Open Training Data** - Data is available now with batch-level training data for each of the 250+ checkpoints
- 3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
- 4. **Consistent training recipe** - Three-phase training with 2T tokens
- 5. **Multiple scales** - From 17M to 1B parameters
+ 1. **Identical training data** - Same high-quality mixture across all models
+ 2. **Open training data** - The full mixture is released, along with the batch-level data order for each of the 250+ checkpoints
+ 3. **Matched architectures** - Differing only in attention patterns (bidirectional vs. causal) and training objectives (MLM vs. CLM)
+ 4. **Consistent training recipe** - Three-phase training with 2T tokens
+ 5. **Multiple scales** - From 17M to 1B parameters

  This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.
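Item 3 is directly observable when loading a matched pair. A minimal sketch, assuming the encoder id `jhu-clsp/ettin-encoder-150m` mirrors the decoder naming used in the hunk header above:

```python
# Sketch: load a matched encoder/decoder pair at the same scale.
# jhu-clsp/ettin-encoder-150m is an assumed id mirroring the decoder's naming.
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM

encoder = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")  # bidirectional, MLM
decoder = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")  # causal, CLM

# Matched backbones: the configs should agree on core dimensions,
# differing only in objective-specific pieces.
print(encoder.config.hidden_size == decoder.config.hidden_size)
```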
 
@@ -94,10 +96,10 @@ This approach allows for true apples-to-apples comparisons between encoder and d

  The training data is publicly available and split across different phases:

- - **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
- - **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
- - **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
- - **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step)
+ - **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
+ - **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
+ - **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
+ - **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: `input_ids`, `step`)

  ## Model Family
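The `ettin-data-order` dataset listed above is what makes checkpoint-level replication possible. A sketch of reading it in streaming mode (the `train` split name is an assumption):

```python
# Sketch: stream the batch-level training order without a full download.
from datasets import load_dataset

order = load_dataset("jhu-clsp/ettin-data-order", split="train", streaming=True)
row = next(iter(order))
print(row["step"], len(row["input_ids"]))  # training step and its token ids
```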
 
@@ -143,13 +145,13 @@ These models demonstrate what happens when you continue training encoders as dec
  **Load as decoders** using `AutoModelForCausalLM`:

  | Size | Model | Parameters | Description | Download |
- |:-----|:------|:-----------|:------------|:---------|
+ |:-----|:------|:-----------|:------------|:---------|
  | XXS | [ettin-decoder-from-encoder-17m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) | 17M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) |
  | XS | [ettin-decoder-from-encoder-32m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) | 32M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) |
- | Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) |
- | Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) |
- | Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) |
- | XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) | 1B | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) |
+ | Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) |
+ | Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) |
+ | Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) |
+ | XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) | 1B | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) |

  **Example Usage for Cross-Objective Models:**
  ```python
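# Hedged sketch of the truncated example above: the cross-objective checkpoints
# load exactly like native decoders. Prompt and settings are illustrative.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")

inputs = tokenizer("Encoders and decoders differ mainly in", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))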
@@ -174,9 +176,9 @@ All raw training checkpoints are available in the [jhu-clsp/ettin-checkpoints](h
  #### HuggingFace Format Checkpoints
  Each model repository contains multiple tagged versions representing different training stages:

- - **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
- - **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
- - **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)
+ - **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
+ - **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
+ - **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
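# Sketch: each tag above is an ordinary git revision, so a specific training
# stage can be pinned with the `revision` argument (tag name from the list
# above; the 150m decoder id is reused from earlier in this README).
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m", revision="step599525")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m", revision="step599525")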
@@ -209,27 +211,27 @@ This checkpoint availability enables detailed analysis of training dynamics, los

  Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:

- - **Identical Training Data**: Same 2T token mixture across all models
- - **Matched Architectures**: Only attention patterns and objectives differ
- - **Open Everything**: Training data, model weights, and batch-level training order
- - **Multiple Scales**: Fair comparison from 17M to 1B parameters
- - **250+ Checkpoints**: Complete training trajectory analysis
+ - **Identical Training Data**: Same 2T token mixture across all models
+ - **Matched Architectures**: Only attention patterns and objectives differ
+ - **Open Everything**: Training data, model weights, and batch-level training order
+ - **Multiple Scales**: Fair comparison from 17M to 1B parameters
+ - **250+ Checkpoints**: Complete training trajectory analysis

  ### Use Cases for Researchers

- - **Architecture Studies**: Compare encoder vs. decoder capabilities fairly
- - **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
- - **Scaling Laws**: Study how architectural advantages change with scale
- - **Transfer Learning**: Investigate cross-objective training effectiveness
- - **Replication Studies**: First open replication of the ModernBERT training recipe
+ - **Architecture Studies**: Compare encoder vs. decoder capabilities fairly
+ - **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
+ - **Scaling Laws**: Study how architectural advantages change with scale
+ - **Transfer Learning**: Investigate cross-objective training effectiveness
+ - **Replication Studies**: First open replication of the ModernBERT training recipe

  ### Reproducibility

  All training artifacts are publicly available:
- - Training data with exact batch ordering
- - Model checkpoints every 8.5B tokens
- - Complete hyperparameter configurations
- - Training code and evaluation scripts
+ - Training data with exact batch ordering
+ - Model checkpoints every 8.5B tokens
+ - Complete hyperparameter configurations
+ - Training code and evaluation scripts

  ## Training Details
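The checkpoint cadence described above supports loss-trajectory studies directly. A hedged sketch, using one example tag per phase (tag names quoted earlier in this diff):

```python
# Sketch: evaluate a fixed input's loss across the three training phases.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "jhu-clsp/ettin-decoder-150m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
batch = tokenizer("Paris is the capital of France.", return_tensors="pt")

for tag in ["step599525", "ext1000", "decay100"]:  # example tags from this README
    model = AutoModelForCausalLM.from_pretrained(model_id, revision=tag)
    with torch.no_grad():
        loss = model(**batch, labels=batch["input_ids"]).loss
    print(f"{tag}: loss = {loss.item():.3f}")
```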
 
@@ -238,14 +240,14 @@ All training artifacts are publicly available:
  **Architecture:** Transformer with RoPE, GLU activations, and prenorm layers

  **Training Phases:**
- - **Pre-training**: 1.7T tokens with diverse data mixture
- - **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
- - **Decay phase**: 100B tokens with premium data sources
+ - **Pre-training**: 1.7T tokens with diverse data mixture
+ - **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
+ - **Decay phase**: 100B tokens with premium data sources

  **Key Features:**
- - Context length: Up to 8K tokens
- - Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- - Deep but efficient architectures following MobileLLM principles
+ - Context length: Up to 8K tokens
+ - Vocabulary: 50,368 tokens (ModernBERT tokenizer)
+ - Deep but efficient architectures following MobileLLM principles

  ## Model Architecture
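Both key-feature claims are checkable from the published configs. A sketch, assuming standard `transformers` attribute names:

```python
# Sketch: verify vocabulary size and context length from the config/tokenizer.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("jhu-clsp/ettin-decoder-150m")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")

print(config.max_position_embeddings)  # expected 8192, per the 8K context claim
print(len(tokenizer))                  # expected 50,368 (ModernBERT tokenizer)
```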
 
@@ -262,7 +264,7 @@ All training artifacts are publicly available:

  ### Encoder: Masked Language Modeling
  <details>
- <summary>Click to expand <strong>encoder</strong> usage examples</summary>
+ <summary>Click to expand <strong>encoder</strong> usage examples</summary>

  ```python
  from transformers import AutoTokenizer, AutoModelForMaskedLM
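# Hedged completion of this truncated example: fill-mask with an encoder.
# The encoder id is an assumption following the family naming scheme.
import torch

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")

inputs = tokenizer(f"The capital of France is {tokenizer.mask_token}.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))  # top prediction for the mask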
@@ -296,7 +298,7 @@ print(f"Predictions: {predictions}")
  ### Decoder: Text Generation

  <details>
- <summary>Click to expand <strong>decoder text generation</strong></summary>
+ <summary>Click to expand <strong>decoder text generation</strong></summary>

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -783,7 +785,8 @@ def main():
          model.push_to_hub(run_name)
      except Exception:
          logging.error(
-             f"Error uploading model to the Hugging Face Hub:\n{traceback.format_exc()}To upload it manually, you can run "
+             f"Error uploading model to the Hugging Face Hub:\n"
+             f"{traceback.format_exc()}To upload it manually, you can run "
              f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
              f"and saving it using `model.push_to_hub('{run_name}')`."
          )
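# Illustrative note on the reflowed call above: Python implicitly concatenates
# adjacent string literals, so the split f-strings still form one log message.
assert "Error uploading " "model" == "Error uploading model"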
 