Update pipeline tag and add library name for `ettin-decoder-32m`

#2 opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +45 -42
README.md CHANGED
@@ -1,9 +1,11 @@
  ---
- license: mit
  language:
  - en
- pipeline_tag: fill-mask
+ license: mit
+ pipeline_tag: text-generation
+ library_name: transformers
  ---
+
  # Ettin: an Open Suite of Paired Encoders and Decoders

  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
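With `pipeline_tag: text-generation` and `library_name: transformers`, the Hub serves this repo through the standard `transformers` text-generation flow. A minimal sketch of what that metadata implies (model id taken from the PR title; the prompt and generation length are illustrative):

```python
# Sketch: the updated metadata maps this repo to the text-generation pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="jhu-clsp/ettin-decoder-32m")
print(generator("The capital of France is", max_new_tokens=20)[0]["generated_text"])
```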
@@ -82,11 +84,11 @@ model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

  Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:

- 1. **Identical training data** - Same high-quality mixture across all models
- 2. **Open Training Data** - Data is available now with batch-level training data for each of the 250+ checkpoints
- 3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
- 4. **Consistent training recipe** - Three-phase training with 2T tokens
- 5. **Multiple scales** - From 17M to 1B parameters
+ 1. **Identical training data** - Same high-quality mixture across all models
+ 2. **Open training data** - The full mixture is released, along with the batch-level data order for each of the 250+ checkpoints
+ 3. **Matched architectures** - Differing only in attention patterns (bidirectional vs. causal) and training objectives (MLM vs. CLM)
+ 4. **Consistent training recipe** - Three-phase training with 2T tokens
+ 5. **Multiple scales** - From 17M to 1B parameters

  This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.
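Item 3 is directly observable when loading a matched pair. A minimal sketch, assuming the encoder id `jhu-clsp/ettin-encoder-150m` mirrors the decoder naming used in the hunk header above:

```python
# Sketch: load a matched encoder/decoder pair at the same scale.
# jhu-clsp/ettin-encoder-150m is an assumed id mirroring the decoder's naming.
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM

encoder = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")  # bidirectional, MLM
decoder = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")  # causal, CLM

# Matched backbones: the configs should agree on core dimensions,
# differing only in objective-specific pieces.
print(encoder.config.hidden_size == decoder.config.hidden_size)
```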
 
@@ -94,10 +96,10 @@ This approach allows for true apples-to-apples comparisons between encoder and d

  The training data is publicly available and split across different phases:

- - **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
- - **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
- - **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
- - **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step)
+ - **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
+ - **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
+ - **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
+ - **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: `input_ids`, `step`)

  ## Model Family
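The `ettin-data-order` dataset listed above is what makes checkpoint-level replication possible. A sketch of reading it in streaming mode (the `train` split name is an assumption):

```python
# Sketch: stream the batch-level training order without a full download.
from datasets import load_dataset

order = load_dataset("jhu-clsp/ettin-data-order", split="train", streaming=True)
row = next(iter(order))
print(row["step"], len(row["input_ids"]))  # training step and its token ids
```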
 
@@ -143,13 +145,13 @@ These models demonstrate what happens when you continue training encoders as dec
  **Load as decoders** using `AutoModelForCausalLM`:

  | Size | Model | Parameters | Description | Download |
- |:-----|:------|:-----------|:------------|:---------|
+ |:-----|:------|:-----------|:------------|:---------|
  | XXS | [ettin-decoder-from-encoder-17m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) | 17M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) |
  | XS | [ettin-decoder-from-encoder-32m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) | 32M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) |
- | Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) |
- | Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) |
- | Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) |
- | XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) | 1B | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) |
+ | Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) |
+ | Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) |
+ | Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) |
+ | XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) | 1B | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) |

  **Example Usage for Cross-Objective Models:**
  ```python
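# Hedged sketch of the truncated example above: the cross-objective checkpoints
# load exactly like native decoders. Prompt and settings are illustrative.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")

inputs = tokenizer("Encoders and decoders differ mainly in", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))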
@@ -174,9 +176,9 @@ All raw training checkpoints are available in the [jhu-clsp/ettin-checkpoints](h
  #### HuggingFace Format Checkpoints
  Each model repository contains multiple tagged versions representing different training stages:

- - **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
- - **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
- - **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)
+ - **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
+ - **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
+ - **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
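# Sketch: each tag above is an ordinary git revision, so a specific training
# stage can be pinned with the `revision` argument (tag name from the list
# above; the 150m decoder id is reused from earlier in this README).
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m", revision="step599525")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m", revision="step599525")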
@@ -209,27 +211,27 @@ This checkpoint availability enables detailed analysis of training dynamics, los

  Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:

- - **Identical Training Data**: Same 2T token mixture across all models
- - **Matched Architectures**: Only attention patterns and objectives differ
- - **Open Everything**: Training data, model weights, and batch-level training order
- - **Multiple Scales**: Fair comparison from 17M to 1B parameters
- - **250+ Checkpoints**: Complete training trajectory analysis
+ - **Identical Training Data**: Same 2T token mixture across all models
+ - **Matched Architectures**: Only attention patterns and objectives differ
+ - **Open Everything**: Training data, model weights, and batch-level training order
+ - **Multiple Scales**: Fair comparison from 17M to 1B parameters
+ - **250+ Checkpoints**: Complete training trajectory analysis

  ### Use Cases for Researchers

- - **Architecture Studies**: Compare encoder vs. decoder capabilities fairly
- - **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
- - **Scaling Laws**: Study how architectural advantages change with scale
- - **Transfer Learning**: Investigate cross-objective training effectiveness
- - **Replication Studies**: First open replication of the ModernBERT training recipe
+ - **Architecture Studies**: Compare encoder vs. decoder capabilities fairly
+ - **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
+ - **Scaling Laws**: Study how architectural advantages change with scale
+ - **Transfer Learning**: Investigate cross-objective training effectiveness
+ - **Replication Studies**: First open replication of the ModernBERT training recipe

  ### Reproducibility

  All training artifacts are publicly available:
- - Training data with exact batch ordering
- - Model checkpoints every 8.5B tokens
- - Complete hyperparameter configurations
- - Training code and evaluation scripts
+ - Training data with exact batch ordering
+ - Model checkpoints every 8.5B tokens
+ - Complete hyperparameter configurations
+ - Training code and evaluation scripts

  ## Training Details
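The checkpoint cadence described above supports loss-trajectory studies directly. A hedged sketch, using one example tag per phase (tag names quoted earlier in this diff):

```python
# Sketch: evaluate a fixed input's loss across the three training phases.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "jhu-clsp/ettin-decoder-150m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
batch = tokenizer("Paris is the capital of France.", return_tensors="pt")

for tag in ["step599525", "ext1000", "decay100"]:  # example tags from this README
    model = AutoModelForCausalLM.from_pretrained(model_id, revision=tag)
    with torch.no_grad():
        loss = model(**batch, labels=batch["input_ids"]).loss
    print(f"{tag}: loss = {loss.item():.3f}")
```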
 
@@ -238,14 +240,14 @@ All training artifacts are publicly available:
  **Architecture:** Transformer with RoPE, GLU activations, and prenorm layers

  **Training Phases:**
- - **Pre-training**: 1.7T tokens with diverse data mixture
- - **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
- - **Decay phase**: 100B tokens with premium data sources
+ - **Pre-training**: 1.7T tokens with diverse data mixture
+ - **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
+ - **Decay phase**: 100B tokens with premium data sources

  **Key Features:**
- - Context length: Up to 8K tokens
- - Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- - Deep but efficient architectures following MobileLLM principles
+ - Context length: Up to 8K tokens
+ - Vocabulary: 50,368 tokens (ModernBERT tokenizer)
+ - Deep but efficient architectures following MobileLLM principles

  ## Model Architecture
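Both key-feature claims are checkable from the published configs. A sketch, assuming standard `transformers` attribute names:

```python
# Sketch: verify vocabulary size and context length from the config/tokenizer.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("jhu-clsp/ettin-decoder-150m")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")

print(config.max_position_embeddings)  # expected 8192, per the 8K context claim
print(len(tokenizer))                  # expected 50,368 (ModernBERT tokenizer)
```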
 
@@ -262,7 +264,7 @@ All training artifacts are publicly available:

  ### Encoder: Masked Language Modeling
  <details>
- <summary>Click to expand <strong>encoder</strong> usage examples</summary>
+ <summary>Click to expand <strong>encoder</strong> usage examples</summary>

  ```python
  from transformers import AutoTokenizer, AutoModelForMaskedLM
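# Hedged completion of this truncated example: fill-mask with an encoder.
# The encoder id is an assumption following the family naming scheme.
import torch

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")

inputs = tokenizer(f"The capital of France is {tokenizer.mask_token}.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))  # top prediction for the mask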
@@ -296,7 +298,7 @@ print(f"Predictions: {predictions}")
  ### Decoder: Text Generation

  <details>
- <summary>Click to expand <strong>decoder text generation</strong></summary>
+ <summary>Click to expand <strong>decoder text generation</strong></summary>

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -783,7 +785,8 @@ def main():
          model.push_to_hub(run_name)
      except Exception:
          logging.error(
-             f"Error uploading model to the Hugging Face Hub:\n{traceback.format_exc()}To upload it manually, you can run "
+             f"Error uploading model to the Hugging Face Hub:\n"
+             f"{traceback.format_exc()}To upload it manually, you can run "
              f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
              f"and saving it using `model.push_to_hub('{run_name}')`."
          )
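# Illustrative note on the reflowed call above: Python implicitly concatenates
# adjacent string literals, so the split f-strings still form one log message.
assert "Error uploading " "model" == "Error uploading model"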
 