nielsr (HF Staff) committed
Commit 7214475 · verified · 1 Parent(s): 77ab863

Improve model card: update pipeline tag and add library name


This PR enhances the model card for the Ettin encoder model by:

* Updating the `pipeline_tag` from `fill-mask` to `feature-extraction`. This more accurately reflects the primary capability of this encoder model, which excels at tasks like classification and retrieval through feature extraction, and improves its discoverability on the Hugging Face Hub (see the usage sketch below).
* Adding `library_name: transformers`. This makes the model's compatibility with the Hugging Face `transformers` library explicit and enables the "Use in Transformers" badge and quick inference snippets on the Hub.
* Adding relevant `tags` such as `modernbert` and `encoder` for better discoverability and accurate categorization of the model.
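
For context on the new `feature-extraction` tag, here is a minimal sketch of pulling sentence embeddings from the encoder with `transformers`. The checkpoint id `jhu-clsp/ettin-encoder-150m` and the mean-pooling step are illustrative assumptions, not something this PR adds:

```python
# Minimal sketch: sentence embeddings from an Ettin encoder via transformers.
# Assumptions (not part of this PR): the repo id below and mean pooling over
# the last hidden state; requires a transformers version with ModernBERT support.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/ettin-encoder-150m"  # assumed encoder checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

sentences = ["Ettin pairs encoders and decoders.", "Feature extraction example."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```

Any pooling strategy would do here; the point is simply that the encoder's hidden states, rather than its `fill-mask` head, are the primary interface for these downstream tasks.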

Files changed (1)
  1. README.md +47 -41
README.md CHANGED
@@ -1,9 +1,14 @@
 ---
-license: mit
 language:
 - en
-pipeline_tag: fill-mask
+license: mit
+pipeline_tag: feature-extraction
+library_name: transformers
+tags:
+- modernbert
+- encoder
 ---
+
 # Ettin: an Open Suite of Paired Encoders and Decoders
 
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -82,11 +87,11 @@ model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")
 
 Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:
 
-1. **Identical training data** - Same high-quality mixture across all models
-2. **Open Training Data** - Data is available now with batch-level training data for each of the 250+ checkpoints
-3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
-4. **Consistent training recipe** - Three-phase training with 2T tokens
-5. **Multiple scales** - From 17M to 1B parameters
+1. **Identical training data** - Same high-quality mixture across all models
+2. **Open Training Data** - Data is available now with batch-level training data for each of the 250+ checkpoints
+3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
+4. **Consistent training recipe** - Three-phase training with 2T tokens
+5. **Multiple scales** - From 17M to 1B parameters
 
 This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.
 
@@ -94,10 +99,10 @@ This approach allows for true apples-to-apples comparisons between encoder and d
 
 The training data is publicly available and split across different phases:
 
-- **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
-- **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
-- **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
-- **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step)
+- **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
+- **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
+- **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
+- **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step)
 
 ## Model Family
 
@@ -146,10 +151,10 @@ These models demonstrate what happens when you continue training encoders as de
 |:-----|:------|:-----------|:------------|:---------|
 | XXS | [ettin-decoder-from-encoder-17m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) | 17M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) |
 | XS | [ettin-decoder-from-encoder-32m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) | 32M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) |
-| Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) |
-| Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) |
-| Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) |
-| XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) | 1B | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) |
+| Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-68m) |
+| Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-150m) |
+| Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-400m) |
+| XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-1b) | 1B | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-1b) |
 
 **Example Usage for Cross-Objective Models:**
 ```python
@@ -174,9 +179,9 @@ All raw training checkpoints are available in the [jhu-clsp/ettin-checkpoints](h
 #### HuggingFace Format Checkpoints
 Each model repository contains multiple tagged versions representing different training stages:
 
-- **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
-- **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
-- **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)
+- **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
+- **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
+- **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -209,27 +214,27 @@ This checkpoint availability enables detailed analysis of training dynamics, los
 
 Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:
 
-- **Identical Training Data**: Same 2T token mixture across all models
-- **Matched Architectures**: Only attention patterns and objectives differ
-- **Open Everything**: Training data, model weights, and batch-level training order
-- **Multiple Scales**: Fair comparison from 17M to 1B parameters
-- **250+ Checkpoints**: Complete training trajectory analysis
+- **Identical Training Data**: Same 2T token mixture across all models
+- **Matched Architectures**: Only attention patterns and objectives differ
+- **Open Everything**: Training data, model weights, and batch-level training order
+- **Multiple Scales**: Fair comparison from 17M to 1B parameters
+- **250+ Checkpoints**: Complete training trajectory analysis
 
 ### Use Cases for Researchers
 
-- **Architecture Studies**: Compare encoder vs decoder capabilities fairly
-- **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
-- **Scaling Laws**: Study how architectural advantages change with scale
-- **Transfer Learning**: Investigate cross-objective training effectiveness
-- **Replication Studies**: First open replication of ModernBERT training recipe
+- **Architecture Studies**: Compare encoder vs decoder capabilities fairly
+- **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
+- **Scaling Laws**: Study how architectural advantages change with scale
+- **Transfer Learning**: Investigate cross-objective training effectiveness
+- **Replication Studies**: First open replication of ModernBERT training recipe
 
 ### Reproducibility
 
 All training artifacts are publicly available:
-- Training data with exact batch ordering
-- Model checkpoints every 8.5B tokens
-- Complete hyperparameter configurations
-- Training code and evaluation scripts
+- Training data with exact batch ordering
+- Model checkpoints every 8.5B tokens
+- Complete hyperparameter configurations
+- Training code and evaluation scripts
 
 ## Training Details
 
@@ -238,14 +243,14 @@ All training artifacts are publicly available:
 **Architecture:** Transformer with RoPE, GLU activations, and prenorm layers
 
 **Training Phases:**
-- **Pre-training**: 1.7T tokens with diverse data mixture
-- **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
-- **Decay phase**: 100B tokens with premium data sources
+- **Pre-training**: 1.7T tokens with diverse data mixture
+- **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
+- **Decay phase**: 100B tokens with premium data sources
 
 **Key Features:**
-- Context length: Up to 8K tokens
-- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
-- Deep but efficient architectures following MobileLLM principles
+- Context length: Up to 8K tokens
+- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
+- Deep but efficient architectures following MobileLLM principles
 
 ## Model Architecture
 
@@ -262,7 +267,7 @@ All training artifacts are publicly available:
 
 ### Encoder: Masked Language Modeling
 <details>
-<summary>Click to expand <strong>encoder</strong> usage examples</summary>
+<summary>Click to expand **encoder** usage examples</summary>
 
 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
@@ -296,7 +301,7 @@ print(f"Predictions: {predictions}")
 ### Decoder: Text Generation
 
 <details>
-<summary>Click to expand <strong>decoder text generation</strong></summary>
+<summary>Click to expand **decoder text generation**</summary>
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -783,7 +788,8 @@ def main():
 model.push_to_hub(run_name)
 except Exception:
 logging.error(
-f"Error uploading model to the Hugging Face Hub:\n{traceback.format_exc()}To upload it manually, you can run "
+f"Error uploading model to the Hugging Face Hub:
+{traceback.format_exc()}To upload it manually, you can run "
 f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
 f"and saving it using `model.push_to_hub('{run_name}')`."
 )
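
Separately, the checkpoint-tag scheme visible in the diff above (`step{number}`, `ext{number}`, `decay{number}`) can be exercised through the `revision` argument of `from_pretrained`. A minimal sketch, assuming the encoder repository id below and the `step599525` tag listed in the README:

```python
# Sketch: loading one tagged training-stage checkpoint via the Hub `revision` argument.
# The repo id is an illustrative assumption; the tag value comes from the README's examples.
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo_id = "jhu-clsp/ettin-encoder-150m"  # assumed encoder repository
revision = "step599525"                  # pretraining-phase tag from the README

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForMaskedLM.from_pretrained(repo_id, revision=revision)
print(type(model).__name__)
```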