Improve model card: update pipeline tag and add library name
This PR enhances the model card for the Ettin encoder model by:
* Updating the `pipeline_tag` from `fill-mask` to `feature-extraction`. This more accurately reflects the primary capability of this encoder model, which excels at tasks such as classification and retrieval through feature extraction, and improves its discoverability on the Hugging Face Hub (see the usage sketch after this list).
* Adding `library_name: transformers`. This ensures compatibility with the Hugging Face `transformers` library is clearly indicated and enables the "Use in Transformers" badge and quick inference snippets on the Hub.
* Adding relevant `tags` such as `modernbert` and `encoder` for better discoverability and accurate categorization of the model.
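For context on the `pipeline_tag` change, here is a minimal feature-extraction sketch of the kind of usage the new tag advertises. The encoder repo id (`jhu-clsp/ettin-encoder-150m`) and the mean-pooling step are illustrative assumptions, not part of this PR:

```python
# Hedged sketch: sentence embeddings from an Ettin encoder via plain feature extraction.
# The repo id and pooling choice are assumptions for illustration only.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/ettin-encoder-150m"  # assumed encoder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["Ettin encoders are suited to classification and retrieval."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean-pool over non-padding tokens to get one embedding per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (1, hidden_size)
```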
README.md CHANGED

@@ -1,9 +1,14 @@
 ---
-license: mit
 language:
 - en
+license: mit
+pipeline_tag: feature-extraction
+library_name: transformers
+tags:
+- modernbert
+- encoder
 ---
 
 # Ettin: an Open Suite of Paired Encoders and Decoders
 
 [](https://opensource.org/licenses/MIT)
@@ -82,11 +87,11 @@ model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")
 
 Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:
 
-1.
-2.
-3.
-4.
-5.
+1. **Identical training data** - Same high-quality mixture across all models
+2. **Open Training Data** - Data is available now with batch-level training data for each of the 250+ checkpoints
+3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
+4. **Consistent training recipe** - Three-phase training with 2T tokens
+5. **Multiple scales** - From 17M to 1B parameters
 
 This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.
 
@@ -94,10 +99,10 @@ This approach allows for true apples-to-apples comparisons between encoder and d
 
 The training data is publicly available and split across different phases:
 
--
--
--
--
+- **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
+- **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
+- **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
+- **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step)
 
 ## Model Family
 
@@ -146,10 +151,10 @@ These models demonstrate what happens when you continue training encoders as dec
 |:-----|:------|:-----------|:------------|:---------|
 | XXS | [ettin-decoder-from-encoder-17m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) | 17M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) |
 | XS | [ettin-decoder-from-encoder-32m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) | 32M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) |
-| Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-
-| Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-
-| Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-
-| XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-
+| Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-68m) |
+| Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-150m) |
+| Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-400m) |
+| XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-1b) | 1B | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-1b) |
 
 **Example Usage for Cross-Objective Models:**
 ```python
@@ -174,9 +179,9 @@ All raw training checkpoints are available in the [jhu-clsp/ettin-checkpoints](h
 #### HuggingFace Format Checkpoints
 Each model repository contains multiple tagged versions representing different training stages:
 
--
--
--
+- **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
+- **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
+- **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -209,27 +214,27 @@ This checkpoint availability enables detailed analysis of training dynamics, los
 
 Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:
 
--
--
--
--
--
+- **Identical Training Data**: Same 2T token mixture across all models
+- **Matched Architectures**: Only attention patterns and objectives differ
+- **Open Everything**: Training data, model weights, and batch-level training order
+- **Multiple Scales**: Fair comparison from 17M to 1B parameters
+- **250+ Checkpoints**: Complete training trajectory analysis
 
 ### Use Cases for Researchers
 
--
--
--
--
--
+- **Architecture Studies**: Compare encoder vs decoder capabilities fairly
+- **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
+- **Scaling Laws**: Study how architectural advantages change with scale
+- **Transfer Learning**: Investigate cross-objective training effectiveness
+- **Replication Studies**: First open replication of ModernBERT training recipe
 
 ### Reproducibility
 
 All training artifacts are publicly available:
--
--
--
--
+- Training data with exact batch ordering
+- Model checkpoints every 8.5B tokens
+- Complete hyperparameter configurations
+- Training code and evaluation scripts
 
 ## Training Details
 
@@ -238,14 +243,14 @@ All training artifacts are publicly available:
 **Architecture:** Transformer with RoPE, GLU activations, and prenorm layers
 
 **Training Phases:**
--
--
--
+- **Pre-training**: 1.7T tokens with diverse data mixture
+- **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
+- **Decay phase**: 100B tokens with premium data sources
 
 **Key Features:**
--
--
--
+- Context length: Up to 8K tokens
+- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
+- Deep but efficient architectures following MobileLLM principles
 
 ## Model Architecture
 
@@ -262,7 +267,7 @@ All training artifacts are publicly available:
 
 ### Encoder: Masked Language Modeling
 <details>
-<summary>Click to expand
+<summary>Click to expand **encoder** usage examples</summary>
 
 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
@@ -296,7 +301,7 @@ print(f"Predictions: {predictions}")
 ### Decoder: Text Generation
 
 <details>
-<summary>Click to expand
+<summary>Click to expand **decoder text generation**</summary>
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -783,7 +788,8 @@ def main():
 model.push_to_hub(run_name)
 except Exception:
 logging.error(
-f"Error uploading model to the Hugging Face Hub
+f"Error uploading model to the Hugging Face Hub:
+{traceback.format_exc()}To upload it manually, you can run "
 f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
 f"and saving it using `model.push_to_hub('{run_name}')`."
 )
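The tagged training-stage checkpoints described in the updated card (`step{number}`, `ext{number}`, `decay{number}`) map onto the `revision` argument of `from_pretrained`. A minimal sketch, assuming the `step599525` tag exists on `jhu-clsp/ettin-decoder-150m` as the card's examples suggest:

```python
# Hedged sketch: load a specific training-stage checkpoint by its git tag.
# Repo id and tag name are taken from the card's examples and assumed to exist.
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "jhu-clsp/ettin-decoder-150m"
tag = "step599525"  # pretraining-phase tag; "ext1000" or "decay100" follow the same pattern

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=tag)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=tag)
```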