---
license: mit
language:
- en
pipeline_tag: fill-mask
---
# Ettin: Open Suite of Paired Encoders and Decoders

📄 [Paper](https://arxiv.org/abs/XXXX.XXXXX) | 🚀 [GitHub Repository](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)

This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.

## Model Description

Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons, which were confounded by differences in training data, architecture, and recipe, Ettin models use:

1. **Identical training data** - Same high-quality mixture across all models
2. **Open training data** - The full mixture is publicly released, along with batch-level data order for each of the 250+ checkpoints
3. **Matched architectures** - Differing only in attention pattern (bidirectional vs. causal) and training objective (MLM vs. CLM)
4. **Consistent training recipe** - Three-phase training over 2T tokens
5. **Multiple scales** - From 17M to 1B parameters

This approach allows true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture; the short sketch below loads a matched pair side by side.

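
A minimal sketch of such a pairing, assuming a recent `transformers` release that ships the Ettin architectures (checkpoint names are taken from the tables below; exact parameter counts may differ slightly because of objective-specific heads):

```python
import torch
from transformers import AutoModel

# Load the paired 17M checkpoints: same data, recipe, and backbone shape,
# differing only in attention pattern and training objective.
encoder = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-17m")
decoder = AutoModel.from_pretrained("jhu-clsp/ettin-decoder-17m")

def n_params(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

# The paired backbones should be close in size.
print(f"encoder: {n_params(encoder):,} parameters")
print(f"decoder: {n_params(decoder):,} parameters")
```
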
## Training Data

The training data is publicly available and split by training phase; a short loading example follows the list:

- **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of a diverse data mixture
- **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
- **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
- **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: `input_ids`, `step`)

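
These are standard Hugging Face datasets and can be streamed with the `datasets` library. A minimal loading sketch (the `train` split name and streaming access are assumptions; check the dataset cards for the exact configuration):

```python
from datasets import load_dataset

# Stream the pre-training mixture instead of downloading ~1.7T tokens up front.
pretrain = load_dataset("jhu-clsp/ettin-pretraining-data", split="train", streaming=True)
print(next(iter(pretrain)))

# The data-order dataset records the batch-level order (columns `input_ids`
# and `step`), so the exact batches seen by a checkpoint can be replayed.
order = load_dataset("jhu-clsp/ettin-data-order", split="train", streaming=True)
example = next(iter(order))
print(example["step"], len(example["input_ids"]))
```
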
## Model Family

### Encoder Models

| Model | Parameters | Description |
|:------|:-----------|:------------|
| [ettin-encoder-17m](https://huggingface.co/jhu-clsp/ettin-encoder-17m) | 17M | Extra extra small encoder model |
| [ettin-encoder-32m](https://huggingface.co/jhu-clsp/ettin-encoder-32m) | 32M | Extra small encoder model |
| [ettin-encoder-68m](https://huggingface.co/jhu-clsp/ettin-encoder-68m) | 68M | Small encoder model |
| [ettin-encoder-150m](https://huggingface.co/jhu-clsp/ettin-encoder-150m) | 150M | Base encoder model |
| [ettin-encoder-400m](https://huggingface.co/jhu-clsp/ettin-encoder-400m) | 400M | Large encoder model |
| [ettin-encoder-1b](https://huggingface.co/jhu-clsp/ettin-encoder-1b) | 1B | Extra large encoder model |

### Decoder Models

| Model | Parameters | Description |
|:------|:-----------|:------------|
| [ettin-decoder-17m](https://huggingface.co/jhu-clsp/ettin-decoder-17m) | 17M | Extra extra small decoder model |
| [ettin-decoder-32m](https://huggingface.co/jhu-clsp/ettin-decoder-32m) | 32M | Extra small decoder model |
| [ettin-decoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-68m) | 68M | Small decoder model |
| [ettin-decoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-150m) | 150M | Base decoder model |
| [ettin-decoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-400m) | 400M | Large decoder model |
| [ettin-decoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-1b) | 1B | Extra large decoder model |

### Cross-Objective Models

#### Encoders Trained from Decoders (Decoder → MLM)

| Model | Parameters | Description |
|:------|:-----------|:------------|
| [ettin-encoder-from-decoder-17m](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-17m) | 17M | Decoder further trained with the MLM objective |
| [ettin-encoder-from-decoder-32m](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-32m) | 32M | Decoder further trained with the MLM objective |
| [ettin-encoder-from-decoder-68m](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-68m) | 68M | Decoder further trained with the MLM objective |
| [ettin-encoder-from-decoder-150m](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-150m) | 150M | Decoder further trained with the MLM objective |
| [ettin-encoder-from-decoder-400m](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-400m) | 400M | Decoder further trained with the MLM objective |
| [ettin-encoder-from-decoder-1b](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-1b) | 1B | Decoder further trained with the MLM objective |

#### Decoders Trained from Encoders (Encoder → CLM)

| Model | Parameters | Description |
|:------|:-----------|:------------|
| [ettin-decoder-from-encoder-17m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) | 17M | Encoder further trained with the CLM objective |
| [ettin-decoder-from-encoder-32m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) | 32M | Encoder further trained with the CLM objective |
| [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder further trained with the CLM objective |
| [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder further trained with the CLM objective |
| [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder further trained with the CLM objective |
| [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) | 1B | Encoder further trained with the CLM objective |

## Usage

### Encoder Models (Classification/Retrieval/MLM)

```python
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/{MODEL_NAME}")
model = AutoModel.from_pretrained("jhu-clsp/{MODEL_NAME}")

# Example: text classification/embeddings
def encode_text(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token representation
    embeddings = outputs.last_hidden_state[:, 0, :]
    return embeddings

# Example: masked language modeling
mlm_model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/{MODEL_NAME}")

def predict_masked_token(text):
    # The text should contain a [MASK] token
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = mlm_model(**inputs)
        predictions = outputs.logits

    # Gather logits at the masked positions
    masked_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
    masked_predictions = predictions[masked_indices]

    # Top-5 predictions for the first masked token
    top_predictions = torch.topk(masked_predictions, 5, dim=-1)
    predicted_tokens = [tokenizer.decode(token_id) for token_id in top_predictions.indices[0]]

    return predicted_tokens

# Example usage
text = "This is a sample text for encoding."
embeddings = encode_text(text)
print(f"Embedding shape: {embeddings.shape}")

# MLM example
masked_text = "The capital of France is [MASK]."
predictions = predict_masked_token(masked_text)
print(f"Predictions: {predictions}")
```

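
For masked-token prediction, the higher-level `pipeline` API is an alternative to the manual loop above (a short sketch; the `top_k` value is arbitrary):

```python
from transformers import pipeline

# Fill-mask pipeline wrapping the same encoder checkpoint
fill_mask = pipeline("fill-mask", model="jhu-clsp/{MODEL_NAME}")

for pred in fill_mask("The capital of France is [MASK].", top_k=5):
    print(pred["token_str"], round(pred["score"], 3))
```
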
### Decoder Models (Text Generation)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/{MODEL_NAME}")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/{MODEL_NAME}")

# Set the pad token if it is not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Generate text
def generate_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_length=max_length,
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Example usage
prompt = "The future of artificial intelligence is"
generated = generate_text(prompt)
print(generated)
```

## Training Details

**Data:** High-quality mixture including DCLM, Dolma v1.7, scientific papers, code, and curated sources, totaling 2T+ tokens

**Architecture:** Transformer with RoPE, GLU activations, and prenorm layers

**Training Phases:**
- **Pre-training**: 1.7T tokens with a diverse data mixture
- **Mid-training**: 250B tokens of higher-quality filtered data, with context extension to 8K
- **Decay phase**: 100B tokens from premium data sources

**Key Features:**
- Context length: up to 8K tokens
- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- Deep but efficient architectures following MobileLLM principles

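
These settings can be verified from the released configs. A quick check, assuming the standard `transformers` config attribute names (the checkpoint below is just one example from the family):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("jhu-clsp/ettin-encoder-150m")

print(config.vocab_size)               # expected to match the 50,368-token vocabulary above
print(config.max_position_embeddings)  # expected to reflect the 8K context length
print(config.num_hidden_layers, config.hidden_size)  # should line up with the table below
```
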
## Model Architecture

| Parameter | 17M | 32M | 68M | 150M | 400M | 1B |
|:----------|:----|:----|:----|:-----|:-----|:---|
| Layers | 7 | 10 | 19 | 22 | 28 | 28 |
| Hidden Size | 256 | 384 | 512 | 768 | 1024 | 1792 |
| Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
| Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |

## Citation

If you use Ettin models in your research, please cite our work:

```bibtex
@misc{weller2025seqvsseq,
      title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders},
      author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={XXXX.XXXXX},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/XXXX.XXXXX},
}
```