GPT-2 70M - Optimal Dataset Mixing

A 70M parameter GPT-2 model trained on 1 billion tokens using an optimized 50-30-20 dataset mixing strategy.

Model Description

This model demonstrates the effectiveness of careful dataset composition for efficient language model pretraining. Despite using 10x less training data than GPT-2 (1B vs 10B tokens), it achieves competitive performance by leveraging an optimal mixture of high-quality data sources.

Architecture: GPT-2

Parameters: 70M (64.09M trainable)
Layers: 12
Hidden Size: 512
Attention Heads: 8
Context Length: 1024 tokens
Vocabulary Size: 50,257

Training Data

The model was trained on 1 billion tokens with the following composition:

50% - FinePDFs (500M tokens): High-quality PDF content
30% - DCLM Baseline (300M tokens): Filtered web content
20% - FineWeb-Edu (200M tokens): Educational web content

This 50-30-20 mixing ratio was identified through systematic experimentation as optimal for balanced performance across multiple domains.

Training Details

Total Tokens: 1,000,000,000
Batch Size: 24 (effective: 120 with gradient accumulation)
Learning Rate: 5e-4 → 5e-5 (cosine decay)
Warmup Steps: 162 (2% of total)
Precision: BFloat16
Optimizer: AdamW
Final Loss: 2.92

Benchmark Results

Performance Comparison

Benchmark	Our Model	Random	GPT-2	vs Random	vs GPT-2
MMLU (5-shot)	24.11%	25.00%	26.00%	-0.89%	-1.89%
HellaSwag (0-shot)	27.03%	25.00%	30.00%	+2.03%	-2.97%
ARC-Challenge (0-shot)	21.67%	25.00%	24.00%	-3.33%	-2.33%
PIQA (0-shot)	57.29%	50.00%	63.00%	+7.29%	-5.71%
WinoGrande (0-shot)	51.46%	50.00%	51.00%	+1.46%	+0.46%
TruthfulQA MC2 (0-shot)	47.31%	25.00%	40.00%	+22.31%	+7.31%
Average	38.15%	33.33%	39.00%	+4.81%	-0.85%

Key Findings

Performance Gap: Only 0.85% behind GPT-2 baseline (39.00%)
Efficiency: Achieves 84.9% of GPT-2's performance improvement over random guessing
Data Efficiency: Competitive results with 10x less training data
TruthfulQA Excellence: +7.31% above GPT-2 baseline, demonstrating superior factual accuracy

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/gpt-2-70m")
model = AutoModelForCausalLM.from_pretrained("codelion/gpt-2-70m")

# Generate text with better sampling parameters
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,           # Enable sampling
    temperature=0.8,          # Control randomness
    top_p=0.9,               # Nucleus sampling
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))

Key Insights

Data Quality > Quantity: The 50-30-20 mixing strategy demonstrates that careful dataset composition can achieve strong performance with significantly reduced compute
Factual Accuracy: The model excels at truthfulness (TruthfulQA), likely due to high-quality FinePDF content (50%)
Practical Commonsense: Strong performance on PIQA and WinoGrande shows effective real-world reasoning
Knowledge Gaps: Below-random performance on MMLU and ARC-Challenge indicates insufficient academic/scientific knowledge for this scale

Limitations

Academic Knowledge: Limited performance on academic benchmarks (MMLU, ARC-Challenge)
Training Scale: 1B tokens is insufficient for comprehensive world knowledge
Parameter Count: 70M parameters may limit capacity for complex reasoning

Citation

If you use this model/dataset, please cite:

@article{sharma2025billion,
  title={The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author={Sharma, Asankhaya},
  year={2025},
  url={https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}

For more details, see the blog post.

Model Card Authors

codelion

Model Card Contact

For questions or issues, please open an issue on the model repository.

Downloads last month: 1,171

Safetensors

Model size

64.1M params

Tensor type

F32

Datasets used to train codelion/gpt-2-70m

Collection including codelion/gpt-2-70m

Pre-training Dataset Samples

Collection

A collection of pre-training datasets samples of sizes 10M, 100M and 1B tokens. Ideal for use in quick experimentation and ablations. • 19 items • Updated about 12 hours ago • 12

Evaluation results

MMLU (5-shot)
self-reported

24.110
HellaSwag (0-shot)
self-reported

27.030
ARC-Challenge (0-shot)
self-reported

21.670
PIQA (0-shot)
self-reported

57.290
WinoGrande (0-shot)
self-reported

51.460
TruthfulQA MC2 (0-shot)
self-reported

47.310
Average
self-reported

38.150

Metadata error: specify a dataset to view leaderboard