GPT-2 70M - Optimal Dataset Mixing

A 70M parameter GPT-2 model trained on 1 billion tokens using an optimized 50-30-20 dataset mixing strategy.

Model Description

This model demonstrates the effectiveness of careful dataset composition for efficient language model pretraining. Despite using 10x less training data than GPT-2 (1B vs 10B tokens), it achieves competitive performance by leveraging an optimal mixture of high-quality data sources.

Architecture: GPT-2

  • Parameters: 70M (64.09M trainable)
  • Layers: 12
  • Hidden Size: 512
  • Attention Heads: 8
  • Context Length: 1024 tokens
  • Vocabulary Size: 50,257

Training Data

The model was trained on 1 billion tokens with the following composition:

  • 50% - FinePDFs (500M tokens): High-quality PDF content
  • 30% - DCLM Baseline (300M tokens): Filtered web content
  • 20% - FineWeb-Edu (200M tokens): Educational web content

This 50-30-20 mixing ratio was identified through systematic experimentation as optimal for balanced performance across multiple domains.

Training Details

  • Total Tokens: 1,000,000,000
  • Batch Size: 24 (effective: 120 with gradient accumulation)
  • Learning Rate: 5e-4 → 5e-5 (cosine decay)
  • Warmup Steps: 162 (2% of total)
  • Precision: BFloat16
  • Optimizer: AdamW
  • Final Loss: 2.92

Benchmark Results

Performance Comparison

Benchmark Our Model Random GPT-2 vs Random vs GPT-2
MMLU (5-shot) 24.11% 25.00% 26.00% -0.89% -1.89%
HellaSwag (0-shot) 27.03% 25.00% 30.00% +2.03% -2.97%
ARC-Challenge (0-shot) 21.67% 25.00% 24.00% -3.33% -2.33%
PIQA (0-shot) 57.29% 50.00% 63.00% +7.29% -5.71%
WinoGrande (0-shot) 51.46% 50.00% 51.00% +1.46% +0.46%
TruthfulQA MC2 (0-shot) 47.31% 25.00% 40.00% +22.31% +7.31%
Average 38.15% 33.33% 39.00% +4.81% -0.85%

Key Findings

  • Performance Gap: Only 0.85% behind GPT-2 baseline (39.00%)
  • Efficiency: Achieves 84.9% of GPT-2's performance improvement over random guessing
  • Data Efficiency: Competitive results with 10x less training data
  • TruthfulQA Excellence: +7.31% above GPT-2 baseline, demonstrating superior factual accuracy

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/gpt-2-70m")
model = AutoModelForCausalLM.from_pretrained("codelion/gpt-2-70m")

# Generate text with better sampling parameters
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,           # Enable sampling
    temperature=0.8,          # Control randomness
    top_p=0.9,               # Nucleus sampling
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))

Key Insights

  1. Data Quality > Quantity: The 50-30-20 mixing strategy demonstrates that careful dataset composition can achieve strong performance with significantly reduced compute
  2. Factual Accuracy: The model excels at truthfulness (TruthfulQA), likely due to high-quality FinePDF content (50%)
  3. Practical Commonsense: Strong performance on PIQA and WinoGrande shows effective real-world reasoning
  4. Knowledge Gaps: Below-random performance on MMLU and ARC-Challenge indicates insufficient academic/scientific knowledge for this scale

Limitations

  • Academic Knowledge: Limited performance on academic benchmarks (MMLU, ARC-Challenge)
  • Training Scale: 1B tokens is insufficient for comprehensive world knowledge
  • Parameter Count: 70M parameters may limit capacity for complex reasoning

Citation

If you use this model/dataset, please cite:

@article{sharma2025billion,
  title={The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author={Sharma, Asankhaya},
  year={2025},
  url={https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}

For more details, see the blog post.

Model Card Authors

codelion

Model Card Contact

For questions or issues, please open an issue on the model repository.

Downloads last month
1,171
Safetensors
Model size
64.1M params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Datasets used to train codelion/gpt-2-70m

Collection including codelion/gpt-2-70m

Evaluation results