# EN-T5-Sci
A T5-base model continued-pretrained on a cleaned English scientific corpus derived from Unpaywall.
Checkpoint: pretraining_logs_lr_001_OPTIMIZED_clean_restart/.../step-487500-val_ppl-3.72168.ckpt (see conversion_info.json for provenance).
## Model Details
- Architecture: T5-base (12 encoder / 12 decoder layers, d_model=768, ~220M parameters)
- Objective: Span corruption (15% corruption rate, mean span length 3)
- Sequence prep: Sliding windows of 512 tokens with 50% overlap (see the sketch after this list)
- Optimizer: Adafactor with linear warmup (20k steps) followed by inverse square-root decay, lr=1e-3, gradient clipping at 1.0
- Hardware: 4× NVIDIA H100 (mixed precision, gradient accumulation 2, effective batch size 384)
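A minimal sketch of the sequence preparation described above, assuming the original T5 SentencePiece vocabulary via the Hugging Face tokenizer; the function name and exact chunking details are illustrative, not the thesis pipeline.

```python
from transformers import T5TokenizerFast

# Original T5 vocabulary, as used for pretraining (see Training Data below).
tokenizer = T5TokenizerFast.from_pretrained("t5-base")

def sliding_windows(text: str, window: int = 512, stride: int = 256):
    """Split a document into 512-token windows with 50% overlap (stride 256)."""
    ids = tokenizer(text, add_special_tokens=False).input_ids
    chunks = []
    for start in range(0, len(ids), stride):
        chunks.append(ids[start:start + window])
        if start + window >= len(ids):
            break
    return chunks
```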
## Training Data
Approximately 230 GB of English scientific text (~11 M documents), cleaned with DataTrove and custom regex rules (see the thesis section "Automatic Data Preprocessing"). Tokenization uses SentencePiece with the original T5 vocabulary.
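An illustrative, regex-based cleaning pass in the spirit of the rules mentioned above; the actual DataTrove pipeline and the thesis's regex rules are not reproduced here, and the patterns below are assumptions.

```python
import re

# Hypothetical boilerplate patterns; the thesis's actual regex rules differ.
BOILERPLATE = re.compile(r"©\s*\d{4}|all rights reserved|downloaded from", re.IGNORECASE)

def clean_document(text: str, min_chars: int = 200) -> str | None:
    """Drop boilerplate lines and discard documents that end up too short."""
    lines = [ln for ln in text.splitlines() if not BOILERPLATE.search(ln)]
    cleaned = "\n".join(lines).strip()
    return cleaned if len(cleaned) >= min_chars else None
```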
## Evaluation (Global-MMLU, zero-shot)
| Category | EN accuracy | DE accuracy |
|---|---|---|
| Overall | 0.2687 | 0.2688 |
| Humanities | 0.2419 | 0.2414 |
| STEM | 0.2851 | 0.2858 |
| Social Sciences | 0.3107 | 0.3107 |
| Other | 0.2510 | 0.2514 |
Full plots + per-subtask CSV: evaluation_results/scientific_crosslingual_transfer_eval_full_15k/.
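For reference, a hedged sketch of zero-shot multiple-choice scoring in the rank-classification style commonly used for MMLU-type benchmarks: each option is scored by its sequence loss under the model and the lowest-loss option is picked. The prompt template and option formatting are assumptions, not necessarily the harness that produced the numbers above.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

model_id = "rausch/en-t5-sci-continued-pretraining-487k"
tokenizer = T5TokenizerFast.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id).eval()

def pick_option(question: str, options: list[str]) -> int:
    """Return the index of the answer option with the lowest sequence loss."""
    enc = tokenizer(f"question: {question} answer:", return_tensors="pt")
    losses = []
    for option in options:
        labels = tokenizer(option, return_tensors="pt").input_ids
        with torch.no_grad():
            losses.append(model(**enc, labels=labels).loss.item())
    return min(range(len(options)), key=losses.__getitem__)
```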
## Intended Use
Zero-shot scientific QA and a warm start for fine-tuning on English scientific NLP tasks. Load with `T5ForConditionalGeneration.from_pretrained("rausch/en-t5-sci-continued-pretraining-487k")`; a minimal usage example follows.
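A minimal usage sketch: load the checkpoint from the Hub and run span infilling, which matches the span-corruption pretraining objective. The example sentence is illustrative.

```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

model_id = "rausch/en-t5-sci-continued-pretraining-487k"
tokenizer = T5TokenizerFast.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Mask a span with the first sentinel token and let the model fill it in.
text = "The mitochondrion is the <extra_id_0> of the eukaryotic cell."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```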
## Limitations
- Inherits the T5-base context length (512 tokens) and tokenizer.
- Evaluated only on Global-MMLU EN/DE; other tasks may require fine-tuning.
- The training corpus is English-only; no performance guarantees for other languages.
## Citation
Please cite the Bachelor’s thesis (link) and Raffel et al. (2020) for T5.