OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

OctoThinker-3B-Hybrid-Zero

The OctoThinker family starts from the Llama-3 family and applies carefully studied mid-training insights to create reinforcement learning–friendly base language models.

OctoThinker-3B-Hybrid-Zero is trained with R1-Zero-style reinforcement learning, starting directly from OctoThinker-3B-Hybrid-Base without any supervised fine-tuning (SFT).
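As a rough illustration of how a checkpoint like this could be queried, here is a minimal sketch using the Hugging Face `transformers` library. The repository id is the one shown on this page, and the weights are loaded in bfloat16 to match the BF16 tensor type listed for the model. The `build_prompt` helper and its plain "Question/Answer" template are assumptions made for illustration; the prompt format used during RL training is not stated in this card.

```python
MODEL_ID = "OctoThinker/OctoThinker-3B-Hybrid-Zero"  # repo id as shown on this page


def build_prompt(question: str) -> str:
    """Hypothetical plain-text prompt; the template used in training is not given here."""
    return f"Question: {question.strip()}\nAnswer:"


def generate(question: str, max_new_tokens: int = 512) -> str:
    """Greedy-decode a continuation for one question."""
    # Imported lazily so build_prompt can be used without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("What is 12 * 13?")` would download the weights on first use and return the model's continuation. Greedy decoding is chosen here only to keep the sketch deterministic; it is not a recommended setting from the authors.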

Training Recipe for OctoThinker-3B-Hybrid-Base

[Figure: Data Pipeline]

Evaluation Results of OctoThinker-3B-Base Series

Note that we evaluate these base language models with few-shot prompting.
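For readers unfamiliar with few-shot prompting, the idea is to prepend a handful of solved examples to the question so that a base model continues the pattern. The sketch below shows one way to assemble such a prompt. The function name, the "Question/Answer" template, and the exemplars are all hypothetical; the exact prompts and shot counts used for the reported results are not given in this card.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Concatenate solved exemplars, then append the unsolved question."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # The final block ends at "Answer:" so the model completes it.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical exemplars, for illustration only.
EXAMPLES = [
    ("What is 2 + 3?", "5"),
    ("What is 7 * 6?", "42"),
]

prompt = build_few_shot_prompt(EXAMPLES, "What is 9 - 4?")
print(prompt)
```

The resulting string can be passed to the model as-is; the model's continuation after the final "Answer:" is then compared against the reference answer.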

[Figure]

RL Training Dynamics of OctoThinker-3B-Zero Series

[Figure]

More about OctoThinker

[Figure]

Citation

Check out our paper for more details. If you use our models or datasets, or find our work useful, please cite:

@article{wang2025octothinker,
  title={OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling},
  author={Wang, Zengzhi and Zhou, Fan and Li, Xuefeng and Liu, Pengfei},
  year={2025},
  journal={arXiv preprint arXiv:2506.20512},
  note={Preprint}
}

Model size: 3.61B params
Tensor type: BF16