🧠 Parallel-R1-Unseen_Step_200

Mid-Training Checkpoint of Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Stage: After 200 RL steps with alternating rewards. This checkpoint shows the adaptive parallel reasoning ability and serves as the structure exploration stage.

This checkpoint is intended to help you reproduce the experimental results in Section 4.5 of the paper, "Extra Bonus: Parallel Thinking as a Mid-Training Exploration Strategy for RL Training."
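As a starting point for reproduction, here is a minimal sketch of loading the checkpoint with Hugging Face Transformers. It assumes the repository loads as a standard causal language model through `AutoModelForCausalLM`, which this page does not confirm. The helper names `load_checkpoint` and `generate`, and the `max_new_tokens` default, are illustrative choices rather than part of the release.

```python
# Repository id as it appears on this model page.
REPO_ID = "Parallel-R1/Parallel-R1-Unseen_Step_200"


def load_checkpoint(repo_id: str = REPO_ID):
    """Download the tokenizer and weights (assumes a causal-LM architecture)."""
    # Imported lazily so this file can be read or imported without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # torch_dtype="auto" keeps the dtype stored in the checkpoint config.
    model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 512) -> str:
    """Generate a completion and return only the newly produced text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the continuation is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `tokenizer, model = load_checkpoint()` followed by `generate(tokenizer, model, "...")` downloads the weights on first use. The prompt format expected by the checkpoint is not stated on this page, so check the paper or training code before comparing outputs.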


Format: Safetensors
Model size: 4B params
Tensor type: BF16


