This model is a fine-tuned version of ns-0/qwen-2.5-1.5b-instruct-reasoning-sft trained on the allenai/RLVR-GSM-MATH-IF-Mixed-Constraints dataset. It is specifically adapted to improve performance on the mathematical reasoning and instruction-following tasks covered by that dataset.
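As a sketch of how the model might be used for inference with `transformers` (the chat-template call follows standard Qwen2.5 usage and is an assumption, not taken from this card; the `build_messages` helper is illustrative):

```python
def build_messages(problem):
    """Wrap a math problem in a minimal single-turn chat message list."""
    return [{"role": "user", "content": problem}]

if __name__ == "__main__":
    # Heavy imports kept out of module scope so the helper stays importable.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "sauravlmx/qwen-2.5-1.5b-rlvr-ppo"
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)

    # Render the chat prompt, then generate and decode only the new tokens.
    prompt = tokenizer.apply_chat_template(
        build_messages("What is 12 * 7?"),
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True))
```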
Model Details
- Base Model: ns-0/qwen-2.5-1.5b-instruct-reasoning-sft
- Dataset: allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
- Training Method: Reinforcement Learning from Verifiable Rewards (RLVR) using a PPO (Proximal Policy Optimization) trainer.
- Framework: allenai/open-instruct (https://github.com/allenai/open-instruct/tree/main), using Ray and vLLM.
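RLVR replaces a learned reward model with a programmatic check of the model's answer. A minimal sketch of such a verifiable reward for GSM8K-style math problems (the number-extraction heuristic and function names are illustrative assumptions, not the exact open-instruct implementation):

```python
import re

def extract_final_answer(text):
    """Pull the last number from a completion (illustrative heuristic)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def verifiable_reward(completion, gold_answer):
    """Binary reward: 1.0 iff the extracted answer matches the gold answer."""
    pred = extract_final_answer(completion)
    return 1.0 if pred is not None and pred == gold_answer.strip() else 0.0
```

In PPO training, this scalar reward scores each sampled completion in place of a reward-model forward pass.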
Intermediate Checkpoints
- main branch: the final, fully trained model (400K episodes).
- step-* branches (e.g., step-100, step-200): intermediate checkpoints saved at the corresponding training step, useful for research into model training dynamics.
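An intermediate checkpoint can be pulled by passing its branch name as the `revision` argument to `from_pretrained` (standard Hugging Face Hub behavior; the helper function is illustrative):

```python
def checkpoint_branch(step):
    """Branch name for the checkpoint saved at a given training step."""
    return f"step-{step}"

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM

    # Loads the checkpoint from the step-200 branch of the repo.
    model = AutoModelForCausalLM.from_pretrained(
        "sauravlmx/qwen-2.5-1.5b-rlvr-ppo",
        revision=checkpoint_branch(200),
    )
```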
Training logs: https://wandb.ai/sauravpanigrahi-/open_instruct_internal
Model tree for sauravlmx/qwen-2.5-1.5b-rlvr-ppo
- Base model: Qwen/Qwen2.5-1.5B
- Finetuned: Qwen/Qwen2.5-1.5B-Instruct