This model is a fine-tuned version of ns-0/qwen-2.5-1.5b-instruct-reasoning-sft, trained on the allenai/RLVR-GSM-MATH-IF-Mixed-Constraints dataset. It is adapted to improve performance on the mathematical reasoning and instruction-following tasks that the dataset covers.

Model Details

  • Base Model: ns-0/qwen-2.5-1.5b-instruct-reasoning-sft
  • Dataset: allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
  • Training Method: Reinforcement Learning from Verifiable Rewards (RLVR) using a PPO (Proximal Policy Optimization) trainer.
  • Framework: open-instruct (https://github.com/allenai/open-instruct/tree/main) with Ray and vLLM.
  • Model Size: 1.54B parameters (Safetensors)
  • Tensor Type: BF16
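
For quick use, here is a minimal inference sketch with Hugging Face transformers. The repo id is taken from this card; the prompt is an illustrative GSM8K-style question, not a fixed benchmark item.

```python
# Minimal inference sketch for this model (sauravlmx/qwen-2.5-1.5b-rlvr-ppo).
# The example prompt below is hypothetical, in the style of GSM8K.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sauravlmx/qwen-2.5-1.5b-rlvr-ppo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Natalia sold 48 clips in April and half as many in May. "
                   "How many clips did she sell in total?",
    }
]
# Build the chat-formatted prompt and generate a response.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```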

Intermediate Checkpoints

  • main branch: contains the final, fully trained model (400K training episodes).
  • step-* branches: each branch named step-* (e.g., step-100, step-200) contains an intermediate checkpoint saved at that training step. These are useful for research into training dynamics, such as how performance evolves over the course of RLVR training; a loading sketch follows this list.
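
Since each intermediate checkpoint lives on its own branch, it can be loaded by passing the branch name as the `revision` argument in transformers. A short sketch, using the step-100 branch named above (the branches that actually exist may differ; check the repository's branch list):

```python
# Load an intermediate checkpoint from a step-* branch via `revision`.
# "step-100" is the example branch named above; verify it exists first.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "sauravlmx/qwen-2.5-1.5b-rlvr-ppo"
tokenizer = AutoTokenizer.from_pretrained(repo_id, revision="step-100")
model = AutoModelForCausalLM.from_pretrained(repo_id, revision="step-100")
```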

Training logs are available here: https://wandb.ai/sauravpanigrahi-/open_instruct_internal


Model Tree

  • Lineage: Qwen/Qwen2.5-1.5B → ns-0/qwen-2.5-1.5b-instruct-reasoning-sft (SFT) → sauravlmx/qwen-2.5-1.5b-rlvr-ppo (this model)
  • Dataset used to train this model: allenai/RLVR-GSM-MATH-IF-Mixed-Constraints