This model is a fine-tuned version of ns-0/qwen-2.5-1.5b-instruct-reasoning-sft, trained on the allenai/RLVR-GSM-MATH-IF-Mixed-Constraints dataset. It is adapted to improve performance on the mathematical reasoning and instruction-following tasks that the dataset covers.

Model Details

  • Base Model: ns-0/qwen-2.5-1.5b-instruct-reasoning-sft
  • Dataset: allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
  • Training Method: Reinforcement Learning from Verifiable Rewards (RLVR) using a PPO (Proximal Policy Optimization) trainer.
  • Framework: open-instruct (https://github.com/allenai/open-instruct/tree/main) with Ray and vLLM.
  • Model Size: 1.54B parameters (Safetensors)
  • Tensor Type: BF16
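
For quick use, here is a minimal inference sketch with Hugging Face transformers. The repo id is taken from this card; the prompt is an illustrative GSM8K-style question, not a fixed benchmark item.

```python
# Minimal inference sketch for this model (sauravlmx/qwen-2.5-1.5b-rlvr-ppo).
# The example prompt below is hypothetical, in the style of GSM8K.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sauravlmx/qwen-2.5-1.5b-rlvr-ppo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Natalia sold 48 clips in April and half as many in May. "
                   "How many clips did she sell in total?",
    }
]
# Build the chat-formatted prompt and generate a response.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```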

Intermediate Checkpoints

  • main branch: contains the final, fully trained model (400K training episodes).
  • step-* branches: each branch named step-* (e.g., step-100, step-200) contains an intermediate checkpoint saved at that training step. These are useful for research into training dynamics, such as how performance evolves over the course of RLVR training; a loading sketch follows this list.
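
Since each intermediate checkpoint lives on its own branch, it can be loaded by passing the branch name as the `revision` argument in transformers. A short sketch, using the step-100 branch named above (the branches that actually exist may differ; check the repository's branch list):

```python
# Load an intermediate checkpoint from a step-* branch via `revision`.
# "step-100" is the example branch named above; verify it exists first.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "sauravlmx/qwen-2.5-1.5b-rlvr-ppo"
tokenizer = AutoTokenizer.from_pretrained(repo_id, revision="step-100")
model = AutoModelForCausalLM.from_pretrained(repo_id, revision="step-100")
```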

Training logs are available here: https://wandb.ai/sauravpanigrahi-/open_instruct_internal


Model Tree

  • Lineage: Qwen/Qwen2.5-1.5B → ns-0/qwen-2.5-1.5b-instruct-reasoning-sft (SFT) → sauravlmx/qwen-2.5-1.5b-rlvr-ppo (this model)
  • Dataset used to train this model: allenai/RLVR-GSM-MATH-IF-Mixed-Constraints