Qwen3-4B-Instruct-2507 β€” Capstone MathRL

Fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using a two-stage SFT β†’ GRPO pipeline for mathematical reasoning with calculator tool use.

Author: Mohammad Rafi


Base Model

  • Model: Qwen/Qwen3-4B-Instruct-2507
  • Parameters: 4B
  • Context length: 32k tokens

SFT Adapter β€” sft_adapter/

| Parameter | Value |
| --- | --- |
| Method | LoRA (Supervised Fine-Tuning) |
| LoRA rank | 32 |
| Epochs | 2 |
| Training samples | 500 |
| Task | Math reasoning (GSM8K + NuminaMath) |
| Size | 270.92 MB |
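LoRA trains only a pair of small low-rank factors per weight matrix instead of the full 4B parameters. A minimal numerical sketch of that rank-r update follows; only the rank idea comes from the table above (the adapter uses r=32), while the toy sizes, the alpha = 2r scaling, and the Gaussian init are illustrative assumptions:

```python
import random

def lora_delta(B, A, alpha, r):
    """Low-rank update ΔW = (alpha / r) * B @ A, the core of LoRA.
    B is d_out x r, A is r x d_in; only these small factors are trained."""
    d_out, d_in = len(B), len(A[0])
    scale = alpha / r
    return [[scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d_in)] for i in range(d_out)]

# Toy sizes for illustration; the adapter above uses r=32 on a 4B model.
r, d = 2, 4
random.seed(0)
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]  # assumed init
B = [[0.0] * r for _ in range(d)]  # B starts at zero, so ΔW starts at zero
delta = lora_delta(B, A, alpha=2 * r, r=r)
```

Because B is initialized to zero, the fine-tuned model starts out identical to the base model and only drifts as B and A are trained.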

GRPO Adapter β€” grpo_adapter/

| Parameter | Value |
| --- | --- |
| Method | GRPO (Group Relative Policy Optimization) |
| Training samples | 400 |
| Group size | 8 |
| Learning rate | 3e-6 |
| Substeps | 1 |
| Curriculum | easy → intermediate → hard |
| Size | 270.92 MB |
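GRPO scores each sampled answer relative to the other completions in its own group rather than against a learned value function. A minimal sketch of that advantage computation, using the group size of 8 from the table above (the reward values themselves are made-up correctness scores):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantage: normalize each reward against its group's mean and
    standard deviation, so no separate critic/value model is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of 8 sampled completions for the same math problem;
# 1.0 = correct final answer, 0.0 = incorrect (illustrative only).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
```

Correct completions get a positive advantage and incorrect ones a negative advantage, and the advantages within each group sum to zero by construction.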

Recommended: Use grpo_adapter/ β€” trained through the full SFT + GRPO pipeline.


Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

# Load the GRPO adapter (recommended)
model = PeftModel.from_pretrained(
    base,
    "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL",
    subfolder="grpo_adapter",
)
model = model.merge_and_unload()  # fold the LoRA weights into the base model

# Or load the SFT adapter only:
# model = PeftModel.from_pretrained(
#     base,
#     "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL",
#     subfolder="sft_adapter",
# )
# model = model.merge_and_unload()
```
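The calculator tool use mentioned above needs a harness that finds tool calls in the model's output and substitutes their results. The sketch below is one possible such loop; the `<calc>...</calc>` tag format is a hypothetical assumption for illustration, since the model card does not specify the actual tool-call protocol:

```python
import ast
import operator
import re

# Safe arithmetic evaluator: only these AST node types are allowed.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def _eval(node):
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp):
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp):
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")

def run_calculator_calls(text):
    """Replace each hypothetical <calc>expr</calc> span with its result."""
    return re.sub(r"<calc>(.*?)</calc>",
                  lambda m: str(_eval(ast.parse(m.group(1), mode="eval"))),
                  text)
```

Using `ast` instead of `eval` keeps the tool restricted to plain arithmetic, so model output cannot execute arbitrary Python.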