---
base_model: Qwen/Qwen3-4B-Instruct-2507
library_name: peft
tags:
- lora
- sft
- grpo
- reinforcement-learning
- math
- tool-use
---
# Qwen3-4B-Instruct-2507 — Capstone MathRL

Fine-tuned from `Qwen/Qwen3-4B-Instruct-2507` using a two-stage SFT → GRPO pipeline for mathematical reasoning with calculator tool use.

**Author:** Mohammad Rafi
## Base Model

- Model: `Qwen/Qwen3-4B-Instruct-2507`
- Parameters: 4B
- Context length: 32k tokens
## SFT Adapter — `sft_adapter/`
| Parameter | Value |
|---|---|
| Method | LoRA (Supervised Fine-Tuning) |
| LoRA rank | 32 |
| Epochs | 2 |
| Training samples | 500 |
| Task | Math reasoning (GSM8K + NuminaMath) |
| Size | 270.92 MB |
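For reference, the LoRA settings above could be expressed as a `peft` `LoraConfig` along these lines. This is a sketch, not the training script: only the rank (32) is documented in the table, so `lora_alpha`, `lora_dropout`, and `target_modules` are illustrative assumptions.

```python
from peft import LoraConfig

# Sketch of the SFT LoRA configuration. Only r=32 is documented above;
# alpha, dropout, and target modules are illustrative assumptions.
sft_lora_config = LoraConfig(
    r=32,
    lora_alpha=64,          # assumption: a common choice is 2x the rank
    lora_dropout=0.05,      # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
```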
## GRPO Adapter — `grpo_adapter/`
| Parameter | Value |
|---|---|
| Method | GRPO (Group Relative Policy Optimization) |
| Training samples | 400 |
| Group size | 8 |
| Learning rate | 3e-6 |
| Substeps | 1 |
| Curriculum | easy → intermediate → hard |
| Size | 270.92 MB |
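To make the group-size parameter concrete: in GRPO, each prompt is answered by a group of sampled completions (here 8), and each completion's advantage is its reward normalized against the rest of its group, removing the need for a separate value model. A minimal sketch of that normalization (illustrative only; the actual training code is not included in this repo):

```python
# Group-relative advantage computation as used in GRPO (sketch).
# With group size G = 8, each prompt gets G sampled completions; each
# completion's advantage is its reward normalized within the group.

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of rewards to zero mean and unit std."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 8 completions for one prompt, rewarded 1.0 for a correct
# final answer and 0.0 otherwise. Correct completions get positive
# advantages, incorrect ones negative.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
```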
**Recommended:** Use `grpo_adapter/`, which was trained through the full SFT + GRPO pipeline.
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

# Load the GRPO adapter (recommended)
model = PeftModel.from_pretrained(
    base,
    "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL",
    subfolder="grpo_adapter",
)
model = model.merge_and_unload()

# Or load the SFT adapter only:
# model = PeftModel.from_pretrained(
#     base,
#     "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL",
#     subfolder="sft_adapter",
# )
# model = model.merge_and_unload()
```
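Since the adapters were trained for calculator tool use, the model's tool calls need to be routed to an actual calculator at inference time. The exact tool schema used in training is not documented here, so the tool name and interface below are assumptions; this is only a minimal, safe arithmetic evaluator you might back such a tool with (avoiding raw `eval()`):

```python
import ast
import operator

# Hypothetical "calculator" tool backend: safely evaluates basic
# arithmetic expressions by walking the Python AST instead of eval().
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    """Evaluate an arithmetic expression restricted to +, -, *, /, ** and parentheses."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval"))

print(calculator("(12 + 8) * 3 / 4"))  # 15.0
```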