Qwen3-4B-Instruct-2507 β€” Capstone MathRL

Fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using a two-stage SFT β†’ GRPO pipeline for mathematical reasoning with calculator tool use.

Author: Mohammad Rafi


Base Model

  • Model: Qwen/Qwen3-4B-Instruct-2507
  • Parameters: 4B
  • Context length: 32k tokens

SFT Adapter β€” sft_adapter/

| Parameter | Value |
| --- | --- |
| Method | LoRA (Supervised Fine-Tuning) |
| LoRA rank | 32 |
| Epochs | 2 |
| Training samples | 500 |
| Task | Math reasoning (GSM8K + NuminaMath) |
| Size | 270.92 MB |
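LoRA trains only a pair of small low-rank factors per weight matrix instead of the full 4B parameters. A minimal numerical sketch of that rank-r update follows; only the rank idea comes from the table above (the adapter uses r=32), while the toy sizes, the alpha = 2r scaling, and the Gaussian init are illustrative assumptions:

```python
import random

def lora_delta(B, A, alpha, r):
    """Low-rank update ΔW = (alpha / r) * B @ A, the core of LoRA.
    B is d_out x r, A is r x d_in; only these small factors are trained."""
    d_out, d_in = len(B), len(A[0])
    scale = alpha / r
    return [[scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d_in)] for i in range(d_out)]

# Toy sizes for illustration; the adapter above uses r=32 on a 4B model.
r, d = 2, 4
random.seed(0)
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]  # assumed init
B = [[0.0] * r for _ in range(d)]  # B starts at zero, so ΔW starts at zero
delta = lora_delta(B, A, alpha=2 * r, r=r)
```

Because B is initialized to zero, the fine-tuned model starts out identical to the base model and only drifts as B and A are trained.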

GRPO Adapter β€” grpo_adapter/

| Parameter | Value |
| --- | --- |
| Method | GRPO (Group Relative Policy Optimization) |
| Training samples | 400 |
| Group size | 8 |
| Learning rate | 3e-6 |
| Substeps | 1 |
| Curriculum | easy → intermediate → hard |
| Size | 270.92 MB |
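GRPO scores each sampled answer relative to the other completions in its own group rather than against a learned value function. A minimal sketch of that advantage computation, using the group size of 8 from the table above (the reward values themselves are made-up correctness scores):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantage: normalize each reward against its group's mean and
    standard deviation, so no separate critic/value model is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of 8 sampled completions for the same math problem;
# 1.0 = correct final answer, 0.0 = incorrect (illustrative only).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
```

Correct completions get a positive advantage and incorrect ones a negative advantage, and the advantages within each group sum to zero by construction.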

Recommended: Use grpo_adapter/ β€” trained through the full SFT + GRPO pipeline.


Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

# Load the GRPO adapter (recommended)
model = PeftModel.from_pretrained(
    base,
    "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL",
    subfolder="grpo_adapter",
)
model = model.merge_and_unload()  # fold the LoRA weights into the base model

# Or load the SFT adapter only:
# model = PeftModel.from_pretrained(
#     base,
#     "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL",
#     subfolder="sft_adapter",
# )
# model = model.merge_and_unload()
```
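The calculator tool use mentioned above needs a harness that finds tool calls in the model's output and substitutes their results. The sketch below is one possible such loop; the `<calc>...</calc>` tag format is a hypothetical assumption for illustration, since the model card does not specify the actual tool-call protocol:

```python
import ast
import operator
import re

# Safe arithmetic evaluator: only these AST node types are allowed.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def _eval(node):
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp):
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp):
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")

def run_calculator_calls(text):
    """Replace each hypothetical <calc>expr</calc> span with its result."""
    return re.sub(r"<calc>(.*?)</calc>",
                  lambda m: str(_eval(ast.parse(m.group(1), mode="eval"))),
                  text)
```

Using `ast` instead of `eval` keeps the tool restricted to plain arithmetic, so model output cannot execute arbitrary Python.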