---
base_model: Qwen/Qwen3-4B-Instruct-2507
library_name: peft
tags:
- lora
- sft
- grpo
- reinforcement-learning
- math
- tool-use
---

# Qwen3-4B-Instruct-2507 — Capstone MathRL

Fine-tuned from [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) using a two-stage SFT → GRPO pipeline for mathematical reasoning with calculator tool use.

**Author:** Mohammad Rafi

---

## Base Model

- **Model:** `Qwen/Qwen3-4B-Instruct-2507`
- **Parameters:** 4B
- **Context length:** 32k tokens

---

## SFT Adapter — `sft_adapter/`

| Parameter | Value |
|-----------|-------|
| Method | LoRA (Supervised Fine-Tuning) |
| LoRA rank | 32 |
| Epochs | 2 |
| Training samples | 500 |
| Task | Math reasoning (GSM8K + NuminaMath) |
| Size | 270.92 MB |

---

## GRPO Adapter — `grpo_adapter/`

| Parameter | Value |
|-----------|-------|
| Method | GRPO (Group Relative Policy Optimization) |
| Training samples | 400 |
| Group size | 8 |
| Learning rate | 3e-6 |
| Substeps | 1 |
| Curriculum | easy → intermediate → hard |
| Size | 270.92 MB |

> **Recommended:** Use `grpo_adapter/` — it was trained through the full SFT + GRPO pipeline.

---

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

# Load the GRPO adapter (recommended)
model = PeftModel.from_pretrained(
    base,
    "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL",
    subfolder="grpo_adapter",
)
model = model.merge_and_unload()

# Or load the SFT adapter only:
# model = PeftModel.from_pretrained(
#     base,
#     "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL",
#     subfolder="sft_adapter",
# )
# model = model.merge_and_unload()
```
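
---

## How GRPO Uses the Group Size

During GRPO training, each prompt is sampled 8 times (the group size above), every completion is scored by a reward function, and each completion's advantage is its reward normalized against the group's mean and standard deviation. The sketch below is purely illustrative of that normalization step, not the actual training code; the reward values are hypothetical.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward against its own group (the core GRPO idea):
    advantage_i = (r_i - mean(group)) / std(group)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions scored the same: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Hypothetical group of 8 sampled solutions scored by a binary correctness reward
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Correct completions in this group receive positive advantages and incorrect ones negative, so the policy is pushed toward the group's better-than-average samples without needing a separate value model.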