---
base_model: Qwen/Qwen3-4B-Instruct-2507
library_name: peft
tags:
- lora
- sft
- grpo
- reinforcement-learning
- math
- tool-use
---
# Qwen3-4B-Instruct-2507 – Capstone MathRL
Fine-tuned from [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) using a two-stage SFT → GRPO pipeline for mathematical reasoning with calculator tool use.
**Author:** Mohammad Rafi
---
## Base Model
- **Model:** `Qwen/Qwen3-4B-Instruct-2507`
- **Parameters:** 4B
- **Context length:** 32k tokens
---
## SFT Adapter – `sft_adapter/`
| Parameter | Value |
|-----------|-------|
| Method | LoRA (Supervised Fine-Tuning) |
| LoRA rank | 32 |
| Epochs | 2 |
| Training samples | 500 |
| Task | Math reasoning (GSM8K + NuminaMath) |
| Size | 270.92 MB |
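
For reference, below is a minimal sketch of what a rank-32 LoRA configuration for this stage might look like with `peft`. Only the rank comes from the table above; `lora_alpha`, `lora_dropout`, and `target_modules` are illustrative assumptions, not values reported for the actual training run.

```python
from peft import LoraConfig

# Illustrative LoRA setup; only r=32 is taken from the table above.
lora_config = LoraConfig(
    r=32,                         # LoRA rank (from the table)
    lora_alpha=64,                # assumed: commonly set to 2x the rank
    lora_dropout=0.05,            # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```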
---
## GRPO Adapter – `grpo_adapter/`
| Parameter | Value |
|-----------|-------|
| Method | GRPO (Group Relative Policy Optimization) |
| Training samples | 400 |
| Group size | 8 |
| Learning rate | 3e-6 |
| Substeps | 1 |
| Curriculum | easy → intermediate → hard |
| Size | 270.92 MB |
> **Recommended:** Use `grpo_adapter/` – trained through the full SFT + GRPO pipeline.
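
As a hedged illustration only (the card does not state which training framework was used), the group size and learning rate above could be expressed with TRL's `GRPOConfig` roughly as follows; every field not listed in the table is an assumption.

```python
from trl import GRPOConfig

# Illustrative only: field names follow TRL's GRPO implementation,
# which may not be the framework actually used for this adapter.
grpo_config = GRPOConfig(
    output_dir="grpo_adapter",
    learning_rate=3e-6,              # from the table above
    num_generations=8,               # group size: completions sampled per prompt
    per_device_train_batch_size=8,   # assumed; effective batch size must be divisible by num_generations
)
```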
---
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
# Load GRPO adapter (recommended)
model = PeftModel.from_pretrained(base, "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL", subfolder="grpo_adapter")
model = model.merge_and_unload()  # merge LoRA weights into the base model for standalone inference
# Load SFT adapter only
# model = PeftModel.from_pretrained(base, "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL", subfolder="sft_adapter")
# model = model.merge_and_unload()
```
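
Continuing from the snippet above, a short generation example. The prompt is illustrative and the generation settings are assumptions, not tuned values.

```python
messages = [{"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"}]

# Build the chat-formatted prompt and generate a completion
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```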