| --- |
| base_model: Qwen/Qwen3-4B-Instruct-2507 |
| library_name: peft |
| tags: |
| - lora |
| - sft |
| - grpo |
| - reinforcement-learning |
| - math |
| - tool-use |
| --- |
| |
| # Qwen3-4B-Instruct-2507 β Capstone MathRL |
|
|
| Fine-tuned from [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) using a two-stage SFT β GRPO pipeline for mathematical reasoning with calculator tool use. |
|
|
| **Author:** Mohammad Rafi |
|
|
| --- |
|
|
| ## Base Model |
|
|
| - **Model:** `Qwen/Qwen3-4B-Instruct-2507` |
| - **Parameters:** 4B |
| - **Context length:** 32k tokens |
|
|
| --- |
|
|
| ## SFT Adapter β `sft_adapter/` |
| |
| | Parameter | Value | |
| |-----------|-------| |
| | Method | LoRA (Supervised Fine-Tuning) | |
| | LoRA rank | 32 | |
| | Epochs | 2 | |
| | Training samples | 500 | |
| | Task | Math reasoning (GSM8K + NuminaMath) | |
| | Size | 270.92 MB | |
| |
| --- |
| |
| ## GRPO Adapter β `grpo_adapter/` |
|
|
| | Parameter | Value | |
| |-----------|-------| |
| | Method | GRPO (Group Relative Policy Optimization) | |
| | Training samples | 400 | |
| | Group size | 8 | |
| | Learning rate | 3e-6 | |
| | Substeps | 1 | |
| | Curriculum | easy β intermediate β hard | |
| | Size | 270.92 MB | |
|
|
| > **Recommended:** Use `grpo_adapter/` β trained through the full SFT + GRPO pipeline. |
| |
| --- |
| |
| ## Usage |
| |
| ```python |
| from peft import PeftModel |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507") |
| tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507") |
| |
| # Load GRPO adapter (recommended) |
| model = PeftModel.from_pretrained(base, "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL", subfolder="grpo_adapter") |
| model = model.merge_and_unload() |
| |
| # Load SFT adapter only |
| # model = PeftModel.from_pretrained(base, "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL", subfolder="sft_adapter") |
| # model = model.merge_and_unload() |
| ``` |
| |