---
base_model: Qwen/Qwen3-4B-Instruct-2507
library_name: peft
tags:
- lora
- sft
- grpo
- reinforcement-learning
- math
- tool-use
---

# Qwen3-4B-Instruct-2507 — Capstone MathRL

Fine-tuned from [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) using a two-stage SFT → GRPO pipeline for mathematical reasoning with calculator tool use.

**Author:** Mohammad Rafi

---

## Base Model

- **Model:** `Qwen/Qwen3-4B-Instruct-2507`
- **Parameters:** 4B
- **Context length:** 32k tokens

---

## SFT Adapter — `sft_adapter/`

| Parameter | Value |
|-----------|-------|
| Method | LoRA (Supervised Fine-Tuning) |
| LoRA rank | 32 |
| Epochs | 2 |
| Training samples | 500 |
| Task | Math reasoning (GSM8K + NuminaMath) |
| Size | 270.92 MB |

---

## GRPO Adapter — `grpo_adapter/`

| Parameter | Value |
|-----------|-------|
| Method | GRPO (Group Relative Policy Optimization) |
| Training samples | 400 |
| Group size | 8 |
| Learning rate | 3e-6 |
| Substeps | 1 |
| Curriculum | easy → intermediate → hard |
| Size | 270.92 MB |

> **Recommended:** Use `grpo_adapter/` — it was trained through the full SFT + GRPO pipeline.

---

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

# Load the GRPO adapter (recommended)
model = PeftModel.from_pretrained(
    base,
    "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL",
    subfolder="grpo_adapter",
)
model = model.merge_and_unload()

# Or load the SFT adapter only:
# model = PeftModel.from_pretrained(
#     base,
#     "MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL",
#     subfolder="sft_adapter",
# )
# model = model.merge_and_unload()
```
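
---

## How GRPO Uses the Group Size

During GRPO training, each prompt is sampled 8 times (the group size above), every completion is scored by a reward function, and each completion's advantage is its reward normalized against the group's mean and standard deviation. The sketch below is purely illustrative of that normalization step, not the actual training code; the reward values are hypothetical.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward against its own group (the core GRPO idea):
    advantage_i = (r_i - mean(group)) / std(group)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions scored the same: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Hypothetical group of 8 sampled solutions scored by a binary correctness reward
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Correct completions in this group receive positive advantages and incorrect ones negative, so the policy is pushed toward the group's better-than-average samples without needing a separate value model.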