MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
Overview
The model generates <rationale>–<problem> pairs, where:
- <rationale>: structured reasoning describing concept integration and difficulty design.
- <problem>: a single Olympiad-level mathematical question that admits a verifiable numeric or symbolic answer.
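To use the generated pairs downstream, the two fields need to be separated from the raw model output. The following is a minimal sketch, assuming the output wraps each field in matching opening and closing tags (`<rationale>...</rationale>` and `<problem>...</problem>`). The closing tags and the helper name `parse_pair` are assumptions for illustration and are not confirmed by this card.

```python
import re
from typing import Optional, Tuple

# Assumed output layout: <rationale>...</rationale><problem>...</problem>.
# The closing tags are an assumption; adjust the pattern to the real output.
_PAIR_RE = re.compile(
    r"<rationale>(?P<rationale>.*?)</rationale>\s*"
    r"<problem>(?P<problem>.*?)</problem>",
    re.DOTALL,
)


def parse_pair(text: str) -> Optional[Tuple[str, str]]:
    """Return (rationale, problem), or None if the output is malformed."""
    match = _PAIR_RE.search(text)
    if match is None:
        return None
    rationale = match.group("rationale").strip()
    problem = match.group("problem").strip()
    # Treat an empty field as a malformed generation.
    if not rationale or not problem:
        return None
    return rationale, problem
```

Returning `None` for malformed output, rather than raising, makes it easy to filter a batch of generations in a list comprehension.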
MathSmith-HC combines complexity and consistency as difficulty rewards, producing more stable problems than MathSmith-Hard.
MathSmith Pipeline
The MathSmith framework consists of four main stages:
1. Concept Collection: Randomly sample concept–explanation pairs from PlanetMath to ensure data independence.
2. Supervised Fine-tuning (SFT): Train the model on collected concept–explanation pairs to establish foundational understanding.
3. Reinforcement Learning (RL): Optimize the model using GRPO with rewards based on:
   - Structural validity
   - Reasoning complexity
   - Answer consistency
4. Weakness-Focused Self-Improvement: Iteratively identify and address model weaknesses by generating targeted problem variants.
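The reward signals of the RL stage can be illustrated with a small sketch. It is not the authors' implementation: the function names, the equal weights, and the convention of zeroing the reward for structurally invalid outputs are all assumptions. Consistency is approximated here as the agreement rate among sampled solver answers, complexity is a placeholder score in [0, 1], and the last function shows the group-relative advantage normalization commonly used with GRPO.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Sequence


def consistency_reward(answers: Sequence[str]) -> float:
    """Fraction of sampled solver answers that match the most common one."""
    if not answers:
        return 0.0
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def combined_reward(
    valid: bool,
    complexity: float,
    consistency: float,
    w_complexity: float = 0.5,   # hypothetical weight
    w_consistency: float = 0.5,  # hypothetical weight
) -> float:
    """Zero for structurally invalid outputs, else a weighted sum."""
    if not valid:
        return 0.0
    return w_complexity * complexity + w_consistency * consistency


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some variants use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, four solver answers `["4", "4", "5", "4"]` give a consistency of 0.75, and a group whose rewards are all equal yields zero advantage for every sample, so it contributes no gradient signal.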
Dependencies
- Transformers 4.52.4
- PyTorch 2.7.0+cu126
- Datasets 3.6.0
- Tokenizers 0.21.1
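One way to confirm that a local environment matches the versions listed above is to compare them against the installed distributions. The sketch below uses only the standard library; the `PINNED` mapping and the `check_versions` helper are illustrative names, and the CUDA build suffix (`+cu126`) is deliberately ignored during comparison.

```python
from importlib import metadata
from typing import Dict

# Versions from the list above, keyed by pip distribution name.
PINNED = {
    "transformers": "4.52.4",
    "torch": "2.7.0",
    "datasets": "3.6.0",
    "tokenizers": "0.21.1",
}


def check_versions(pinned: Dict[str, str]) -> Dict[str, str]:
    """Return package -> message for each missing or mismatched package."""
    problems: Dict[str, str] = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = f"not installed (expected {wanted})"
            continue
        # Drop a local build suffix such as "+cu126" before comparing.
        if found.split("+")[0] != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems
```

Calling `check_versions(PINNED)` returns an empty dictionary when every listed package is installed at the expected version.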
Citation
If you find this work useful, please cite:
@article{zhan2025mathsmith,
title={MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy},
author={Zhan, Shaoxiong and Lai, Yanlin and Lu, Ziyu and Lin, Dahua and Yang, Ziqing and Tan, Fei},
journal={arXiv preprint arXiv:2508.05592},
year={2025}
}