MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy


Overview

The model generates <rationale>–<problem> pairs, where:

  • <rationale>: structured reasoning describing concept integration and difficulty design.
  • <problem>: a single Olympiad-level mathematical question that admits a verifiable numeric or symbolic answer.

MathSmith-HC combines complexity and consistency as difficulty rewards, producing more stable problems than MathSmith-Hard.
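As a minimal sketch of consuming the model's output, the two fields above can be pulled out with a small parser. This assumes the generated text wraps each field in literal `<rationale>…</rationale>` and `<problem>…</problem>` tags (the tag names come from the list above; the exact serialization the model emits may differ):

```python
import re

def parse_forged_problem(text: str) -> dict:
    """Extract the <rationale> and <problem> fields from generated text.

    Assumes XML-style tags around each field; a missing field yields an
    empty string rather than an error.
    """
    fields = {}
    for tag in ("rationale", "problem"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        fields[tag] = match.group(1).strip() if match else ""
    return fields

# Hypothetical output string, for illustration only:
sample = (
    "<rationale>Combine modular arithmetic with a pigeonhole argument.</rationale>"
    "<problem>Find the smallest n such that the construction fails.</problem>"
)
parsed = parse_forged_problem(sample)
```

`parsed["problem"]` then holds just the question text, ready to be handed to a solver or an answer verifier.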


MathSmith Pipeline

The MathSmith framework consists of four main stages:

  1. Concept Collection: Randomly sample concept–explanation pairs from PlanetMath to ensure data independence.

  2. Supervised Fine-tuning (SFT): Train the model on collected concept–explanation pairs to establish foundational understanding.

  3. Reinforcement Learning (RL): Optimize the model using Group Relative Policy Optimization (GRPO) with rewards based on:

    • Structural validity
    • Reasoning complexity
    • Answer consistency

  4. Weakness-Focused Self-Improvement: Iteratively identify and address model weaknesses by generating targeted problem variants.
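The three reward signals in stage 3 could be combined along the following lines. This is a hypothetical sketch, not the paper's exact formula: gating the difficulty terms by structural validity and using equal weights are both assumptions.

```python
def combined_reward(structurally_valid: bool,
                    complexity: float,
                    consistency: float,
                    w_complexity: float = 0.5,
                    w_consistency: float = 0.5) -> float:
    """Hypothetical GRPO reward combining the three signals.

    complexity and consistency are assumed normalized to [0, 1].
    A structurally invalid problem earns zero reward regardless of
    how difficult it looks, so the policy cannot trade validity
    for difficulty.
    """
    if not structurally_valid:
        return 0.0
    return w_complexity * complexity + w_consistency * consistency
```

Gating on validity (rather than adding it as a fourth weighted term) is one simple way to make structural correctness a hard constraint while complexity and consistency remain soft objectives.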

Dependencies

  • Transformers 4.52.4
  • PyTorch 2.7.0+cu126
  • Datasets 3.6.0
  • Tokenizers 0.21.1
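Assuming a standard pip setup, the pinned versions above can be installed as follows. Note that CUDA builds of PyTorch (the `+cu126` variant) are published on PyTorch's own wheel index rather than PyPI:

```shell
pip install "transformers==4.52.4" "datasets==3.6.0" "tokenizers==0.21.1"
# CUDA 12.6 build of PyTorch from the PyTorch wheel index:
pip install "torch==2.7.0" --index-url https://download.pytorch.org/whl/cu126
```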

Citation

If you find this work useful, please cite:

@article{zhan2025mathsmith,
  title={MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy},
  author={Zhan, Shaoxiong and Lai, Yanlin and Lu, Ziyu and Lin, Dahua and Yang, Ziqing and Tan, Fei},
  journal={arXiv preprint arXiv:2508.05592},
  year={2025}
}
Model Details

  • Model size: 8B params
  • Tensor type: BF16
Model tree for Jasaxion/MathSmith-HC-Problem-Synthesizer-Qwen3-8B

  • Base model: Qwen/Qwen3-8B-Base
  • Finetuned: Qwen/Qwen3-8B
  • This model: finetune of Qwen/Qwen3-8B
  • Quantizations: 2 models