# Continual Post-Training of LLMs via Offline GRPO for Mathematical Reasoning

## Affiliation

KRAFTON & SKT

## Overview
In this post, we explore a new approach to enhancing the reasoning capabilities of LLMs through continual post-training. While pre-training equips LLMs with broad linguistic knowledge, it often falls short on complex reasoning tasks such as mathematics and code. Recent models have shown that Reinforcement Learning with Verifiable Rewards (RLVR) can help bridge this gap, but existing methods rely on slow and limited online training. We propose an offline alternative that trains on teacher-generated trajectories and introduce a novel variant of Group Relative Policy Optimization (GRPO) that better captures high-quality reasoning traces, even when every response in a group receives a positive reward. Our experiments on mathematical reasoning show that this method leads to consistent improvements.
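To make the idea concrete, below is a minimal sketch of how a group-relative advantage with an added bias term could be computed over offline, teacher-generated trajectories. The details (function name, bias value, and the exact normalization) are illustrative assumptions, not the exact formulation from our blog post.

```python
import torch

def grpo_advantages_with_bias(rewards: torch.Tensor, bias: float = 0.1, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt's group of trajectories.

    Standard GRPO normalizes rewards within the group, which collapses to zero
    when every trajectory is correct (all rewards equal). A small positive bias
    is one way to keep a learning signal on such all-positive groups; the exact
    form used in this work may differ.
    """
    mean = rewards.mean()
    std = rewards.std()
    advantages = (rewards - mean) / (std + eps)  # standard GRPO normalization
    advantages = advantages + bias               # assumed bias term for all-correct groups
    return advantages

# Offline setting: rewards come from verifiable checks on teacher-generated
# trajectories (e.g., 1.0 if the final answer matches, else 0.0), computed once
# rather than sampled from the current policy online.
group_rewards = torch.tensor([1.0, 1.0, 1.0, 1.0])  # an all-correct group
print(grpo_advantages_with_bias(group_rewards))      # nonzero thanks to the bias
```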
For more details, please refer to our blog post.
## Results
| Model | Method | AIME25 | AMC23 | LiveCodeBench | GPQA-Diamond | IFEval |
|---|---|---|---|---|---|---|
| Openthinker3-7B | Base | 57.2915 | 92.617 | 63.968 | 50.947 | 50.09 |
| | Offline GRPO (+bias) | 59.5315 | 93.516 | 64.995 | 49.684 | 51.66 |
| Openthinker2-7B | Base | 39.792 | 88.633 | 56.115 | 45.833 | 53.3 |
| | Offline GRPO (+bias) | 40.3645 | 87.656 | 55.944 | 46.843 | 52.20 |
| AceReason-Nemotron-1.1-7B | Base | 64.635 | 92.93 | 72.383 | 52.462 | 36.02 |
| | Offline GRPO (+bias) | 65.521 | 93.164 | 72.603 | 54.356 | 38.23 |