Continual Post-Training of LLMs via Offline GRPO for Mathematical Reasoning

Affiliation

KRAFTON & SKT

Overview

In this post, we explore a new approach to enhancing the reasoning capabilities of LLMs through continual post-training. Pre-training equips LLMs with broad linguistic knowledge, but they often fall short on complex reasoning tasks such as math and code. Recent models have shown that Reinforcement Learning with Verifiable Rewards (RLVR) can help bridge this gap, but existing methods depend on online rollouts, which are slow and limit the amount of training data that can be used. We propose an offline alternative that trains on teacher-generated trajectories, and we introduce a variant of Group Relative Policy Optimization (GRPO) that preserves a learning signal from high-quality reasoning traces even when every output in a group receives a positive reward. In standard GRPO, such all-positive groups normalize to zero advantage and contribute nothing to the gradient. Our experiments on mathematical reasoning show that this method leads to consistent improvements.
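To make the all-positive case concrete, the snippet below is a minimal sketch of group-relative advantages with an additive bias term, assuming the standard GRPO group normalization. The function name `grpo_advantages` and the `bias` value are illustrative placeholders, not the exact formulation from the blog.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, bias: float = 0.1, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages with an additive bias term.

    `rewards` holds the verifiable rewards for the G responses to one prompt.
    Standard GRPO normalizes rewards within the group, so a group in which
    every response is correct (all rewards equal) collapses to zero advantage
    and contributes no gradient. An additive bias keeps such all-positive
    groups informative; the value 0.1 here is a placeholder, not the value
    used in our experiments.
    """
    normalized = (rewards - rewards.mean()) / (rewards.std() + eps)
    return normalized + bias

# A group where every teacher trajectory earns the maximum reward:
rewards = torch.tensor([1.0, 1.0, 1.0, 1.0])
print(grpo_advantages(rewards))  # ~[0.1, 0.1, 0.1, 0.1] instead of all zeros
```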

For more details, please refer to our blog post.

Results

Higher is better for all benchmarks.

| Model | Method | AIME25 | AMC23 | LiveCodeBench | GPQA-Diamond | IFEval |
| --- | --- | --- | --- | --- | --- | --- |
| OpenThinker3-7B | Base | 57.2915 | 92.617 | 63.968 | 50.947 | 50.09 |
| OpenThinker3-7B | Offline GRPO (+bias) | 59.5315 | 93.516 | 64.995 | 49.684 | 51.66 |
| OpenThinker2-7B | Base | 39.792 | 88.633 | 56.115 | 45.833 | 53.3 |
| OpenThinker2-7B | Offline GRPO (+bias) | 40.3645 | 87.656 | 55.944 | 46.843 | 52.20 |
| AceReason-Nemotron-1.1-7B | Base | 64.635 | 92.93 | 72.383 | 52.462 | 36.02 |
| AceReason-Nemotron-1.1-7B | Offline GRPO (+bias) | 65.521 | 93.164 | 72.603 | 54.356 | 38.23 |
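The checkpoint is released as KRAFTON/OpenThinker3-Offline-GRPO-7B on the Hugging Face Hub. Below is a minimal usage sketch with the transformers library; the chat-template prompting and the generation settings are assumptions, not a recipe documented for this model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KRAFTON/OpenThinker3-Offline-GRPO-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Assumes the tokenizer ships a chat template; adjust if the model expects raw prompts.
messages = [{"role": "user", "content": "What is the sum of the first 100 positive integers?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```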
Model Details

7.62B parameters · tensor type F32 · Safetensors format
