Papers
arxiv:2507.01951

Test-Time Scaling with Reflective Generative Model

Published on Jul 2
· Submitted by wangyuxin87 on Jul 14
#1 Paper of the day

Abstract

MetaStone-S1, a reflective generative model using a self-supervised process reward model, achieves efficient reasoning and scalable performance with fewer parameters compared to existing models.

AI-generated summary

We introduce our first reflective generative model, MetaStone-S1, which obtains OpenAI o3-mini-level performance via a self-supervised process reward model (SPRM). By sharing the backbone network and using task-specific heads for next-token prediction and process scoring respectively, SPRM integrates the policy model and the process reward model (PRM) into a unified interface without extra process annotation, cutting PRM parameters by over 99% for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suited to test-time scaling (TTS), and we provide three reasoning effort modes (low, medium, and high) based on controllable thinking length. Moreover, we empirically establish a scaling law that relates total thinking computation to TTS performance. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
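The shared-backbone design described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the one-layer "trunk", head sizes, tanh activation, and sigmoid step scoring are all placeholder choices standing in for the real transformer and SPRM head.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 100

# Shared backbone (stand-in for the transformer trunk): one linear layer here.
W_backbone = rng.normal(size=(d_model, d_model))

# Two task-specific heads on top of the SAME trunk features:
W_lm = rng.normal(size=(d_model, vocab))  # policy head: next-token prediction
w_score = rng.normal(size=(d_model, 1))   # SPRM head: tiny relative to the trunk

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(h):
    """h: (seq, d_model) hidden states entering the shared trunk."""
    z = np.tanh(h @ W_backbone)                      # shared computation
    token_probs = softmax(z @ W_lm)                  # policy output
    step_scores = 1.0 / (1.0 + np.exp(-(z @ w_score)))  # per-step score in (0, 1)
    return token_probs, step_scores

h = rng.normal(size=(8, d_model))
probs, scores = forward(h)
```

The point of the sketch is the parameter accounting: the trunk dominates the cost and is shared, while the scoring head adds only `d_model` extra weights, which is how a unified policy+PRM can add so few parameters relative to a separate reward model.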

Community

Paper author / paper submitter:

We introduce MetaStone-S1, a pioneering reflective generative model designed to significantly enhance test-time scaling (TTS) through a new reflective generative form. This work makes three major contributions:

  1. Reflective Generative Form: By sharing the backbone between the policy model and the process reward model (PRM), we develop a unified interface that efficiently integrates the reasoning and evaluation processes, introducing a PRM of only 53M parameters for efficient inference.
  2. Self-supervised Process Reward Model: We introduce a novel self-supervised learning strategy that dynamically assigns outcome rewards to individual reasoning steps without the need for process-level annotations.
  3. Scaling Law and Aha Moment: We empirically demonstrate a scaling law between reasoning computation and TTS performance, and identify the aha moment of the reflective generative form. Extensive evaluations on benchmarks such as AIME24, AIME25, LiveCodeBench, and C-EVAL show that MetaStone-S1 consistently achieves state-of-the-art performance compared to larger open-source and closed-source models.

To foster community-driven research, we have open-sourced MetaStone-S1. Code, models, and resources are available at https://github.com/MetaStone-AI/MetaStone-S1.
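At test time, per-step process scores like those from an SPRM can drive a simple best-of-N selection over candidate reasoning traces. The sketch below is illustrative only: the geometric-mean aggregation and the candidate format are assumptions for the example, not necessarily the paper's exact selection rule.

```python
import math

def trajectory_score(step_scores):
    # Aggregate per-step process scores into one trajectory score via the
    # geometric mean (an illustrative aggregation choice).
    clipped = [min(max(s, 1e-8), 1.0) for s in step_scores]
    return math.exp(sum(math.log(s) for s in clipped) / len(clipped))

def best_of_n(candidates):
    # candidates: list of (answer, step_scores) pairs; pick the trace
    # whose aggregated process score is highest.
    return max(candidates, key=lambda c: trajectory_score(c[1]))[0]

cands = [
    ("A", [0.90, 0.80, 0.70]),
    ("B", [0.99, 0.95, 0.90]),
    ("C", [0.50, 0.60, 0.40]),
]
print(best_of_n(cands))  # → B
```

Under this scheme, the low/medium/high effort modes mentioned above correspond naturally to spending more compute, e.g. sampling more or longer candidate traces before the scorer picks one.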

Models citing this paper 3


Collections including this paper 10