Abstract
Deep Self-Evolving Reasoning (DSER) extends the reasoning capabilities of smaller models by treating iterative solution refinement as a Markov chain, enabling them to solve previously unsolvable problems and surpass larger models in accuracy.
Long-form chain-of-thought reasoning has become a cornerstone of advanced problem solving in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners. By delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
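The convergence claim admits a simple two-state reading, sketched here under stated assumptions (the abstract does not specify the formal model): treat each refinement step as leaving the current solution either correct or incorrect, with improvement probability p and degradation probability q. The stationary distribution then favors the correct state whenever p > q, and majority voting over K parallel chains amplifies even a slim margin:

```latex
% Two-state sketch; p, q, and the binary state abstraction are
% illustrative assumptions, not quantities reported in the paper.
P =
\begin{pmatrix}
  1-q & q   \\ % from "correct":   stay correct, or degrade
  p   & 1-p    % from "incorrect": improve, or stay incorrect
\end{pmatrix},
\qquad
\pi_{\mathrm{correct}} = \frac{p}{p+q} > \tfrac{1}{2}
\iff p > q.
% Hoeffding's inequality then bounds the majority-vote failure
% probability over K independent chains by exp(-2K(pi_correct - 1/2)^2).
```

Under this reading, even a few percentage points of margin between improvement and degradation yield a failure probability that decays exponentially in the number of parallel chains, which is precisely the amplification the abstract attributes to DSER.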
Community
How can a small model reason like a giant? With Deep Self-Evolving Reasoning (DSER), a new paradigm that reframes iterative verification and refinement as a stochastic process. By running multiple parallel, long-horizon "self-evolutions", DSER guides the model to naturally converge on correct solutions. This enabled the compact DeepSeek-R1-0528-Qwen3-8B to solve 5 out of 9 previously unsolvable AIME problems, achieving performance that rivals its 600B-parameter teacher, DeepSeek-R1-0528.
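To make that amplification concrete, below is a minimal Python simulation of the two-state sketch above. The transition probabilities, horizon, and chain count are hypothetical placeholders (the paper reports none of them), and the real method votes over generated answers rather than a correctness flag; this only illustrates how a small per-step improvement margin compounds:

```python
import random

def self_evolve(steps=200, p_improve=0.12, p_degrade=0.10):
    """One long-horizon chain; the state tracks whether the current solution is correct.

    p_improve/p_degrade are hypothetical: the only assumption, as in DSER,
    is that improvement is slightly more likely than degradation.
    """
    correct = False  # hard problems start unsolved
    for _ in range(steps):
        if correct:
            correct = random.random() >= p_degrade  # may degrade back to incorrect
        else:
            correct = random.random() < p_improve   # may improve to correct
    return correct

def dser_majority(num_chains=64, **kwargs):
    """Run independent self-evolving chains in parallel and majority-vote the result."""
    votes = sum(self_evolve(**kwargs) for _ in range(num_chains))
    return votes > num_chains / 2

trials = 100
wins = sum(dser_majority() for _ in range(trials))
# A single chain converges to p/(p+q), about 0.55 here; voting amplifies that margin.
print(f"majority vote correct in {wins}/{trials} trials")
```

Increasing `num_chains` pushes the vote's success rate toward 1 without touching the underlying model, which is the test-time-scaling behavior described above.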
The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs (2025)
- From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs (2025)
- THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning (2025)
- ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute (2025)
- Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm (2025)
- Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization (2025)
- Socratic-Zero: Bootstrapping Reasoning via Data-Free Agent Co-evolution (2025)