MUR: Momentum Uncertainty guided Reasoning for Large Language Models
Abstract
Momentum Uncertainty-guided Reasoning (MUR) dynamically optimizes reasoning budgets in Large Language Models during inference, reducing computation and enhancing accuracy.
Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.
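The momentum idea from the abstract — aggregating stepwise uncertainty over time and spending extra compute only on unusually uncertain steps — can be sketched as follows. This is a minimal illustration, not the authors' exact formulation: the uncertainty proxy (mean negative log-probability per step), the trigger condition, and the update rule are assumptions; `gamma` stands in for the gamma-control hyperparameter.

```python
def step_uncertainty(token_logprobs):
    """Uncertainty of one reasoning step, here taken as the mean negative
    log-probability of its tokens (a common proxy; the paper may define
    uncertainty differently)."""
    return -sum(token_logprobs) / len(token_logprobs)

def momentum_uncertainty_schedule(steps_logprobs, gamma=0.9):
    """Sketch of a MUR-style control loop: maintain an exponential moving
    average ("momentum") of stepwise uncertainty, and flag a step for
    extra test-time scaling when its own uncertainty exceeds the momentum.
    Returns the indices of steps that would receive extra thinking budget."""
    momentum = None
    scaled_steps = []
    for t, logprobs in enumerate(steps_logprobs):
        u_t = step_uncertainty(logprobs)
        if momentum is None:
            # Initialize momentum with the first step's uncertainty.
            momentum = u_t
        else:
            if u_t > momentum:
                # Current step is more uncertain than the running history:
                # allocate additional reasoning budget to this step.
                scaled_steps.append(t)
            # Aggregate the new observation into the momentum term;
            # larger gamma means a longer memory of past uncertainty.
            momentum = gamma * momentum + (1.0 - gamma) * u_t
    return scaled_steps
```

With `gamma` close to 1 the schedule reacts slowly to local spikes, so fewer steps trigger scaling; lowering `gamma` makes the trigger more sensitive, which matches the abstract's description of tuning the budget via a single hyperparameter.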
Community
🙏 Clarification Questions
Hello, I’m not deeply familiar with this research area, so I may have misunderstood certain points. I appreciate your help in clarifying the following:
- Performance of MUR vs. Per‑Step Scale
Why does MUR outperform "Per‑Step Scale," even though Per‑Step Scale applies full scaling on every step? In Figure 4, the dashed line representing Per‑Step Scale accuracy (i.e., an upper‑bound baseline) falls below the MUR curve. Did you analyze the reasons for this phenomenon? For example, is it possible that MUR can scale multiple times per step, while Per‑Step Scale scales strictly once per step?
- Number of Reasoning Steps: MUR vs. CoT vs. Per‑Step Scale
MUR appears to use fewer average reasoning steps than standard CoT, and even fewer than Per‑Step Scale (Figure 5). Why is that? Also, I believe "Per-Step Scale Accuracy" in Figure 5 is a typo for "Per-Step Scale"?
How are reasoning steps defined and segmented in your experiments?
Thank you very much for your time and assistance; I really appreciate your help in understanding these points.