MUR: Momentum Uncertainty guided Reasoning for Large Language Models
Abstract
Momentum Uncertainty-guided Reasoning (MUR) dynamically optimizes reasoning budgets in Large Language Models during inference, reducing computation and enhancing accuracy.
Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.
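The momentum idea from the abstract — aggregating stepwise uncertainty over time and spending extra compute only on unusually uncertain steps — can be sketched as follows. This is a minimal illustration, not the authors' exact formulation: the uncertainty proxy (mean negative log-probability per step), the trigger condition, and the update rule are assumptions; `gamma` stands in for the gamma-control hyperparameter.

```python
def step_uncertainty(token_logprobs):
    """Uncertainty of one reasoning step, here taken as the mean negative
    log-probability of its tokens (a common proxy; the paper may define
    uncertainty differently)."""
    return -sum(token_logprobs) / len(token_logprobs)

def momentum_uncertainty_schedule(steps_logprobs, gamma=0.9):
    """Sketch of a MUR-style control loop: maintain an exponential moving
    average ("momentum") of stepwise uncertainty, and flag a step for
    extra test-time scaling when its own uncertainty exceeds the momentum.
    Returns the indices of steps that would receive extra thinking budget."""
    momentum = None
    scaled_steps = []
    for t, logprobs in enumerate(steps_logprobs):
        u_t = step_uncertainty(logprobs)
        if momentum is None:
            # Initialize momentum with the first step's uncertainty.
            momentum = u_t
        else:
            if u_t > momentum:
                # Current step is more uncertain than the running history:
                # allocate additional reasoning budget to this step.
                scaled_steps.append(t)
            # Aggregate the new observation into the momentum term;
            # larger gamma means a longer memory of past uncertainty.
            momentum = gamma * momentum + (1.0 - gamma) * u_t
    return scaled_steps
```

With `gamma` close to 1 the schedule reacts slowly to local spikes, so fewer steps trigger scaling; lowering `gamma` makes the trigger more sensitive, which matches the abstract's description of tuning the budget via a single hyperparameter.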
Community
🙏 Clarification Questions
Hello, I’m not deeply familiar with this research area, so I may have misunderstood certain points. I appreciate your help in clarifying the following:
- Performance of MUR vs. Per‑Step Scale
Why does MUR outperform "Per‑Step Scale," even though Per‑Step Scale applies full scaling on every step? In Figure 4, the dashed line representing Per‑Step Scale accuracy (i.e., an upper‑bound baseline) falls below the MUR curve. Did you analyze the reasons for this phenomenon? For example, is it possible that MUR can scale multiple times per step, while Per‑Step Scale scales strictly once per step?
- Number of Reasoning Steps: MUR vs. CoT vs. Per‑Step Scale
MUR appears to use fewer average reasoning steps than standard CoT, and even fewer than Per‑Step Scale (Figure 5). Why is that? Also, I believe "Per-Step Scale Accuracy" in Figure 5 is a typo for "Per-Step Scale"?
How are reasoning steps defined and segmented in your experiments?
Thank you very much for your time and assistance; I really appreciate your help in understanding these points.