MUR: Momentum Uncertainty guided Reasoning for Large Language Models
Abstract
Momentum Uncertainty-guided Reasoning (MUR) dynamically allocates computational resources to improve reasoning efficiency and accuracy in Large Language Models without additional training.
Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proofs to support the superiority of MUR in terms of stability and bias. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.
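To make the abstract's mechanism concrete, here is a minimal sketch (not the authors' implementation) of how momentum uncertainty and gamma-control could fit together. It assumes that step uncertainty is the mean negative log-probability of the step's tokens, that momentum uncertainty is an exponential moving average with factor alpha, and that extra test-time compute is triggered only when the current step's uncertainty exceeds gamma times the accumulated momentum; the names `step_uncertainty`, `MomentumUncertainty`, and `should_scale` are illustrative, not from the paper.

```python
def step_uncertainty(token_logprobs):
    """Uncertainty of one reasoning step: here, the mean negative
    log-probability of its tokens (an illustrative choice)."""
    return -sum(token_logprobs) / max(len(token_logprobs), 1)


class MomentumUncertainty:
    """Running momentum (exponential moving average) of stepwise uncertainty."""

    def __init__(self, alpha: float = 0.9):
        self.alpha = alpha   # momentum factor
        self.value = None    # aggregated uncertainty so far

    def update(self, u: float) -> float:
        # First step initializes the momentum; later steps blend in smoothly.
        if self.value is None:
            self.value = u
        else:
            self.value = self.alpha * self.value + (1.0 - self.alpha) * u
        return self.value


def should_scale(u_t: float, momentum: float, gamma: float = 1.0) -> bool:
    """gamma-control: invest extra test-time compute on this step only when
    its uncertainty exceeds gamma times the momentum uncertainty."""
    return u_t > gamma * momentum
```

Under this decision rule, a larger gamma makes the trigger stricter so fewer steps receive extra compute, while a smaller gamma scales more steps, which is how a single hyperparameter can tune the overall reasoning budget.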
Community
🙏 Clarification Questions
Hello, I’m not deeply familiar with this research area, so I may have misunderstood certain points. I appreciate your help in clarifying the following:
- Performance of MUR vs. Per‑Step Scale
Why does MUR outperform "Per-Step Scale," even though Per-Step Scale applies full scaling on every step? In Figure 4, the dashed line representing Per-Step Scale accuracy (i.e., an upper-bound baseline) falls below the MUR curve. Did you analyze the reasons for this phenomenon? For example, is it possible that MUR can scale multiple times per step, while Per-Step Scale strictly scales only once per step?
- Number of Reasoning Steps: MUR vs. CoT vs. Per‑Step Scale
MUR appears to use fewer reasoning steps on average than standard CoT, and even fewer compared with Per-Step Scale (Figure 5). Why is that? Also, is "Per-Step Scale Accuracy" in Figure 5 a typo for "Per-Step Scale"?
How are reasoning steps defined and segmented in your experiments?
Thank you very much for your time and assistance—I really appreciate your help in understanding these points.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework (2025)
- Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS (2025)
- LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models (2025)
- TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression (2025)
- Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning (2025)
- KV Cache Steering for Inducing Reasoning in Small Language Models (2025)
- Large Reasoning Models are not thinking straight: on the unreliability of thinking trajectories (2025)