arxiv:2506.05333

Kinetics: Rethinking Test-Time Scaling Laws

Published on Jun 5 · Submitted by ZMC2019 on Jun 6
AI-generated summary

Memory-access bottlenecks make test-time scaling with small models less efficient than compute-optimal analyses suggest, motivating a new Kinetics Scaling Law that favors larger models and sparse attention for better test-time performance.

Abstract

We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-N, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when spent on models above a size threshold than on smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes in problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.
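To make the compute-vs-memory-access argument concrete, here is a minimal back-of-the-envelope sketch (not the paper's exact Kinetics cost model): it approximates the bytes read per decode step as weight reads plus KV-cache reads, under bf16 and a memory-bound decode. The 1.7B/32B configurations, the GQA head counts, and the 2K sparse-attention budget are all hypothetical, chosen only to show how the attention (KV) term grows with generation length unless a sparse budget caps it.

```python
# Back-of-the-envelope decode cost (illustrative only; NOT the paper's exact
# Kinetics cost model). Assumes bf16 weights/KV and memory-bound decoding,
# so per-token cost is approximated by bytes read from memory.

def decode_bytes_per_token(n_params, n_layers, n_kv_heads, head_dim,
                           context_len, attended_tokens=None, bytes_per_elem=2):
    """Approximate bytes read per decode step: model weights + KV cache."""
    weight_bytes = n_params * bytes_per_elem
    # KV entries actually read this step; sparse attention caps this at a budget.
    kv_read = context_len if attended_tokens is None else min(context_len, attended_tokens)
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * kv_read * bytes_per_elem
    return weight_bytes + kv_bytes

# Hypothetical configs, loosely shaped like 1.7B and 32B dense models (GQA, 8 KV heads).
small = dict(n_params=1.7e9, n_layers=28, n_kv_heads=8, head_dim=128)
large = dict(n_params=32e9, n_layers=64, n_kv_heads=8, head_dim=128)

for ctx in (1_000, 16_000, 64_000):
    dense_small = decode_bytes_per_token(**small, context_len=ctx)
    sparse_small = decode_bytes_per_token(**small, context_len=ctx, attended_tokens=2_000)
    dense_large = decode_bytes_per_token(**large, context_len=ctx)
    print(f"context={ctx:>6}: 1.7B dense {dense_small / 1e9:5.1f} GB/token | "
          f"1.7B sparse(k=2K) {sparse_small / 1e9:5.1f} GB/token | "
          f"32B dense {dense_large / 1e9:5.1f} GB/token")
```

With these illustrative numbers, the KV-cache term overtakes the weight term for the small model well before 64K tokens, which is the sense in which attention, rather than parameter count, dominates test-time cost.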

Community

Paper submitter · edited Jun 6

🥳 Happy to share our new work – Kinetics: Rethinking Test-Time Scaling Laws
🤔 How to effectively build a powerful reasoning agent?
Existing compute-optimal scaling laws suggest 64K thinking tokens + a 1.7B model > a 32B model.
But that is only half of the picture!
🚨 The O(N²) KV memory access in self-attention dominates the cost of test-time scaling (TTS).
MoEs even worsen the memory bottleneck by cutting compute.
Our new scaling law, Kinetics, suggests: invest in model size first, before spending more on test-time compute.
This insight leads to our next key finding:
✨ Sparse Attention = Scalable TTS
Our Kinetics sparse scaling law says that, when doubling resources, we should prioritize increasing test-time tokens over attention density (a rough sketch of this dense-vs-sparse trade-off follows these highlights).
✅ 60+ points improvement under the same compute budget
✅ 10× lower resource usage for equivalent performance
✅ Sparse attention becomes increasingly valuable in high-cost scenarios
💡 Sparsity is key to unlocking the full potential of TTS because, unlike pretraining, where scaling shows diminishing returns, TTS continues to benefit from increased token generation and more optimized inference paths.
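As a rough illustration of the quadratic-vs-linear point above (an assumption-laden sketch, not the paper's formula), the snippet below counts the total KV-cache entries read over an N-token generation for dense attention versus a hypothetical fixed top-k sparse budget; the specific k = 2,000 budget is made up for illustration.

```python
# Illustrative only: total KV-cache entries read over an N-token generation.
# Dense attention reads the whole cache at every step (~N^2/2 reads, quadratic),
# while a fixed top-k sparse budget reads at most k entries per step (~N*k, linear),
# which is why longer generations or more parallel samples fit the same budget.

def total_kv_reads(gen_tokens, top_k=None):
    return sum(t if top_k is None else min(t, top_k) for t in range(1, gen_tokens + 1))

for n in (8_000, 32_000, 64_000):
    dense = total_kv_reads(n)
    sparse = total_kv_reads(n, top_k=2_000)  # hypothetical attention budget
    print(f"N={n:>6}: dense {dense:.2e} reads | sparse {sparse:.2e} reads | "
          f"ratio {dense / sparse:.0f}x")
```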

Arxiv link: https://arxiv.org/abs/2506.05333
Website: https://infini-ai-lab.github.io/Kinetics/
Twitter: https://x.com/InfiniAILab/status/1931053042876768586
