Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
Abstract
Retaining and up-weighting moderately easy problems in RLVR pipelines for LLMs reduces output verbosity without explicit length penalization.
Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out "easy" problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a model that conflates "thinking longer" with "thinking better". In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is emergent brevity for free: the model learns to solve harder problems without inflating the output length, despite the absence of any explicit length penalization. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) achieve baseline pass@1 AIME25 accuracy while generating solutions that are, on average, nearly twice as short. The code is available on GitHub at https://github.com/MBZUAI-Paris/Frugal-AI, with datasets and models on Hugging Face at https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc.
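
The core idea is a data-mixture choice rather than a reward change: keep moderately easy problems in the RLVR pool and sample them a bit more often. Below is a minimal Python sketch of what such a weighting scheme could look like. The `Problem` class, the use of a base-model solve rate as the difficulty signal, and the specific thresholds and 1.5× up-weighting factor are illustrative assumptions, not the authors' released implementation.

```python
"""Illustrative sketch: up-weighting moderately easy problems when sampling
RLVR training batches, instead of filtering them out. All names, thresholds,
and weights here are assumptions for illustration only."""
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    prompt: str
    answer: str
    solve_rate: float  # hypothetical: fraction of base-model rollouts verified correct (0.0-1.0)


def sampling_weight(p: Problem) -> float:
    """Assign a sampling weight per problem rather than hard-filtering 'easy' ones.

    - Trivial problems (almost always solved) carry little gradient signal, so drop them.
    - Moderately easy problems are kept and modestly up-weighted; their short,
      verifiable solutions act as an implicit length regularizer.
    - Hard problems keep the default weight.
    """
    if p.solve_rate >= 0.95:          # trivial: near-zero learning signal under RLVR
        return 0.0
    if 0.5 <= p.solve_rate < 0.95:    # moderately easy: modest up-weight (factor is an assumption)
        return 1.5
    return 1.0                        # hard: default weight


def sample_batch(pool: List[Problem], batch_size: int) -> List[Problem]:
    """Draw a training batch by weighted sampling over the problem pool."""
    weights = [sampling_weight(p) for p in pool]
    return random.choices(pool, weights=weights, k=batch_size)
```

In this sketch, the up-weighted easy problems simply appear more often in each batch, so short, correct chains keep receiving reward and the output-length distribution is never pushed toward long chains only.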
Community
TL;DR: 🤖 Faster. Smarter. Frugal. And BETTER!
Our open-source RL-trained math model reduces verbosity by ~2× without losing accuracy (and actually improves on some hard reasoning benchmarks like Omni-Hard), showing that easy problems can implicitly regularize output length during RL.
Code is publicly available on GitHub.
Models and data are publicly available on Hugging Face.
