arxiv:2509.22611

Quantile Advantage Estimation for Entropy-Safe Reasoning

Published on Sep 26
· Submitted by Junkang Wu on Sep 29
#2 Paper of the day

Abstract

Quantile Advantage Estimation stabilizes reinforcement learning with verifiable rewards by replacing the group-mean baseline with a K-quantile baseline, curbing entropy collapse and explosion and improving the reasoning performance of large language models.

AI-generated summary

Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design -- rather than token-level heuristics -- as the primary mechanism for scaling RLVR.
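To make the two-regime gate concrete, here is a minimal sketch (ours, not the authors' released code) of the advantage computation, assuming binary verifiable rewards and a group of 8 sampled responses per query; `qae_advantages` and its `k` parameter are hypothetical names.

```python
import numpy as np

def qae_advantages(rewards: np.ndarray, k: float = 0.8) -> np.ndarray:
    """Sketch of Quantile Advantage Estimation: subtract the group-wise
    K-quantile of rewards rather than the group mean used in GRPO/DAPO.
    method='higher' keeps the baseline an order statistic of the group
    (our choice; the paper's exact quantile convention may differ)."""
    baseline = np.quantile(rewards, k, method="higher")
    return rewards - baseline

# Hard query (success rate p = 1/8 <= 1 - K): the baseline is 0, so only
# the rare success gets a positive advantage; failures get exactly zero.
hard = np.array([0, 0, 0, 0, 0, 0, 0, 1], dtype=float)
print(qae_advantages(hard))  # [0. 0. 0. 0. 0. 0. 0. 1.]

# Easy query (p = 7/8 > 1 - K): the baseline is 1, so only the residual
# failure is penalized; successes get exactly zero advantage.
easy = np.array([1, 1, 1, 1, 1, 1, 1, 0], dtype=float)
print(qae_advantages(easy))  # [ 0.  0.  0.  0.  0.  0.  0. -1.]
```

With binary rewards, every response sitting at the baseline receives exactly zero advantage, which is where the credit sparsity described above comes from.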

Community

From the paper author and submitter:

Problem. In value-free RL for LLM reasoning (e.g., GRPO/DAPO), training often oscillates between entropy explosion (over-random updates driven by negative advantages) and entropy collapse (premature determinism), hurting scaling.

Observation. The group mean baseline is brittle under reward outliers: it inflates the baseline and turns many plausible responses into negative advantage, amplifying instability.
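A toy numeric contrast (our example, assuming binary rewards and a group of 8) of why the mean is brittle here: a single rare success drags all seven failures into negative advantage under the mean baseline, while a 0.8-quantile baseline leaves them at exactly zero.

```python
import numpy as np

# One rare success among 8 binary rewards -- the "outlier" case.
rewards = np.array([0, 0, 0, 0, 0, 0, 0, 1], dtype=float)

mean_adv = rewards - rewards.mean()               # GRPO-style mean baseline
quantile_adv = rewards - np.quantile(rewards, 0.8)  # QAE-style quantile baseline

print(mean_adv)      # seven failures at -0.125, success at 0.875
print(quantile_adv)  # seven failures at exactly 0, success at 1.0
```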

Method (QAE). Replace the mean with a K-quantile baseline per query group. This induces a two-regime gate:

  • Hard queries (low success rate): reinforce rare successes only.
  • Easy queries (high success rate): penalize residual failures only.

A single K controls how many responses receive non-zero advantage, balancing exploration/exploitation and yielding two-sided entropy safety under first-order softmax updates.
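As a rough simulation (ours: group size 8, K = 0.8, a uniform spread of success rates, and the 'higher' quantile convention are all assumptions) of how this single knob sparsifies credit assignment:

```python
import numpy as np

rng = np.random.default_rng(0)
G, K = 8, 0.8  # assumed group size and quantile level

# Simulate binary rewards for 1000 query groups with varying success rates.
p = rng.uniform(0.05, 0.95, size=1000)
rewards = rng.binomial(1, p[:, None], size=(1000, G)).astype(float)

# 'higher' keeps each baseline an order statistic of its group (our choice).
baselines = np.quantile(rewards, K, axis=1, method="higher", keepdims=True)
sparsity = np.mean((rewards - baselines) == 0)
print(f"zero-advantage responses: {sparsity:.0%}")
# Well over half the responses receive exactly zero advantage here; the exact
# fraction (the paper reports ~80% with tuned K) depends on the reward mix.
```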

[Figures from the post: entropy dynamics; main results table; advantage sparsity]
