Abstract
Quantile Advantage Estimation stabilizes reinforcement learning with verifiable rewards by replacing the group mean baseline with a quantile baseline, curbing entropy collapse and explosion and improving the reasoning performance of large language models.
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
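To make the baseline change concrete, here is a minimal sketch contrasting a GRPO-style mean baseline with a QAE-style quantile baseline on one query group. It assumes binary verifiable rewards, a group of 8 sampled responses, an illustrative K = 0.8, and NumPy's default quantile interpolation; function names and the exact quantile convention are assumptions, not taken from the paper's code.

```python
import numpy as np

def group_advantages_mean(rewards):
    """GRPO-style advantage (sketch): subtract the group mean reward."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def group_advantages_quantile(rewards, k=0.8):
    """QAE-style advantage (sketch): subtract the group-wise K-quantile."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - np.quantile(rewards, k)

# One hard query, 8 sampled responses with binary verifiable rewards;
# the single success acts as the reward outlier in this group.
rewards = [1, 0, 0, 0, 0, 0, 0, 0]

print(group_advantages_mean(rewards))
# mean baseline = 0.125 -> every failure gets advantage -0.125,
# so 7/8 responses are pushed down even though the query is simply hard.

print(group_advantages_quantile(rewards, k=0.8))
# 0.8-quantile baseline = 0 -> only the rare success gets +1;
# all other responses receive zero advantage (sparse credit assignment).
```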
Community
Problem. In value-free RL for LLM reasoning (e.g., GRPO/DAPO), training often oscillates between entropy explosion (over-random updates driven by negative advantages) and entropy collapse (premature determinism), hurting scaling.
Observation. The group mean baseline is brittle under reward outliers: a few high-reward samples inflate the baseline and turn many plausible responses into negative-advantage samples, amplifying instability.
Method (QAE). Replace the mean with a K-quantile baseline per query group. This induces a two-regime gate:
- Hard queries (low success rate): reinforce rare successes only.
- Easy queries (high success rate): penalize residual failures only.
A single K controls how many responses receive a non-zero advantage, balancing exploration and exploitation and yielding two-sided entropy safety under first-order softmax updates; a small sketch of the gate follows.
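The sketch below illustrates the two-regime gate on binary rewards, assuming a group size of 8, an illustrative K = 0.8, and NumPy's default quantile interpolation (all assumptions for exposition, not the paper's implementation).

```python
import numpy as np

def qae_advantages(rewards, k=0.8):
    """Sketch of the QAE gate: advantage = reward - group K-quantile baseline."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - np.quantile(rewards, k)

# Two illustrative query groups of 8 binary-reward responses each.
hard_query = [1, 0, 0, 0, 0, 0, 0, 0]   # success rate p = 0.125 <= 1 - K
easy_query = [1, 1, 1, 1, 1, 1, 1, 0]   # success rate p = 0.875 >  1 - K

for name, rewards in [("hard", hard_query), ("easy", easy_query)]:
    adv = qae_advantages(rewards, k=0.8)
    print(name, adv, f"non-zero advantages: {np.count_nonzero(adv)}/{len(rewards)}")

# hard query: baseline = 0 -> only the rare success is reinforced (+1)
# easy query: baseline = 1 -> only the residual failure is penalized (-1)
# In both regimes 7/8 responses get exactly zero advantage, which is how a
# single K controls the fraction of responses with non-zero advantage.
```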
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- CURE: Critical-Token-Guided Re-Concatenation for Entropy-Collapse Prevention (2025)
- CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning (2025)
- EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning (2025)
- Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting (2025)
- From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature (2025)
- Learning More with Less: A Dynamic Dual-Level Down-Sampling Framework for Efficient Policy Optimization (2025)
- DCPO: Dynamic Clipping Policy Optimization (2025)