Abstract
Quantile Advantage Estimation stabilizes reinforcement learning with verifiable rewards by replacing the group mean baseline with a quantile baseline, curbing entropy collapse and explosion and improving the reasoning performance of large language models.
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
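To make the baseline change concrete, here is a minimal sketch contrasting a GRPO-style mean baseline with a QAE-style quantile baseline on one query group. It assumes binary verifiable rewards, a group of 8 sampled responses, an illustrative K = 0.8, and NumPy's default quantile interpolation; function names and the exact quantile convention are assumptions, not taken from the paper's code.

```python
import numpy as np

def group_advantages_mean(rewards):
    """GRPO-style advantage (sketch): subtract the group mean reward."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def group_advantages_quantile(rewards, k=0.8):
    """QAE-style advantage (sketch): subtract the group-wise K-quantile."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - np.quantile(rewards, k)

# One hard query, 8 sampled responses with binary verifiable rewards;
# the single success acts as the reward outlier in this group.
rewards = [1, 0, 0, 0, 0, 0, 0, 0]

print(group_advantages_mean(rewards))
# mean baseline = 0.125 -> every failure gets advantage -0.125,
# so 7/8 responses are pushed down even though the query is simply hard.

print(group_advantages_quantile(rewards, k=0.8))
# 0.8-quantile baseline = 0 -> only the rare success gets +1;
# all other responses receive zero advantage (sparse credit assignment).
```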
Community
Problem. In value-free RL for LLM reasoning (e.g., GRPO/DAPO), training often oscillates between entropy explosion (over-random updates driven by negative advantages) and entropy collapse (premature determinism), hurting scaling.
Observation. The group mean baseline is brittle under reward outliers: a few high-reward samples inflate the baseline and turn many plausible responses into negative-advantage samples, amplifying instability.
Method (QAE). Replace the mean with a K-quantile baseline per query group. This induces a two-regime gate:
- Hard queries (low success rate): reinforce rare successes only.
- Easy queries (high success rate): penalize residual failures only.
A single K controls how many responses receive a non-zero advantage, balancing exploration and exploitation and yielding two-sided entropy safety under first-order softmax updates; a small sketch of the gate follows.
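The sketch below illustrates the two-regime gate on binary rewards, assuming a group size of 8, an illustrative K = 0.8, and NumPy's default quantile interpolation (all assumptions for exposition, not the paper's implementation).

```python
import numpy as np

def qae_advantages(rewards, k=0.8):
    """Sketch of the QAE gate: advantage = reward - group K-quantile baseline."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - np.quantile(rewards, k)

# Two illustrative query groups of 8 binary-reward responses each.
hard_query = [1, 0, 0, 0, 0, 0, 0, 0]   # success rate p = 0.125 <= 1 - K
easy_query = [1, 1, 1, 1, 1, 1, 1, 0]   # success rate p = 0.875 >  1 - K

for name, rewards in [("hard", hard_query), ("easy", easy_query)]:
    adv = qae_advantages(rewards, k=0.8)
    print(name, adv, f"non-zero advantages: {np.count_nonzero(adv)}/{len(rewards)}")

# hard query: baseline = 0 -> only the rare success is reinforced (+1)
# easy query: baseline = 1 -> only the residual failure is penalized (-1)
# In both regimes 7/8 responses get exactly zero advantage, which is how a
# single K controls the fraction of responses with non-zero advantage.
```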
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- CURE: Critical-Token-Guided Re-Concatenation for Entropy-Collapse Prevention (2025)
- CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning (2025)
- EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning (2025)
- Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting (2025)
- From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature (2025)
- Learning More with Less: A Dynamic Dual-Level Down-Sampling Framework for Efficient Policy Optimization (2025)
- DCPO: Dynamic Clipping Policy Optimization (2025)