Problem. In value-free RL for LLM reasoning (e.g., GRPO/DAPO), training often oscillates between entropy explosion (over-random updates driven by negative advantages) and entropy collapse (premature determinism), hurting scaling.
Observation. The group mean baseline is brittle under reward outliers: it inflates the baseline and turns many plausible responses into negative advantage, amplifying instability.
Method (QAE). Replace the mean with a K-quantile baseline per query group. This induces a two-regime gate:
- Hard queries (low success rate): reinforce rare successes only.
- Easy queries (high success rate): penalize residual failures only.
A single knob, the quantile level K (exposed as `quantile_k`), controls this gate.
File: `./verl/trainer/ppo/core_algos.py` (lines ~315–319):
```python
quantile_k = config.get("quantile_k", -1.0) if config else -1.0
if 0 < quantile_k < 1:
    id2mean[idx] = torch.quantile(scores_tensor, quantile_k)
else:
    id2mean[idx] = torch.mean(scores_tensor)
```
- If `0 < quantile_k < 1`, the baseline becomes the K-quantile; otherwise it falls back to the mean (exactly the GRPO/DAPO behavior).
- No other algorithmic changes are required.
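To make the two-regime gate concrete, here is a minimal, self-contained sketch (not verl code; the `advantages` helper is illustrative, rewards are assumed binary 0/1, and any std normalization the underlying recipe applies is omitted):

```python
import torch

# Toy groups of 10 rollouts each with binary verifiable rewards
# (assumption: 1 = correct, 0 = incorrect).
hard = torch.tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 1.])  # rare success
easy = torch.tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 0.])  # residual failure

def advantages(scores, quantile_k=-1.0):
    # Mirrors the branch above: K-quantile baseline if 0 < quantile_k < 1,
    # otherwise the group mean (GRPO/DAPO behavior).
    if 0 < quantile_k < 1:
        baseline = torch.quantile(scores, quantile_k)
    else:
        baseline = torch.mean(scores)
    return scores - baseline

print(advantages(hard))       # mean baseline: all 9 failures go negative (-0.1 each)
print(advantages(hard, 0.4))  # quantile baseline: only the rare success is reinforced
print(advantages(easy))       # mean baseline: all 9 successes go positive (+0.1 each)
print(advantages(easy, 0.4))  # quantile baseline: only the residual failure is penalized
```

Under the mean baseline the rare success inflates the baseline and drags every failure negative; under the 0.4-quantile baseline only the informative responses carry gradient signal, which is exactly the hard/easy gating described above.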
We inherit environment setup and quick start from VERL. Please follow the official docs:
- Install: https://verl.readthedocs.io/en/latest/start/install.html
- Quick Start: https://verl.readthedocs.io/en/latest/start/quickstart.html
This repo only changes the DAPO recipe by adding a single argument, `quantile_k`. Original DAPO scripts for reference: https://github.com/volcengine/verl/tree/main/recipe/dapo

We provide three ready-to-run scripts (paths relative to `verl/`):
- `./recipe/qae/run_dapo_qwen2.5_32b.sh`
- `./recipe/qae/run_dapo_qwen3-14b-base.sh`
- `./recipe/qae/run_dapo_qwen3-8b-base.sh`
We only pass one extra flag to the DAPO launcher, e.g.:
```diff
- python3 -m recipe.dapo.main_dapo ...
+ python3 -m recipe.dapo.main_dapo ++algorithm.quantile_k=0.4 ...
```
If your launcher loads a YAML config, you can equivalently add:
```yaml
# in your training config
quantile_k: 0.4
```
Both forms are supported; the trainer reads `quantile_k` from the merged config.
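For illustration, here is a minimal sketch of why the two forms are equivalent, assuming an OmegaConf/Hydra-style merge (this is not the verl config-loading code; the names are illustrative):

```python
from omegaconf import OmegaConf

# A YAML-style config and a CLI-style dotlist override merge into one config
# object; the advantage code then reads quantile_k with a mean-baseline fallback.
yaml_cfg = OmegaConf.create({"algorithm": {"quantile_k": -1.0}})   # e.g., from a YAML file
cli_cfg = OmegaConf.from_dotlist(["algorithm.quantile_k=0.4"])     # e.g., from the launcher flag
merged = OmegaConf.merge(yaml_cfg, cli_cfg)

quantile_k = merged.algorithm.get("quantile_k", -1.0)
print(quantile_k)  # 0.4
```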
- Training dynamics (entropy vs. pass@k): QAE suppresses the early entropy spike while improving pass@1, with pass@16 comparable to the mean-baseline recipe.
- Credit assignment sparsity: ≈80% of responses receive zero advantage, concentrating updates on informative samples (a simple way to track this is sketched after this list).
- Composability: QAE composes with token-level methods (e.g., Clip-Cov, KL-Cov) and sequence-level GSPO, providing drop-in gains.
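To track that sparsity yourself, a small hypothetical logging helper (not part of the recipe) is enough:

```python
import torch

def nonzero_advantage_fraction(advantages: torch.Tensor, eps: float = 1e-8) -> float:
    # Fraction of responses in a group whose advantage is (numerically) non-zero.
    # Useful to monitor alongside training entropy when tuning quantile_k.
    return (advantages.abs() > eps).float().mean().item()

# Example: the hard-query group from the earlier sketch under the 0.4-quantile baseline.
adv = torch.tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 1.])
print(nonzero_advantage_fraction(adv))  # 0.1 -> only 10% of responses carry gradient signal
```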
- Role. `quantile_k` controls the fraction of responses with non-zero advantage per group.
  - Larger K → fewer non-zeros → more exploration (prevents collapse).
  - Smaller K → more non-zeros → more exploitation (tames explosion).
- Recommended defaults.
  - Start with `quantile_k = 0.4` (stable with DAPO/Clip-Higher).
  - If you observe early entropy collapse, increase to `0.6`.
  - Tune by monitoring training entropy in addition to accuracy; a single-knob adjustment is usually enough.
- Why sequence-level helps. Token-level controls (clipping/KL) rescale steps but do not change the response-level baseline; QAE fixes the baseline itself, which directly regulates the sign and sparsity of advantages.
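A toy sketch of this argument, treating each response as a single action and assuming binary rewards, hypothetical importance ratios, and PPO-style clipping with epsilon = 0.2 (not verl code):

```python
import torch

rewards = torch.tensor([0., 0., 0., 1.])      # hard query: one rare success
ratios = torch.tensor([1.3, 0.7, 1.0, 1.2])   # hypothetical per-response importance ratios
eps = 0.2

def clipped_surrogate(adv):
    # min(r*A, clip(r)*A): both terms share the sign of A (ratios are positive),
    # so clipping bounds the step size but never changes which responses are
    # pushed up or down; a zero advantage yields a zero surrogate regardless of r.
    return torch.min(ratios * adv, torch.clamp(ratios, 1 - eps, 1 + eps) * adv)

adv_mean = rewards - rewards.mean()                # mean baseline: 3 negatives, 1 positive
adv_qae = rewards - torch.quantile(rewards, 0.4)   # 0.4-quantile baseline: 3 zeros, 1 positive

print(clipped_surrogate(adv_mean))  # every response still receives a (clipped) signal
print(clipped_surrogate(adv_qae))   # only the rare success does
```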
```bibtex
@article{wu2025qae,
  title   = {Quantile Advantage Estimation for Entropy-Safe Reasoning},
  author  = {Junkang Wu and Kexin Huang and Jiancan Wu and An Zhang and Xiang Wang and Xiangnan He},
  journal = {arXiv preprint},
  year    = {2025},
}
```
We build on verl and standard math-reasoning evaluation protocols. QAE is orthogonal to token-level regularizers (e.g., Clip-Cov, KL-Cov) and composes with GSPO.
- Junkang Wu — jkwu0909@gmail.com