
Add RLOO (REINFORCE Leave-One-Out) learner for lower-variance policy …#1377

Open
kbhujbal wants to merge 1 commit into google:main from kbhujbal:feat/rloo-learner

Conversation

@kbhujbal

@kbhujbal kbhujbal commented Apr 9, 2026

Summary

Implements the RLOO (REINFORCE Leave-One-Out) advantage estimator as a first-class learner in Tunix, addressing #953.

RLOO replaces GRPO's group-mean advantage baseline with a leave-one-out baseline: for each completion, the baseline is the mean reward of the other completions sampled for the same prompt. This yields unbiased advantage estimates with lower variance than standard GRPO, without requiring a separate critic network (unlike PPO).
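To make the baseline concrete: with G completions per prompt, completion i's advantage is its reward minus the mean reward of the other G−1 completions, which is algebraically equal to G/(G−1) times the group-mean-centered reward. A minimal NumPy sketch of that computation (illustrative only, not the Tunix implementation):

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for one prompt's group of G rewards.

    For each completion i, the baseline is the mean reward of the
    other G - 1 completions: A_i = r_i - mean(r_j for j != i).
    """
    g = rewards.shape[0]
    # Leave-one-out mean: subtract r_i from the group sum, divide by G - 1.
    loo_baseline = (rewards.sum() - rewards) / (g - 1)
    return rewards - loo_baseline

rewards = np.array([1.0, 2.0, 3.0])
adv = rloo_advantages(rewards)        # [-1.5, 0.0, 1.5]
# Equivalent closed form: (G / (G - 1)) * (r - mean(r)).
assert np.allclose(adv, (3 / 2) * (rewards - rewards.mean()))
```

Note that the advantages sum to zero within a group, and unlike GRPO's shared group-mean baseline, each completion's baseline excludes its own reward, which is what keeps the estimate unbiased.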

Changes

  • tunix/rl/grpo/rloo_learner.py — Add RLOOConfig and RLOOLearner, following the same pattern as DrGRPOConfig/DrGRPOLearner
  • tunix/rl/algorithm_config.py — Register "rloo" and "drgrpo" as valid algo variants and advantage estimators
  • tunix/cli/grpo_main.py — Route to RLOOLearner when advantage_estimator=rloo is set in grpo_config
  • tunix/cli/base_config.yaml — Document advantage_estimator and loss_agg_mode config options
  • tests/rl/grpo/rloo_learner_test.py — Comprehensive tests (config, advantage math, e2e training)
  • examples/rl/rloo/gsm8k/run_qwen3.sh — Example training script for GSM8K with Qwen3

Usage

Switch from GRPO to RLOO with a single config change:

grpo_config.advantage_estimator=rloo

Or via Python API:

from tunix.rl.grpo.rloo_learner import RLOOConfig, RLOOLearner

config = RLOOConfig(num_generations=8, beta=0.04, epsilon=0.2)
learner = RLOOLearner(rl_cluster=cluster, algo_config=config, reward_fns=reward_fn)
learner.train(dataset)

Resolves #953

Reference

Ahmadian et al., "Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs", 2024.

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and all unit tests pass.
  • I have added all appropriate doc-strings/documentation.
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have signed the Contributor License Agreement.
  • I have followed Contribution Guidelines.

…optimization

Implements the RLOO advantage estimator as a first-class learner in Tunix,
addressing google#953. RLOO uses leave-one-out baselines instead of group-mean
baselines, providing unbiased advantage estimates with lower variance than
standard GRPO — without requiring a separate critic network.

Changes:
- Add RLOOConfig and RLOOLearner (tunix/rl/grpo/rloo_learner.py)
- Register "rloo" and "drgrpo" as valid algo variants and advantage
  estimators in AlgorithmConfig
- Add CLI integration: grpo_main.py routes to RLOOLearner when
  advantage_estimator=rloo is set in grpo_config
- Document advantage_estimator and loss_agg_mode options in base_config.yaml
- Add comprehensive tests for config, advantage computation, and
  end-to-end training
- Add example training script for GSM8K with Qwen3

Reference: Ahmadian et al., "Back to Basics: Revisiting REINFORCE-Style
Optimization for Learning from Human Feedback in LLMs", 2024.
https://arxiv.org/abs/2402.14740
