
Add RLOO (REINFORCE Leave-One-Out) learner for lower-variance policy …#1377

Open
kbhujbal wants to merge 1 commit into google:main from kbhujbal:feat/rloo-learner

Conversation

@kbhujbal

@kbhujbal kbhujbal commented Apr 9, 2026

Summary

Implements the RLOO (REINFORCE Leave-One-Out) advantage estimator as a first-class learner in Tunix, addressing #953.

RLOO replaces GRPO's group-mean advantage baseline with a leave-one-out baseline: for each completion, the baseline is the mean reward of the other completions sampled for the same prompt. This yields unbiased advantage estimates with lower variance than standard GRPO, without requiring a separate critic network (unlike PPO).
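To make the baseline concrete: with G completions per prompt, completion i's advantage is its reward minus the mean reward of the other G−1 completions, which is algebraically equal to G/(G−1) times the group-mean-centered reward. A minimal NumPy sketch of that computation (illustrative only, not the Tunix implementation):

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for one prompt's group of G rewards.

    For each completion i, the baseline is the mean reward of the
    other G - 1 completions: A_i = r_i - mean(r_j for j != i).
    """
    g = rewards.shape[0]
    # Leave-one-out mean: subtract r_i from the group sum, divide by G - 1.
    loo_baseline = (rewards.sum() - rewards) / (g - 1)
    return rewards - loo_baseline

rewards = np.array([1.0, 2.0, 3.0])
adv = rloo_advantages(rewards)        # [-1.5, 0.0, 1.5]
# Equivalent closed form: (G / (G - 1)) * (r - mean(r)).
assert np.allclose(adv, (3 / 2) * (rewards - rewards.mean()))
```

Note that the advantages sum to zero within a group, and unlike GRPO's shared group-mean baseline, each completion's baseline excludes its own reward, which is what keeps the estimate unbiased.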

Changes

  • tunix/rl/grpo/rloo_learner.py — Add RLOOConfig and RLOOLearner, following the same pattern as DrGRPOConfig/DrGRPOLearner
  • tunix/rl/algorithm_config.py — Register "rloo" and "drgrpo" as valid algo variants and advantage estimators
  • tunix/cli/grpo_main.py — Route to RLOOLearner when advantage_estimator=rloo is set in grpo_config
  • tunix/cli/base_config.yaml — Document advantage_estimator and loss_agg_mode config options
  • tests/rl/grpo/rloo_learner_test.py — Comprehensive tests (config, advantage math, e2e training)
  • examples/rl/rloo/gsm8k/run_qwen3.sh — Example training script for GSM8K with Qwen3

Usage

Switch from GRPO to RLOO with a single config change:

grpo_config.advantage_estimator=rloo

Or via Python API:

from tunix.rl.grpo.rloo_learner import RLOOConfig, RLOOLearner

config = RLOOConfig(num_generations=8, beta=0.04, epsilon=0.2)
learner = RLOOLearner(rl_cluster=cluster, algo_config=config, reward_fns=reward_fn)
learner.train(dataset)

Resolves #953

Reference

Ahmadian et al., "Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs", 2024.

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and all unit tests pass.
  • I have added all appropriate doc-strings/documentation.
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have signed the Contributor License Agreement.
  • I have followed Contribution Guidelines.

…optimization

Implements the RLOO advantage estimator as a first-class learner in Tunix,
addressing google#953. RLOO uses leave-one-out baselines instead of group-mean
baselines, providing unbiased advantage estimates with lower variance than
standard GRPO — without requiring a separate critic network.

Changes:
- Add RLOOConfig and RLOOLearner (tunix/rl/grpo/rloo_learner.py)
- Register "rloo" and "drgrpo" as valid algo variants and advantage
  estimators in AlgorithmConfig
- Add CLI integration: grpo_main.py routes to RLOOLearner when
  advantage_estimator=rloo is set in grpo_config
- Document advantage_estimator and loss_agg_mode options in base_config.yaml
- Add comprehensive tests for config, advantage computation, and
  end-to-end training
- Add example training script for GSM8K with Qwen3

Reference: Ahmadian et al., "Back to Basics: Revisiting REINFORCE-Style
Optimization for Learning from Human Feedback in LLMs", 2024.
https://arxiv.org/abs/2402.14740
