Add RLOO (REINFORCE Leave-One-Out) learner for lower-variance policy optimization #1377

Open

kbhujbal wants to merge 1 commit into google:main

Conversation
…optimization

Implements the RLOO advantage estimator as a first-class learner in Tunix, addressing google#953. RLOO uses leave-one-out baselines instead of group-mean baselines, providing unbiased advantage estimates with lower variance than standard GRPO, without requiring a separate critic network.

Changes:
- Add RLOOConfig and RLOOLearner (tunix/rl/grpo/rloo_learner.py)
- Register "rloo" and "drgrpo" as valid algo variants and advantage estimators in AlgorithmConfig
- Add CLI integration: grpo_main.py routes to RLOOLearner when advantage_estimator=rloo is set in grpo_config
- Document advantage_estimator and loss_agg_mode options in base_config.yaml
- Add comprehensive tests for config, advantage computation, and end-to-end training
- Add example training script for GSM8K with Qwen3

Reference: Ahmadian et al., "Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs", 2024. https://arxiv.org/abs/2402.14740
Summary
Implements the RLOO (REINFORCE Leave-One-Out) advantage estimator as a first-class learner in Tunix, addressing #953.

RLOO replaces GRPO's group-mean advantage baseline with a leave-one-out baseline: for each completion, the baseline is the mean reward of all other completions to the same prompt. This yields unbiased advantage estimates with lower variance than standard GRPO, without requiring a separate critic network (unlike PPO).
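The baseline swap above is easy to see numerically. The following is a self-contained sketch (not Tunix code): with G completions per prompt, GRPO subtracts the group mean, while RLOO subtracts the leave-one-out mean b_i = (sum_j r_j − r_i) / (G − 1), so each completion's baseline never includes its own reward.

```python
def grpo_advantages(rewards):
    """Group-mean baseline (GRPO's std normalization omitted for clarity)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def rloo_advantages(rewards):
    """Leave-one-out baseline: mean reward of all *other* completions."""
    g = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]

# Binary rewards for G = 4 completions to one prompt:
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))  # [0.5, -0.5, -0.5, 0.5]
print(rloo_advantages(rewards))  # same signs, each scaled by G/(G-1)
```

Note that the two estimators differ only by the factor G/(G − 1); the practical difference is that the leave-one-out baseline is independent of the sample it is subtracted from, which is what keeps the estimate unbiased.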
Changes
- `tunix/rl/grpo/rloo_learner.py` — `RLOOConfig` and `RLOOLearner`, following the same pattern as `DrGRPOConfig`/`DrGRPOLearner`
- `tunix/rl/algorithm_config.py` — Register `"rloo"` and `"drgrpo"` as valid algo variants and advantage estimators
- `tunix/cli/grpo_main.py` — Route to `RLOOLearner` when `advantage_estimator=rloo` is set in `grpo_config`
- `tunix/cli/base_config.yaml` — Document `advantage_estimator` and `loss_agg_mode` config options
- `tests/rl/grpo/rloo_learner_test.py` — Comprehensive tests (config, advantage math, e2e training)
- `examples/rl/rloo/gsm8k/run_qwen3.sh` — Example training script for GSM8K with Qwen3

Usage
Switch from GRPO to RLOO with a single config change:
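The config snippet itself was not captured in this page; based on the option names this PR describes for `base_config.yaml`, the change presumably looks like the following sketch (the nesting under `grpo_config` is assumed):

```yaml
# Hypothetical sketch — option names come from this PR's description;
# exact key nesting within base_config.yaml is assumed.
grpo_config:
  advantage_estimator: rloo   # one of: "grpo", "drgrpo", "rloo"
```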
Or via the Python API:
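The original Python snippet is also missing from this page and its exact API is not recoverable; as a self-contained stand-in, here is a sketch of the dispatch this PR adds to `grpo_main.py` — a learner selected from the `advantage_estimator` field. The estimator and class names come from the PR description; the function itself is hypothetical and does not import Tunix.

```python
def select_learner(advantage_estimator: str) -> str:
    """Hypothetical mirror of grpo_main.py's routing: map the
    advantage_estimator config value to a learner class name
    (names taken from this PR's description)."""
    learners = {
        "grpo": "GrpoLearner",
        "drgrpo": "DrGRPOLearner",
        "rloo": "RLOOLearner",
    }
    if advantage_estimator not in learners:
        raise ValueError(f"unknown advantage_estimator: {advantage_estimator!r}")
    return learners[advantage_estimator]

print(select_learner("rloo"))  # -> RLOOLearner
```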
Resolves #953
Reference
Ahmadian et al., "Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs", 2024. https://arxiv.org/abs/2402.14740
Checklist