Is your feature request related to a problem? Please describe.
I'm working on the Tunix hackathon and running into stability issues training on rubric-based rewards (for the "show your work" objective). The current GRPO works well when you have binary correct/wrong signals from a verifier, but my setup uses continuous scores from a reward model evaluating reasoning quality.
The problem is that GRPO's std normalization amplifies noise when rewards are subjective. I'm also seeing reward hacking, since there's no KL mechanism to keep the policy from drifting away from the reference model to exploit the reward model.
Describe the solution you'd like
Add RLOO (REINFORCE Leave-One-Out) as an alternative advantage estimator. The main difference from GRPO:
# GRPO (current)
A_i = (R_i - mean(R)) / std(R)
# DrGRPO (removes std, already exists)
A_i = R_i - mean(R)
# RLOO (leave-one-out baseline + KL in reward)
A_i = (R_i - β * KL_i) - mean(R_j - β * KL_j where j != i)
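The RLOO update above can be sketched in a few lines of plain Python (a minimal illustration of the math, not Tunix code; the function name and the default beta are hypothetical):

```python
def rloo_advantages(rewards, kls, beta=0.04):
    """Leave-one-out advantages with KL folded into the reward first."""
    # Shape each reward with its per-sample KL penalty (kl_in_reward behavior):
    # R'_i = R_i - beta * KL_i.
    shaped = [r - beta * k for r, k in zip(rewards, kls)]
    n, total = len(shaped), sum(shaped)
    # Baseline for sample i is the mean of the *other* n - 1 shaped rewards,
    # so a sample's own reward never appears in its baseline, and there is
    # no division by the group std.
    return [r - (total - r) / (n - 1) for r in shaped]
```

For example, with four rollouts scored [1, 0, 0, 1] and zero KL, the two high-reward samples get advantage +2/3 and the others -2/3; a single noisy outlier shifts the baseline slightly but cannot rescale every advantage the way std normalization can.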
RLOO uses a leave-one-out baseline instead of a group mean that includes the sample itself, which is more numerically stable. Critically, it folds KL directly into the reward, R'_i = R_i - β * KL_i, before advantage computation, rather than adding it to the loss function afterward.
Ahmadian et al. 2024 showed RLOO is more robust to noisy rewards than PPO/GRPO and outperforms DPO on preference tasks. It's also what PRIME uses under the hood.
Implementation approach
Verified that Tunix already has the infrastructure for this (function_registry.py):
- Add `advantage_estimator='rloo'` option in `GRPOConfig`
- Add `kl_in_reward: bool = False` parameter to control where KL is applied
- Register a new advantage estimator function with `@function_registry.register_advantage_estimator("rloo")` (following the drgrpo_learner.py pattern)
- Modify reward computation in `_generate_and_compute_advantage()` to optionally fold KL into rewards when `kl_in_reward=True`
This follows the existing pluggable advantage estimator pattern and doesn't require a separate learner class.
Additional context
This would help hackathon participants working on non-verifiable tasks. Right now GRPO is tuned for math/code with binary rewards - RLOO would extend Tunix to creative/subjective domains.
Reference implementations:
Checklist