Is your feature request related to a problem? Please describe.
I'm working on the Tunix hackathon and running into stability issues training on rubric-based rewards (for the "show your work" objective). The current GRPO works well when you have binary correct/wrong signals from a verifier, but my setup uses continuous scores from a reward model evaluating reasoning quality.
The problem is that GRPO's std normalization amplifies noise when rewards are subjective. I'm also seeing reward hacking, since there's no KL mechanism to keep the policy from drifting away from the reference model to exploit the reward model.
Describe the solution you'd like
Add RLOO (REINFORCE Leave-One-Out) as an alternative advantage estimator. The main difference from GRPO:
# GRPO (current)
A_i = (R_i - mean(R)) / std(R)
# DrGRPO (removes std, already exists)
A_i = R_i - mean(R)
# RLOO (leave-one-out baseline + KL in reward)
A_i = (R_i - β * KL_i) - mean(R_j - β * KL_j where j != i)
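The RLOO update above can be sketched in a few lines of plain Python (a minimal illustration of the math, not Tunix code; the function name and the default beta are hypothetical):

```python
def rloo_advantages(rewards, kls, beta=0.04):
    """Leave-one-out advantages with KL folded into the reward first."""
    # Shape each reward with its per-sample KL penalty (kl_in_reward behavior):
    # R'_i = R_i - beta * KL_i.
    shaped = [r - beta * k for r, k in zip(rewards, kls)]
    n, total = len(shaped), sum(shaped)
    # Baseline for sample i is the mean of the *other* n - 1 shaped rewards,
    # so a sample's own reward never appears in its baseline, and there is
    # no division by the group std.
    return [r - (total - r) / (n - 1) for r in shaped]
```

For example, with four rollouts scored [1, 0, 0, 1] and zero KL, the two high-reward samples get advantage +2/3 and the others -2/3; a single noisy outlier shifts the baseline slightly but cannot rescale every advantage the way std normalization can.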
RLOO uses a leave-one-out baseline instead of a group mean that includes the sample itself, which is more numerically stable. Critically, it folds KL directly into the reward, R'_i = R_i - β * KL_i, before advantage computation, rather than adding it to the loss function afterward.
Ahmadian et al. 2024 showed RLOO is more robust to noisy rewards than PPO/GRPO and outperforms DPO on preference tasks. It's also what PRIME uses under the hood.
Implementation approach
Verified that Tunix already has the infrastructure for this (function_registry.py):
- Add `advantage_estimator='rloo'` option in `GRPOConfig`
- Add `kl_in_reward: bool = False` parameter to control where KL is applied
- Register a new advantage estimator function with `@function_registry.register_advantage_estimator("rloo")` (following the drgrpo_learner.py pattern)
- Modify reward computation in `_generate_and_compute_advantage()` to optionally fold KL into rewards when `kl_in_reward=True`
This follows the existing pluggable advantage estimator pattern and doesn't require a separate learner class.
Additional context
This would help hackathon participants working on non-verifiable tasks. Right now GRPO is tuned for math/code with binary rewards - RLOO would extend Tunix to creative/subjective domains.
Reference implementations:
Checklist