Support configurable advantage estimation in AgenticRLLearner (RLOO, DrGRPO) #1378

@kbhujbal

Description

Problem

The agentic RL learner (tunix/rl/agentic/agentic_grpo_learner.py) currently hardcodes GRPO-style advantage computation. It does not consult the advantage_estimator field from AlgorithmConfig or the function registry, which means alternative estimators such as RLOO and DrGRPO cannot be used for multi-turn agentic training (tool use, reasoning chains, etc.).

This is a significant limitation because:

  • RLOO's lower-variance baseline is especially valuable in agentic settings, where trajectory rewards are noisy due to tool-call stochasticity
  • DrGRPO's unnormalized advantages can be beneficial when reward distributions shift across agentic episodes
  • The non-agentic GRPO learner already supports pluggable advantage estimators via function_registry.get_advantage_estimator(), but the agentic learner bypasses this
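For context, the three estimators differ only in how each rollout's reward is baselined within its group. A minimal standalone numpy sketch (illustrative only, not tunix code) of the group-level math:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO: mean-center within the group, then normalize by the group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """RLOO: each reward minus the leave-one-out mean of the other rewards.

    r_i - (sum - r_i) / (n - 1) simplifies to n * (r_i - mean) / (n - 1).
    """
    n = rewards.shape[0]
    return (rewards - rewards.mean()) * n / (n - 1)

def drgrpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """DrGRPO: mean-center only; no std normalization."""
    return rewards - rewards.mean()
```

All three are drop-in replacements for each other at the group level, which is what makes a registry-based dispatch straightforward.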

Proposed Solution

Refactor AgenticGRPOLearner._compute_advantages() to route through function_registry.get_advantage_estimator(self.algo_config.advantage_estimator) instead of computing group-relative advantages inline. This would:

  1. Enable RLOO, DrGRPO, and any future estimators for agentic RL with zero additional code
  2. Align the agentic and non-agentic learner codepaths
  3. Preserve backward compatibility (default remains "grpo")
