Problem
The agentic RL learner (tunix/rl/agentic/agentic_grpo_learner.py) currently hardcodes GRPO-style advantage computation. It does not use the advantage_estimator field from AlgorithmConfig or the function registry, so alternative estimators such as RLOO and DrGRPO cannot be used for multi-turn agentic training (tool use, reasoning chains, etc.).
This is a significant limitation because:
- RLOO's lower-variance baseline is especially valuable in agentic settings, where trajectory rewards are noisy due to tool-call stochasticity
- DrGRPO's unnormalized advantages can be beneficial when reward distributions shift across agentic episodes
- The non-agentic GRPO learner already supports pluggable advantage estimators via function_registry.get_advantage_estimator(), but the agentic learner bypasses this
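To make the difference between the three estimators concrete, here is a minimal pure-Python sketch that applies each one to a single group of trajectory rewards. It is not the Tunix implementation: the function names are hypothetical, the GRPO variant assumes a population standard deviation with a small epsilon, and the DrGRPO variant only illustrates dropping the standard-deviation normalization.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative: subtract the group mean, divide by the group std.

    Assumption: population std plus a small eps; the real code may differ.
    """
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def drgrpo_advantages(rewards):
    """DrGRPO-style: subtract the group mean, no std normalization."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def rloo_advantages(rewards):
    """RLOO: baseline for each sample is the mean of the *other* samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# One prompt, four sampled trajectories with binary task rewards.
group_rewards = [1.0, 0.0, 0.0, 1.0]
for fn in (grpo_advantages, drgrpo_advantages, rloo_advantages):
    print(fn.__name__, fn(group_rewards))
```

All three center the rewards within the group; they differ only in the baseline and the scaling, which is why they are natural candidates for a single pluggable interface.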
Proposed Solution
Refactor AgenticGRPOLearner._compute_advantages() to route through function_registry.get_advantage_estimator(self.algo_config.advantage_estimator) instead of computing group-relative advantages inline. This would:
- Enable RLOO, DrGRPO, and any future registered estimators for agentic RL without estimator-specific code in the agentic learner
- Align the agentic and non-agentic learner code paths
- Preserve backward compatibility (default remains "grpo")
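The dispatch pattern proposed above could look roughly like the following self-contained sketch. Everything here is a stand-in: the registry dict, register_advantage_estimator, ToyAlgoConfig, and ToyAgenticLearner are hypothetical names for illustration, and the real function_registry and AlgorithmConfig signatures in Tunix may differ.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, Dict, List

AdvantageFn = Callable[[List[float]], List[float]]

# Hypothetical stand-in for the function registry.
_ADVANTAGE_ESTIMATORS: Dict[str, AdvantageFn] = {}


def register_advantage_estimator(name: str):
    """Decorator that stores an estimator under the given name."""
    def decorator(fn: AdvantageFn) -> AdvantageFn:
        _ADVANTAGE_ESTIMATORS[name] = fn
        return fn
    return decorator


def get_advantage_estimator(name: str) -> AdvantageFn:
    """Look up an estimator by name, failing loudly on unknown names."""
    if name not in _ADVANTAGE_ESTIMATORS:
        known = sorted(_ADVANTAGE_ESTIMATORS)
        raise KeyError(f"Unknown advantage estimator {name!r}; known: {known}")
    return _ADVANTAGE_ESTIMATORS[name]


@register_advantage_estimator("grpo")
def _grpo(rewards: List[float]) -> List[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + 1e-6) for r in rewards]


@register_advantage_estimator("rloo")
def _rloo(rewards: List[float]) -> List[float]:
    n, total = len(rewards), sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


@dataclass
class ToyAlgoConfig:
    # Default stays "grpo" so existing configs behave as before.
    advantage_estimator: str = "grpo"


class ToyAgenticLearner:
    def __init__(self, algo_config: ToyAlgoConfig):
        self.algo_config = algo_config

    def _compute_advantages(self, rewards: List[float]) -> List[float]:
        # Resolve the estimator from config instead of hardcoding GRPO.
        estimator = get_advantage_estimator(self.algo_config.advantage_estimator)
        return estimator(rewards)
```

With this shape, switching estimators is a config change: ToyAgenticLearner(ToyAlgoConfig(advantage_estimator="rloo")) uses RLOO, while ToyAgenticLearner(ToyAlgoConfig()) keeps the GRPO behavior.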
References
- tunix/rl/grpo/grpo_learner.py lines 307-312
- tunix/rl/agentic/agentic_grpo_learner.py