Support configurable advantage estimation in AgenticRLLearner (RLOO, DrGRPO) #1378

@kbhujbal

Description

Problem

The agentic RL learner (tunix/rl/agentic/agentic_grpo_learner.py) currently hardcodes GRPO-style advantage computation. It does not consult the advantage_estimator field from AlgorithmConfig or the function registry, which means alternative estimators such as RLOO and DrGRPO cannot be used for multi-turn agentic training (tool use, reasoning chains, etc.).

This is a significant limitation because:

  • RLOO's lower-variance baseline is especially valuable in agentic settings, where trajectory rewards are noisy due to tool-call stochasticity
  • DrGRPO's unnormalized advantages can be beneficial when reward distributions shift across agentic episodes
  • The non-agentic GRPO learner already supports pluggable advantage estimators via function_registry.get_advantage_estimator(), but the agentic learner bypasses this
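For context, the three estimators differ only in how each rollout's reward is baselined within its group. A minimal standalone numpy sketch (illustrative only, not tunix code) of the group-level math:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO: mean-center within the group, then normalize by the group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """RLOO: each reward minus the leave-one-out mean of the other rewards.

    r_i - (sum - r_i) / (n - 1) simplifies to n * (r_i - mean) / (n - 1).
    """
    n = rewards.shape[0]
    return (rewards - rewards.mean()) * n / (n - 1)

def drgrpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """DrGRPO: mean-center only; no std normalization."""
    return rewards - rewards.mean()
```

All three are drop-in replacements for each other at the group level, which is what makes a registry-based dispatch straightforward.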

Proposed Solution

Refactor AgenticGRPOLearner._compute_advantages() to route through function_registry.get_advantage_estimator(self.algo_config.advantage_estimator) instead of computing group-relative advantages inline. This would:

  1. Enable RLOO, DrGRPO, and any future estimators for agentic RL with zero additional code
  2. Align the agentic and non-agentic learner codepaths
  3. Preserve backward compatibility (default remains "grpo")
