Background
AReaL's single-controller mode currently supports only the local scheduler. This proposal adds a Kubernetes-native scheduler so that RL jobs (both LLM RL training and agentic RL training) can run on large-scale clusters in a cloud-native way.
Potential Solution
We plan to add a Kubernetes-based scheduler/controller that schedules and manages the AReaL master component and worker components.
It will include the following features:
- running AReaL components in a cloud-native way
- letting users pull customized code from Git for the RL framework and for the rollout/training frameworks
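
To make the two features above concrete, here is a minimal sketch of the kind of manifest such a scheduler might generate for one component. It is an illustration only, not the proposed implementation: the function `build_component_job`, the label names, the `/workspace/src` path, and the choice of a `batch/v1` Job with an init container based on the public `alpine/git` image are all assumptions made for this example. The manifest is built as a plain dict, so the snippet runs without a cluster or a Kubernetes client library.

```python
import json


def build_component_job(name, role, image, git_repo, git_ref, command,
                        replicas=1, gpus=0):
    """Build a batch/v1 Job manifest (plain dict) for one AReaL component.

    Hypothetical helper for illustration; every name here is an assumption.
    """
    workdir = "/workspace/src"  # assumed checkout location inside the pod
    src_mount = {"name": "src", "mountPath": workdir}

    main = {
        "name": role,
        "image": image,
        "command": list(command),
        "workingDir": workdir,
        "volumeMounts": [src_mount],
    }
    if gpus > 0:
        # Extended resources such as GPUs are requested through limits.
        main["resources"] = {"limits": {"nvidia.com/gpu": str(gpus)}}

    # Init container fetches user code before the component starts.
    # The alpine/git image uses `git` as its entrypoint, so args are git args.
    # Note: `--branch` accepts a branch or tag name, not a commit SHA.
    fetch_code = {
        "name": "fetch-code",
        "image": "alpine/git",
        "args": ["clone", "--depth", "1", "--branch", git_ref,
                 git_repo, workdir],
        "volumeMounts": [src_mount],
    }

    labels = {"app": "areal", "role": role}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "parallelism": replicas,
            "completions": replicas,
            "backoffLimit": 0,  # do not retry failed pods automatically
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "restartPolicy": "Never",
                    "initContainers": [fetch_code],
                    "containers": [main],
                    # Scratch volume shared by the init and main containers.
                    "volumes": [{"name": "src", "emptyDir": {}}],
                },
            },
        },
    }


if __name__ == "__main__":
    job = build_component_job(
        name="areal-rollout",
        role="rollout",
        image="example.com/areal:latest",            # placeholder image
        git_repo="https://example.com/user/repo.git",  # placeholder repo
        git_ref="main",
        command=["python", "-m", "my_rollout_entry"],  # placeholder entry point
        replicas=2,
        gpus=1,
    )
    print(json.dumps(job, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`, or handed to a Kubernetes client. A real scheduler would also need service discovery between master and workers, status tracking, and cleanup, none of which this sketch covers.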