
[Feature] Kubernetes native scheduler support for single-controller mode #724

@d3c3mber


Checklist

  • This feature will maintain backward compatibility with the current APIs in
    areal/api/. If not, please raise a refactor issue first.

Background

AReaL's single-controller mode currently supports only the local scheduler. This proposal adds a Kubernetes-native scheduler so that RL jobs (both LLM RL training and agentic RL training) can run on large-scale clusters in a cloud-native way.

Potential Solution

We plan to add a K8s-based scheduler/controller that schedules and manages the AReaL master component and worker components.
It will include the following features:

  • Running AReaL components in a cloud-native way
  • Letting users bring customized code from git for the RL framework and for the
    rollout/training frameworks
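
To make the two features above more concrete, here is a minimal sketch of how a scheduler might describe one AReaL component as a Kubernetes Pod whose init container clones user code from git into a shared volume. Everything in it is an assumption for illustration: `build_component_pod`, its parameters, the label keys, the `/workspace/code` path, and the use of the `alpine/git` image (assumed to have `git` as its entrypoint) are hypothetical and not part of AReaL's current API.

```python
from typing import Any, Dict, List, Optional

CODE_DIR = "/workspace/code"  # hypothetical mount path for user code


def build_component_pod(
    job_name: str,
    role: str,  # e.g. "master", "rollout", "train" (hypothetical role names)
    index: int,
    image: str,
    command: List[str],
    git_repo: Optional[str] = None,
    git_ref: str = "main",  # branch or tag; `git clone --branch` does not take a commit SHA
    gpus: int = 0,
) -> Dict[str, Any]:
    """Return a Kubernetes Pod manifest (as a plain dict) for one component."""
    code_mount = {"name": "user-code", "mountPath": CODE_DIR}
    container: Dict[str, Any] = {
        "name": role,
        "image": image,
        "command": command,
        "volumeMounts": [code_mount],
    }
    if gpus > 0:
        # Resource name exposed by the NVIDIA device plugin.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}

    spec: Dict[str, Any] = {
        "restartPolicy": "Never",
        "containers": [container],
        # emptyDir is shared between the init container and the main container.
        "volumes": [{"name": "user-code", "emptyDir": {}}],
    }
    if git_repo is not None:
        # Clone the user's code before the component starts.
        spec["initContainers"] = [
            {
                "name": "fetch-code",
                "image": "alpine/git",  # assumption: entrypoint is `git`
                "args": ["clone", "--depth", "1", "--branch", git_ref,
                         git_repo, CODE_DIR],
                "volumeMounts": [code_mount],
            }
        ]

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            # Pod names must be lowercase DNS-compatible strings.
            "name": f"{job_name}-{role}-{index}",
            "labels": {"areal/job": job_name, "areal/role": role},
        },
        "spec": spec,
    }


if __name__ == "__main__":
    pod = build_component_pod(
        job_name="demo",
        role="rollout",
        index=0,
        image="example.com/areal:latest",  # placeholder image
        command=["python", "-m", "my_entrypoint"],  # placeholder command
        git_repo="https://example.com/user/custom-rl.git",  # placeholder repo
        gpus=1,
    )
    print(pod["metadata"]["name"])
```

A manifest built this way could then be submitted with the official Kubernetes Python client, for example `kubernetes.client.CoreV1Api().create_namespaced_pod(namespace=..., body=pod)`. A real scheduler would likely also need services for worker discovery, status watching, and cleanup, none of which are shown here.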

