
Support Elastic PyTorch in TrainJob #2903

@andreyvelich

What you would like to be added?

As part of the Kubeflow Trainer API, we designed the ElasticPolicy API, which should allow users to run PyTorch in elastic mode: https://docs.pytorch.org/docs/stable/elastic/run.html

It would be nice if someone could drive a KEP for this and propose API changes for the next release, Trainer v2.2:

elasticPolicy:
  minNodes: 2
  maxNodes: 5
  metrics:
    - type: Resource
      resource:
        name: nvidia.com/gpu
        target:
          type: Utilization
          averageUtilization: 75
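
To make the intent of minNodes and maxNodes concrete, here is a rough sketch of how those two fields could translate into an elastic torchrun invocation. This is not how Trainer implements it; the ElasticPolicy dataclass, the torchrun_args helper, the endpoint, and the job ID are hypothetical names made up for illustration. The only assumption about PyTorch itself is that torchrun accepts a MIN:MAX range for --nnodes together with the c10d rendezvous flags, as described in the elastic launch docs linked above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElasticPolicy:
    """Hypothetical in-memory mirror of the minNodes/maxNodes fields above."""

    min_nodes: int
    max_nodes: int

    def __post_init__(self) -> None:
        # Reject ranges that torchrun could not satisfy.
        if self.min_nodes < 1:
            raise ValueError("minNodes must be at least 1")
        if self.max_nodes < self.min_nodes:
            raise ValueError("maxNodes must be >= minNodes")


def torchrun_args(
    policy: ElasticPolicy,
    rdzv_endpoint: str,
    rdzv_id: str,
    nproc_per_node: int = 1,
) -> list[str]:
    """Build an elastic torchrun command line (illustrative helper only)."""
    return [
        "torchrun",
        # Elastic mode: the node count may vary between min and max.
        f"--nnodes={policy.min_nodes}:{policy.max_nodes}",
        f"--nproc-per-node={nproc_per_node}",
        "--rdzv-backend=c10d",
        f"--rdzv-endpoint={rdzv_endpoint}",
        f"--rdzv-id={rdzv_id}",
    ]


if __name__ == "__main__":
    # Endpoint and job ID are placeholders, not values Trainer would emit.
    policy = ElasticPolicy(min_nodes=2, max_nodes=5)
    print(" ".join(torchrun_args(policy, "node-0:29400", "example-job")))
```

The metrics section is deliberately left out of the sketch: how it would drive scaling between minNodes and maxNodes is exactly what the KEP needs to decide.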

Potentially, we should support elastic JobSet for that: kubernetes-sigs/jobset#463

cc @kubeflow/kubeflow-trainer-team

/area api
/area controller
/help

Why is this needed?

We should support elastic TrainJobs, as PyTorchJob did in Training Operator V1.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
