Open
Description
What you would like to be added?
As part of the Kubeflow Trainer API, we designed the ElasticPolicy API, which should allow users to run PyTorch in elastic mode: https://docs.pytorch.org/docs/stable/elastic/run.html
It would be nice if someone could drive a KEP for this and propose the API changes for the next release, Trainer v2.2.

    elasticPolicy:
      minNodes: 2
      maxNodes: 5
      metrics:
        - type: Resource
          resource:
            name: nvidia.com/gpu
            target:
              type: Utilization
              averageUtilization: 75

Potentially, we should support elastic JobSet for that: kubernetes-sigs/jobset#463
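
To make the node bounds concrete, here is a minimal sketch of how `minNodes` and `maxNodes` could map onto the `--nnodes=MIN:MAX` range that `torchrun` accepts in elastic mode. The `ElasticPolicy` dataclass and the `torchrun_nnodes_arg` helper are hypothetical names used only for illustration; they are not part of the Trainer API, and the actual mapping would be decided in the KEP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElasticPolicy:
    """Hypothetical mirror of the minNodes/maxNodes fields in the proposal."""
    min_nodes: int
    max_nodes: int


def torchrun_nnodes_arg(policy: ElasticPolicy) -> str:
    """Build the elastic node-range flag for torchrun (--nnodes=MIN:MAX).

    The validation rules below are assumptions, not confirmed API semantics.
    """
    if policy.min_nodes < 1:
        raise ValueError("minNodes must be at least 1")
    if policy.max_nodes < policy.min_nodes:
        raise ValueError("maxNodes must be greater than or equal to minNodes")
    return f"--nnodes={policy.min_nodes}:{policy.max_nodes}"


if __name__ == "__main__":
    # Values taken from the example policy in this issue.
    print(torchrun_nnodes_arg(ElasticPolicy(min_nodes=2, max_nodes=5)))
```

The `metrics` section is left out of the sketch on purpose: how GPU utilization would drive scaling between the two bounds is exactly what the KEP would need to define.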
cc @kubeflow/kubeflow-trainer-team
/area api
/area controller
/help
Why is this needed?
We should support Elastic TrainJobs like in Training Operator V1 PyTorchJob.
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.