Open
Description
This is the tracking issue for the Kubeflow Training V2 APIs.
GitHub project: https://github.com/orgs/kubeflow/projects/64
- Create the KEP: KEP-2170: Kubeflow Training V2 API #2171
- KEP-2170: Add TrainJob and TrainingRuntime APIs #2223
- KEP-2170: Create controller for TrainJob #2207
- KEP-2170: Create Kustomize manifests to deploy JobSet and TrainJob controllers #2208
- KEP-2170: Implement validations for TrainJob #2209
- KEP-2170: Implement validations for TrainingRuntime and ClusterTrainingRuntime #2219
- KEP-2170: Create PyTorch multi-node distributed training runtime #2211
- KEP-2170: Create dataset and model initializers #2210
- KEP-2170: Create LLM training runtime for Llama 3.1 8B #2212
- KEP-2170: Add PyTorch DDP MNIST training example #2387
- KEP-2170: Add E2E tests for Kubeflow Training V2 #2213
- KEP-2170: Update documentation for V2 APIs #2214
- KEP-2170: Design Kubeflow Python SDK for Training V2 #2216
- KEP-2170: Generate OpenAPI spec for V2 APIs #2215
- KEP-2170: Implement Job Pipeline Framework plugins #2290
- KEP-2170: Replace UPSERT operation for the objects with SSA PATCH #2297
JobSet improvements:
- Support for the execution policy API in JobSet kubernetes-sigs/jobset#672
- Support Stateful JobSet kubernetes-sigs/jobset#572
- PyTorch elastic support: Support Autoscaling replicatedJob kubernetes-sigs/jobset#570
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.
Status: In Progress