Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Use SSA to reconcile TrainJob components #2431

Merged
merged 17 commits into from
Feb 26, 2025

Conversation

astefanutti
Copy link
Contributor

@astefanutti astefanutti commented Feb 11, 2025

What this PR does / why we need it:

Replace Get/Create/Update logic with SSA to reconcile TrainJob components.

I've successfully tested it with https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb.

Depends on kubernetes-sigs/jobset#782.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #2297

Checklist:

  • Docs included if any changes are user facing

@astefanutti
Copy link
Contributor Author

/hold

Requires kubernetes-sigs/jobset#782

@astefanutti
Copy link
Contributor Author

astefanutti commented Feb 11, 2025

@andreyvelich @tenzen-y @varshaprasad96 please take a look when you can.

@astefanutti astefanutti force-pushed the pr-ssa branch 2 times, most recently from 8e002f8 to 465e443 Compare February 11, 2025 13:35
go.mod Outdated
@@ -19,7 +19,7 @@ require (
sigs.k8s.io/controller-runtime v0.19.1
sigs.k8s.io/jobset v0.5.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you specify main branch, and then let us check if it should work fine as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I assumed you preferred to wait until the upcoming JobSet release :).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current trainer development is under alpha stage. So, I don't care the library stability :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y @astefanutti Should we merge this PR once JobSet release v0.8.0 ?
I think, we are planning to release it this week: kubernetes-sigs/jobset#774

cc @kannon92 @ahg-g

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've replaced the JobSet API version to v0.8.0-devel for now.

I'm working on adapting the unit tests but the main controller should work fine.

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Feb 13, 2025
@astefanutti astefanutti force-pushed the pr-ssa branch 4 times, most recently from 0e92a8c to 16a000a Compare February 13, 2025 09:34
@astefanutti astefanutti force-pushed the pr-ssa branch 4 times, most recently from 03d6f54 to b2063e1 Compare February 14, 2025 19:01
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this.
My first question is could we avoid unstructured and then keep using client.Object typed?
I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430

go.mod Outdated
Comment on lines 3 to 5
go 1.23.0

toolchain go1.23.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use these toolchains? Who is bringing this? JobSet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the replace added to override the transitive dependencies coming from the newer JobSet version, go mod tidy keeps adding this. I'd need to dig into it more to understand why.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, interesting. Actually, Go < 1.23.4, they enforce to specify toolchain. But after the Go 1.23.4, the restrictions are relaxed. So, IIUC, this technically could be removed.

Or, we may just bump to Kube 1.32 as library. @andreyvelich Do you want to avoid bumping kube lib to 1.32 here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we should try to avoid toolchain here if that is possible.
Let's bump k8s to 1.32

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefanutti Should we bump k8s libs to 1.32 in this PR to remove toolchain ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich yes, do you prefer the upgrade to be done in this PR or separately?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's maybe first create PR to migrate to 1.32 if that is easier for you.

@astefanutti
Copy link
Contributor Author

Thanks for doing this. My first question is could we avoid unstructured and then keep using client.Object typed? I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430

Ultimately the client.Patch method converts the object into an unstructured object for SSA, because that's the only way to guarantee no default fields get introduced.

I can look at it again, but from my experience and understanding, relying on apply configuration -> unstructured is the most reliable and canonical way do to SSA.

@astefanutti
Copy link
Contributor Author

Note I still need to update the newly introduced MPI plugin to rely on apply configurations.

@tenzen-y
Copy link
Member

Thanks for doing this. My first question is could we avoid unstructured and then keep using client.Object typed? I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430

Ultimately the client.Patch method converts the object into an unstructured object for SSA, because that's the only way to guarantee no default fields get introduced.

I can look at it again, but from my experience and understanding, relying on apply configuration -> unstructured is the most reliable and canonical way do to SSA.

Thank you for checking that. If you think introducing the non unstructured objects with SSA has any potential "risks", please let me know. My understanding might be incorrect for SSA

@tenzen-y
Copy link
Member

Note I still need to update the newly introduced MPI plugin to rely on apply configurations.

Yeah, absolutely

@astefanutti
Copy link
Contributor Author

astefanutti commented Feb 17, 2025

Thanks for doing this. My first question is could we avoid unstructured and then keep using client.Object typed? I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430

Ultimately the client.Patch method converts the object into an unstructured object for SSA, because that's the only way to guarantee no default fields get introduced.
I can look at it again, but from my experience and understanding, relying on apply configuration -> unstructured is the most reliable and canonical way do to SSA.

Thank you for checking that. If you think introducing the non unstructured objects with SSA has any potential "risks", please let me know. My understanding might be incorrect for SSA

Using client.Object may work as an approximation to compute SSA patches for simpler objects, but it needs some tweaks like the following one that feel like work-arounds and that may not "scale" well for more complex objects like JobSets:
https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L509

I think the target should ultimately be to rely on the runtime.ApplyConfiguration interface that's going to be introduced: kubernetes/kubernetes#129313.

For now, I've moved the conversion to unstructured just before the call to client.Patch, but that'll go away once controller-runtime integrates runtime.ApplyConfiguration and exposes a first-class client.Apply method for SSA.

@astefanutti
Copy link
Contributor Author

/unhold

All the tests pass now.

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
@tenzen-y
Copy link
Member

@astefanutti Could you check commits signs?

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
@astefanutti
Copy link
Contributor Author

@astefanutti Could you check commits signs?

@tenzen-y should be good now, I need more coffee :)

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!
/lgtm
/approve
/hold cancel

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 3c6c90f into kubeflow:master Feb 26, 2025
14 checks passed
@astefanutti astefanutti deleted the pr-ssa branch February 26, 2025 08:44
manojks1999 pushed a commit to manojks1999/trainer that referenced this pull request Mar 2, 2025
* KEP-2170: Use SSA to reconcile TrainJob components

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Enable Unstructured caching in controller manager config

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Fix PodGroup apply configuration

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* API to apply config conversion util functions

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Only add namespace to TrainingRuntime object key

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Fix EnvVar upsert

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Update unit tests

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Fix JobSet resource requirements

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Resolve build issues with launcher job

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Use apply config for MPI ConfigMap and Secret

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* ComponentBuilderPlugin now returns an array

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Use plain apply configurations instead of unstructured

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Use apply config in EnforceMLPolicy plugins

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Do not update JobSets that are not suspended

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Address review feedback

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Do not update PodGroup if TrainJob is not suspended

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Remove obsolete TODO

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

---------

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: manojks1999 <9743manoj@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KEP-2170: Replace UPSERT operation for the objects with SSA PATCH
3 participants