-
Notifications
You must be signed in to change notification settings - Fork 772
Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count #2539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
322cb5b
to
1556321
Compare
/assign @kubeflow/wg-training-leads @astefanutti |
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
1556321
to
bcbe608
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating this @tenzen-y!
if trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.NumNodes != nil { | ||
numNodes = trainJob.Spec.Trainer.NumNodes | ||
info.FindPodSetByName(constants.JobTrainerNode).Count = trainJob.Spec.Trainer.NumNodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we are going to switch to the model with trainjob-ancestor-step
label, how are we going to find the appropriate PodSet ?
I guess, we still have to add ReplicatedJob name to the PodSet Pod name, but for MPI, we should assign count=NumNodes-1
for Node ReplicatedJob, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just add Ancestor
field to PodSet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, PodSet.Name
represnts rJob name.
@@ -50,7 +49,7 @@ type Info struct { | |||
} | |||
|
|||
type RuntimePolicy struct { | |||
MLPolicy *trainer.MLPolicy | |||
MLPolicySource *trainer.MLPolicySource | |||
PodGroupPolicy *trainer.PodGroupPolicy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rename it for PodGroupPolicySource for consistency ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you're right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, I think this PodGroupPolicy
should be fine since this MLPolicySource
introduction is the reason for dropping numNodes.
So, currently, I think this PodGroupPolicy
should be fine, IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tenzen-y!
Feel free to unhold
/lgtm
/approve
/hold
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thank you. |
What this PR does / why we need it:
I replaced the
runtime.Info.RuntimePolicy.MLPolicy
withruntime.Info.RuntimePolicy.MLPolicySource
so that we can store all rJob numNodes toruntime.Info.TemplateSpec.PodSet.Count
.Previously, we stored the non-Trainer numNodes to
runtime.Info.TemplateSpec.PodSet.Count
and stored the Trainer numNodes toruntime.Info.RuntimePolicy.MLPolicy.NumNodes
, which introduce internal data structure fragmentation.Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Part-of: #2495
Checklist: