KEP-4671: Gang Scheduling #5558
Conversation
Which one should we look at? Or this one?

This one - I already closed the original one with an appropriate comment. [This one contains the original and starts filling in the remaining sections.]

Sorry, I just saw the comment: #5545 (comment)
| - It is not a goal to take away the responsibility from controllers to create pods. | ||
| - It is not a goal to bring fairness or multiple workload queues into kube-scheduler. Kueue and Volcano.sh will continue to provide this. | ||
| - It is not a goal to be able to map all the declarative state and behaviors of all workloads into ths `Workload` object. It will focus on state that is relevant to kube-scheduler, and possibly to cluster autoscalers, reschedulers and closely related components. |
| - It is not a goal to be able to map all the declarative state and behaviors of all workloads into ths `Workload` object. It will focus on state that is relevant to kube-scheduler, and possibly to cluster autoscalers, reschedulers and closely related components. | |
| - It is not a goal to be able to map all the declarative state and behaviors of all workloads into the `Workload` object. It will focus on state that is relevant to kube-scheduler, and possibly to cluster autoscalers, reschedulers and closely related components. |
| - It is not a goal to take away the responsibility from controllers to create pods. | ||
| - It is not a goal to bring fairness or multiple workload queues into kube-scheduler. Kueue and Volcano.sh will continue to provide this. | ||
| - It is not a goal to be able to map all the declarative state and behaviors of all workloads into ths `Workload` object. It will focus on state that is relevant to kube-scheduler, and possibly to cluster autoscalers, reschedulers and closely related components. | ||
| - Introducing a resource reservation that can later hold pods. This feature seems desirable, and will be informed by experience gained from _Gang Scheduling woth using Workload Object_. |
Not sure what "Gang Scheduling woth using Workload Object" meant here. Typo most likely.
| - name: "pg1" | ||
| gangMode: Single | ||
| minCount: 100 | ||
| schedulingTimeoutSeconds: 60 |
Would it be better to use a `time.Duration` here?
`schedulingTimeoutSeconds` and `minCount` should be part of a `gangSchedulingPolicy`. The `GangSchedulingPolicy` type is not currently embedded in `PodGroup`.
It is part of the PodGroup, as far as I can see. Anyway, this is orthogonal to what I mentioned in my comment, no?
@ricardomaraschini - we have examples of both in the APIs; we will decide on that with the API approvers during API review. For now, I added a TODO in the type definition below.
@ingvagabund - updated to mention `gangSchedulingPolicy` in the specification - PTAL if that's what you meant.
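For context, the two conventions being weighed here might look roughly like this in Go; both field shapes are illustrative, not the final API:

```go
package api

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative only: two common Kubernetes API conventions for a timeout,
// shown side by side. Which one the Workload API adopts is a TODO for API review.
type GangSchedulingPolicyDraft struct {
	// Option A: plain integer seconds, as in the current example.
	SchedulingTimeoutSeconds *int32 `json:"schedulingTimeoutSeconds,omitempty"`

	// Option B: metav1.Duration, serialized as a string such as "60s".
	SchedulingTimeout *metav1.Duration `json:"schedulingTimeout,omitempty"`
}
```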
Yes. Thank you. There's also the current example that still has both fields at the podGroup level:

    spec:
      podGroups: # or gangGroups -- TBD
      - name: "pg1"
        gangMode: Single
        minCount: 100
        schedulingTimeoutSeconds: 60

which would become:

    spec:
      podGroups: # or gangGroups -- TBD
      - name: "pg1"
        gangMode: Single
        gangSchedulingPolicy:
          minCount: 100
          schedulingTimeoutSeconds: 60
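A rough Go shape of that restructuring, following the field names in the YAML above (none of these are final API names):

```go
package api

// PodGroup sketch with the gang-specific knobs grouped under GangSchedulingPolicy,
// mirroring the restructured YAML example above. Names are not final.
type PodGroup struct {
	// Name identifies the group within the Workload.
	Name string `json:"name"`

	// GangMode selects how the group is gang scheduled (e.g. Single).
	GangMode string `json:"gangMode,omitempty"`

	// GangSchedulingPolicy carries the gang-scheduling parameters.
	GangSchedulingPolicy *GangSchedulingPolicy `json:"gangSchedulingPolicy,omitempty"`
}

// GangSchedulingPolicy groups the parameters that previously sat directly on PodGroup.
type GangSchedulingPolicy struct {
	// MinCount is the number of pods that must be schedulable together.
	MinCount int32 `json:"minCount"`

	// SchedulingTimeoutSeconds bounds how long the scheduler waits for the whole gang.
	SchedulingTimeoutSeconds int32 `json:"schedulingTimeoutSeconds,omitempty"`
}
```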
| ## Summary | ||
| In this KEP, kube-scheduler is modified to support gang scheduling[^1]. To implement gang scheduling, kube-scheduler identifies pods that are in a group and waits until all pods reach the same stage of the scheduling/binding cycle before allowing any pods from the group to advance past that point. If not all pods can reach that point before a timeout expires, then the scheduler stops trying to schedule that group, and all pods release all their resources. This allows other workloads to try to allocate those resources. |
As a Workload may contain multiple groups, what is the expected kube-scheduler behavior when we have reached the desired number of pods for a single group but not for all of them? Do we keep the whole Workload from scheduling?
Each group has its own timeout, so I presume each group will eventually time out. So if one group times out, is there any point in keeping the others around? One could get into a case where one group times out while another one is getting retried, again and again.
"then the scheduler stops trying to schedule that group"
Will there be any retry when a group times out and the resources get released? Or, does it mean the pods enter a failed state?
The way we think about individual groups (gangs and/or gang-replicas) is that they are independent from each other. They all form a workload, but can operate independently from each other.
So if we can schedule one group but not the other - this is still fine, we should schedule and run it.
If the underlying intention is that you really need all groups or none of them, then they are no longer separate groups - they should form a single gang.
[The structure will become more important in future extensions, where e.g. we may want to schedule multiple gangs topologically close to each other. But that is not part of this KEP.]
I added a paragraph about it to the API section.
@wojtek-t consider expanding the summary section to help readers distinguish the gang and workload concepts.
"Will there be any retry when a group times out and the resources get released? Or, does it mean the pods enter a failed state?"
Currently a group will be continuously retried, and in this or the following KEPs the workload status will be set to "unschedulable". Individual pods may get the unschedulable status as well, but only those which failed to schedule in a given attempt.
Long term we don't expect workload pods to become unschedulable, as we plan to propose introducing a new workload-scheduling phase which will produce a proposed placement (or get a proposed placement from an external workload scheduler). If pods still fail to schedule, it means the proposed placement was wrong and needs to be modified.
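Taken together, the semantics discussed in this thread amount to a per-group barrier with a timeout: every pod in the gang waits at the same point, and if the gang is not complete in time the attempt is rejected so resources can be released and the group retried. A minimal, self-contained Go sketch of that behavior (a toy model, not the kube-scheduler implementation):

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// gangBarrier models the wait-or-release semantics: pods wait at the same
// point; if minCount arrivals are not reached before the timeout, the whole
// attempt is rejected so resources can be released and the group retried.
type gangBarrier struct {
	minCount int
	arrived  int
	mu       sync.Mutex
	ready    chan struct{} // closed once minCount pods have arrived
}

func newGangBarrier(minCount int) *gangBarrier {
	return &gangBarrier{minCount: minCount, ready: make(chan struct{})}
}

// arrive blocks until the gang is complete or the timeout expires.
// It returns true if the pod may proceed past this point (e.g. to binding).
func (b *gangBarrier) arrive(ctx context.Context, timeout time.Duration) bool {
	b.mu.Lock()
	b.arrived++
	if b.arrived == b.minCount {
		close(b.ready) // the last arriving pod releases everyone
	}
	b.mu.Unlock()

	select {
	case <-b.ready:
		return true
	case <-time.After(timeout):
		return false // gang incomplete: reject, release resources, retry later
	case <-ctx.Done():
		return false
	}
}

func main() {
	b := newGangBarrier(3)
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ok := b.arrive(context.Background(), 60*time.Second)
			fmt.Printf("pod %d allowed to proceed: %v\n", i, ok)
		}(i)
	}
	wg.Wait()
}
```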
| // WorkloadSpec describes a workload in a portable way that scheduler and related | ||
| // tools can understand. | ||
| type WorkloadSpec struct { | ||
| // Optional but recommended to set. |
| // Optional but recommended to set. | |
| // ControllerRef points to the true workload, e.g. Deployment. It is optional | |
| // and it is intended to make this mapping easier for things like CLI tools. |
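In context, the suggested wording would sit on the field roughly like this (the `ControllerRef` type shape is an assumption, and other WorkloadSpec fields are elided):

```go
package api

// WorkloadSpec describes a workload in a portable way that scheduler and related
// tools can understand. Other fields are elided in this sketch.
type WorkloadSpec struct {
	// ControllerRef points to the true workload, e.g. Deployment. It is optional
	// and it is intended to make this mapping easier for things like CLI tools.
	// +optional
	ControllerRef *ControllerRef `json:"controllerRef,omitempty"`
}

// ControllerRef identifies the controller object that owns this workload.
// The exact field set here is an assumption for illustration.
type ControllerRef struct {
	APIGroup string `json:"apiGroup"`
	Kind     string `json:"kind"`
	Name     string `json:"name"`
}
```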
64c1d92 to 820c6ea (Compare)
/assign @sanposhiho @dom4ha
I think this is pretty much ready for Alpha review, PTAL
| * Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named | ||
| * In a future , we expect that this group can optionally specify further subdivision into sub groups. Each sub-group can have an index. The indexes go from 0 to N, without repeats or gaps. These subgroups are called | ||
| * In subsequent KEPs, we expect that a sub-group can optionally specify further subdivision into pod equivalence classes. All pods in a pod equivalence class have the same values for all fields that affect scheduling feasibility. These pod equivalence classes are called |
You accidentally removed the names for these three
Oops - fixed now.
| * `Workload` is the resource Kind. | ||
| * `scheduling` is the ApiGroup. | ||
| * `spec.workload` is the name of the new field in pod. | ||
| * Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named |
| * Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named | |
| * Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named `PodGroup`. |
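To make the naming above concrete, the pod side of the link (the new `spec.workload` field) could carry something like the following; the exact field set is an assumption for illustration:

```go
package api

// WorkloadReference sketches what a pod's spec.workload field could carry:
// which Workload object the pod belongs to and which PodGroup within it.
// Field names are illustrative, not the final API.
type WorkloadReference struct {
	// Name of the Workload object in the pod's namespace.
	Name string `json:"name"`

	// PodGroup is the name of the PodGroup within the Workload this pod belongs to.
	PodGroup string `json:"podGroup"`
}
```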
ahg-g left a comment:
This is great! I left a couple of clarifying comments.
| - Implement the first version of `Workload` API necessary for defining a Gang | ||
| - Ensuring that we can extend `Workload` API in backward compatible way toward north-star API | ||
| - Ensuring that `Workload` API will be usable for both built-in and third-party workload controllers and APIs | ||
| - Implement first version of gang-scheduling in kube-scheduler |
What scheduling constraints will or will not be supported? Pod affinity/anti-affinity, node affinity/selector, pod topology spread? If all of them, then I would mention that explicitly.
Yes - we want to support all existing features - though potentially not in an optimal way yet.
Added.
soltysh left a comment:
A few comments; the biggest one, where I'd like more clarity, is about the workload-inferring capabilities, which I'd like to get clarified before we move forward with this.
| valueFrom: | ||
| fieldRef: | ||
| fieldPath: | ||
| "metadata.annotations['batch.kubernetes.io/job-completion-index']" |
| Moreover, the visibility into issues (debuggability) will depend on [#5510], but we don't | ||
| treat it as a blocker. | ||
| [#5510]: https://github.com/kubernetes/enhancements/pull/5510 |
Nit: it's better to use the KEP tracking issue rather than a particular PR, so #5501
Fixed
| // PodGroupPolicy defines scheduling configuration of a PodGroup. | ||
| type PodGroupPolicy struct { | ||
| // Exactly one of the policies should be set. |
Nit: from an API perspective, you'll want to add a discriminator field expressing which policy is at play. That's what we've been doing with all the union-like types.
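The union-with-discriminator convention referred to here would look roughly like this (member and constant names are illustrative, not part of the KEP):

```go
package api

// PodGroupPolicyType is the discriminator naming which union member is set.
type PodGroupPolicyType string

const (
	// GangPolicyType selects the gang-scheduling member.
	GangPolicyType PodGroupPolicyType = "Gang"
)

// PodGroupPolicy is a union-like type: Type names the active member and exactly
// one matching pointer field is set, mirroring the discriminator pattern used
// elsewhere in Kubernetes APIs. Field names are illustrative.
type PodGroupPolicy struct {
	// Type is the discriminator; the member it names must be set.
	// +unionDiscriminator
	Type PodGroupPolicyType `json:"type"`

	// Gang holds the gang-scheduling parameters when Type is "Gang".
	// +optional
	Gang *GangSchedulingPolicy `json:"gang,omitempty"`
}

// GangSchedulingPolicy as sketched earlier in this discussion.
type GangSchedulingPolicy struct {
	MinCount                 int32 `json:"minCount"`
	SchedulingTimeoutSeconds int32 `json:"schedulingTimeoutSeconds,omitempty"`
}
```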
| will be updated to create an appropriate `Workload` objects themselves whenever they can appropriately infer | ||
| the intention from the desired state. | ||
| Note that given scheduling options are stored in the `Workload` object, pods linked to the `Workload` | ||
| object will not be scheduled until this `Workload` object is created and observed by the kube-scheduler. |
I'm still having a hard time with these ideas. If you control the whole Job object, why would you want to differentiate between those two fields (one at the job spec level, and the other at the pod template level), which, as the author of a job, you have full control of? What's the value of duplicating this information?
| will be updated to create an appropriate `Workload` objects themselves whenever they can appropriately infer | ||
| the intention from the desired state. | ||
| Note that given scheduling options are stored in the `Workload` object, pods linked to the `Workload` | ||
| object will not be scheduled until this `Workload` object is created and observed by the kube-scheduler. |
I'm inclined to push this such that, for alpha, you only leave the initial part talking about the user-managed Workload resource.
| 1. Ensure that pods being part of a gang are not bound if all pods belonging to it can't be scheduled. | ||
| 2. Provide the "optimal enough" placement by considering all pods from a gang together. | ||
| 3. Avoid deadlock scenario when multiple workloads are being scheduled at the same time by kube-scheduler. | ||
| 4. Avoid deadlock scenario when multiple workloads are being scheduled at the same time by different |
Any particular reason you have this element mentioned twice? Why can't you just write this as:
| 4. Avoid deadlock scenario when multiple workloads are being scheduled at the same time by different | |
| 4. Avoid deadlock scenario when multiple workloads are being scheduled at the same time by any scheduler (kube-scheduler, or third-party provided scheduler). |
Yes - the reason is that these problems have somewhat different solutions and will be resolved in different timeframes (as described below - addressing (3) is a requirement for Beta, but addressing (4) will require a lot of follow-up work (a reservations-like approach)).
Because of that I'm going to leave it as is, to make referencing those problems below easier.
| will be updated to create an appropriate `Workload` objects themselves whenever they can appropriately infer | ||
| the intention from the desired state. | ||
| Note that given scheduling options are stored in the `Workload` object, pods linked to the `Workload` | ||
| object will not be scheduled until this `Workload` object is created and observed by the kube-scheduler. |
Also, in the long run, does that mean we'll be extending all of the built-in controllers with similar changes? I'm having a hard time supporting this kind of addition across the board.
| ###### How can this feature be enabled / disabled in a live cluster? | ||
| - [X] Feature gate (also fill in values in `kep.yaml`) | ||
| - Feature gate name: Workload/GenericWorkload/NativeWorkload |
kep.yaml talks about Workload; it would be nice to pick one and use it consistently across all places 😉
Switched to GenericWorkload for now - although we may adjust the name during the coding.
| will rollout across nodes. | ||
| --> | ||
| ###### What specific metrics should inform a rollback? |
Although not required, have you considered what kind of metrics you could expose for this functionality and where? I'm thinking about exposing gang-related metrics in the scheduler, as the most appropriate place to track this.
I don't expect any particular metrics for the Workload API itself (we should rather rely on the standard kube-apiserver metrics for that).
But for the GangScheduling feature, we absolutely need to expose some metrics. The two primary ones I have in mind are:
- error rate - the number of times we need to reject a placement that was already computed because of not being able to satisfy the whole gang
- e2e latency for the whole gang
I just didn't add them from the beginning as we may want to tweak them somehow.
I was thinking primarily about the gang scheduling feature, since, like most of our scheduling code, it's hard to follow from the outside. I like both of the proposed metrics. Now I only need to remember that you wrote them here for beta 😅
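As a rough idea of how the two metrics proposed above could be exposed in the scheduler; the metric names, labels, and buckets here are placeholders, not something the KEP commits to:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical gang-scheduling metrics along the lines discussed above.
var (
	// Counts placements that had to be rejected because the full gang could
	// not be satisfied (the "error rate" metric).
	gangPlacementRejections = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "scheduler_gang_placement_rejections_total",
			Help: "Number of computed placements rejected because the whole gang could not be satisfied.",
		},
		[]string{"reason"},
	)

	// Measures end-to-end scheduling latency for a whole gang.
	gangSchedulingDuration = prometheus.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "scheduler_gang_e2e_scheduling_duration_seconds",
			Help:    "End-to-end scheduling latency for an entire gang.",
			Buckets: prometheus.ExponentialBuckets(0.01, 2, 16),
		},
	)
)

func init() {
	prometheus.MustRegister(gangPlacementRejections, gangSchedulingDuration)
}
```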
wojtek-t left a comment:
PTAL
| valueFrom: | ||
| fieldRef: | ||
| fieldPath: | ||
| "metadata.annotations['batch.kubernetes.io/job-completion-index']" |
@erictune - I think you're referring to a different thing. There are two things:
1. Having `ControllerRef` inside the Workload object (part of this KEP)
2. Adding `workloadRef` in the Job API (part of KEP-5547: Add workloadRef in the Job API #5548 - @helayoty KEP)
Your comment above shows why (1) makes sense. I think it's not critical, but I also see value in it. But that's not the thing that is being questioned in this comment thread.
What is being questioned here is (2), because, as pointed out by @soltysh (which I agreed with), introducing workloadRef in the Job API is a duplication of information: the exact same thing can already be expressed through the WorkloadReference in the PodTemplateSpec.
So we will already have a way to get the information without #5548:
(a) From Workload to Job - using the ControllerRef field that is part of WorkloadSpec
(b) From Job to Workload - using the job.spec.podTemplate.WorkloadReference field
love it now
/approve
/lgtm
Thanks for all the effort to kick off this initiative, I'm super excited to see it finally coming!
I left one comment, but that is on the beta discussion section. I know that is not supposed to be discussed deeply in this initial alpha KEP, and that's why I actually put the stamp.
| can process all gang pods together. The single scheduling cycle and blocking resources in beta | ||
| will address the requirement (3). | ||
| We will also introduce delayed preemption by moving it after `WaitOnPermit` phase. Together with |
What if a pod deletion takes long? The current scheduler mitigates it by just letting a preemptor pod go anywhere else (if there's an empty space, probably made by some other pod's termination) while the deletion isn't completed.
That's a great question - I believe we will need a similar mechanism here too. This becomes problematic in the case of cross-pod dependencies within a gang (e.g. collocation), so it requires putting more thought into it. Let's get back to this question after Alpha.
WaitOnPreemption would have its own timeout, after which waiting pods would be rejected, so the preemption process could start over after a possible workload rescheduling first (if the scheduler has any new information).
/hold
To block the merge until @soltysh approves for PRR.
| status: implementable | ||
| creation-date: 2025-09-17 | ||
| reviewers: | ||
| - TBD |
I guess we just put me only?
| owning-sig: sig-scheduling | ||
| participating-sigs: | ||
| - sig-apps |
Is there anyone from sig-apps here? Do we need a stamp from them? Or is that not mandatory?
I killed two birds with one stone by using @soltysh - he is wearing both PRR hat and the SIG apps hat :)
Indeed I wear many hats 😅
soltysh left a comment:
/approve
the PRR
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sanposhiho, soltysh, wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Cancelling hold based on above approvals.
/hold cancel
Thanks everyone for your effort! I am super excited to see this moving forward 🚀
| // PodGroup defines the name of the PodGroup within a Workload this pod belongs to. | ||
| PodGroup string | ||
| // PodGroupReplicaIndex is the replica index of the PodGroup that this pod | ||
| // belong to when the workload is running ReplicatedGangMode. In this mode, |
s/ReplicatedGangMode/GangModeReplicated
First version of the Gang Scheduling KEP