[KEP-5710]: Workload-aware preemption KEP #5711
Conversation
> 1. Identify the list of potential victims:
>    - all running workloads with (preemption) priority lower than the new workload W
>    - all individual pods (not being part of workloads) with priority lower than the new workload W
Having two independent priorities for a workload - one for scheduling and one for preemption - or a single preemption priority that can be dynamically updated, can potentially lead to a preemption cycle.
Let's assume that we have an existing workload A with high scheduling priority and low preemption priority running in a cluster.
Now let's assume that we want to schedule a workload B which has medium scheduling priority and medium preemption priority.
Workload B will preempt workload A and start to run, because its scheduling priority > preemption priority of workload A.
However, when workload A restarts and is rescheduled, it will preempt workload B and start to run, because its scheduling priority > preemption priority of workload B.
The same issue can happen if we have only one priority but it is reduced while the workload is running. After preemption, when the workload reappears with its original, higher priority, it can preempt the workload that preempted it.
One potential solution / mitigation to the described problem could be stating that preemption priority >= scheduling priority. This way, after restarting, the preempted workload would not be able to preempt the preemptor workload.
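For illustration, such an invariant could be enforced with a simple validation check along these lines (the type and field names are made up for this sketch, not proposed API):

```go
package validation

import "fmt"

// Illustrative only: field names are assumptions for this sketch.
type WorkloadPriorities struct {
	SchedulingPriority int32
	PreemptionPriority int32
}

// ValidatePriorities rejects configurations that could lead to preemption
// cycles: a workload must never be "cheaper" to preempt (lower preemption
// priority) than it is aggressive at preempting others (scheduling priority).
func ValidatePriorities(p WorkloadPriorities) error {
	if p.PreemptionPriority < p.SchedulingPriority {
		return fmt.Errorf("preemptionPriority (%d) must be >= schedulingPriority (%d)",
			p.PreemptionPriority, p.SchedulingPriority)
	}
	return nil
}
```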
Thanks for pointing that out!
Yeah - "preemption priority >= scheduling priority" is definitely desired. I don't think we have any use cases that would benefit from the reverse.
That said, I need to think a bit more about whether that is enough. I think it prevents the cycles if we assume static priorities, but it can still potentially trigger cycles if the priorities change. OTOH, if the priorities are changing, this is probably desired.
Let me think about it a bit more and I will update the KEP to reflect the thoughts later this week.
OK - I have added an unresolved section about that to the Workload priorities section above describing the problem, potential solution and alternatives. Let's continue the discussion there.
/assign
erictune
left a comment
Great to see this, and I like how it is decoupled from the other work planned for 1.36.
> can't reprieve any of those, learning about that would require O(N) full workload schedulings
> with N being number of workload/pods violating PDB.
> <<[/UNRESOLVED]>>
> ```
Let's assume that nodes have either a high pods-per-node count or a low pods-per-node count. It's a bimodal distribution.
Let's further assume that if gang scheduling is used, then the node is usually going to be low pods-per-node count.
So, then we can do the following:
- Individual Pod as preemptor - assume high pods-per-node, use the current algorithm, which is optimized for many pods per node, and consider all victims.
- Gang as preemptor - assume low pods-per-node in all cases, consider a maximum of e.g. 4 reprieves per node to keep compute time down, and just stop reprieving in the case where there are more things on the node.
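For concreteness, one way to read that suggestion is roughly the following (the cap of 4 comes from the comment above; the helper names and shapes are made up):

```go
// Illustrative sketch of "gang as preemptor": attempt at most a few reprieves
// per node and treat everything else as a victim, trading optimality for
// bounded compute time.
const maxGangReprievesPerNode = 4

// victims is sorted by descending priority; canReprieve reports whether the
// node stays feasible for the gang if the given victim is kept in place.
func selectVictimsForGang(victims []string, canReprieve func(victim string) bool) []string {
	finalVictims := make([]string, 0, len(victims))
	for i, v := range victims {
		if i < maxGangReprievesPerNode && canReprieve(v) {
			continue // reprieved: keep this pod, it is not a victim
		}
		finalVictims = append(finalVictims, v)
	}
	return finalVictims
}
```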
Every split in the algorithm/code path makes it harder to reason about. This is why I'm trying to avoid that whenever possible.
Additionally, while I agree with you that it will be true in the majority of cases, there are definitely use cases where people run gang workloads with many pods per node. So in my opinion the split as proposed could potentially result in decisions that would be really far from the optimal ones.
In the spirit of trying to simplify and unify stuff as much as possible, I actually adjusted the algorithm so that we can have a single scheme that addresses all four use cases that we have. I think this is a much better option.
PTAL
LGTM
/cc
> ```
> <<[UNRESOLVED delayed preemption]>>
> Should we leave it as part of this KEP or should this be moved to the Gang-Scheduling one?
I believe it should be moved to another KEP; I feel that it is completely independent of workload-aware preemption and can work with just the current preemption + gang scheduling.
The rationale behind having it here was that it also serves the goal of "reducing disruptions".
I think there are two primary options:
- keep it here and reference from "workload KEP"
- move it to "workload KEP" and reference from here
I'm happy with either option, based on the preference of the majority.
> 1. From remaining potential victims, we start to reprieve pods starting from the highest priority
>    and working down until the set of remaining victims still keeps the node feasible.
>
> Once we compute the feasibility and list of victims for all nodes, we score that and choose the
Nit: it's possible that we will not do that for all nodes in the cluster. We find feasible nodes until we have max(numNodes * 0.1, 100) nodes from which to choose: https://github.com/kubernetes/kubernetes/blob/ec1bf8a4f3a5f054065225dc8275c66b93310d17/pkg/scheduler/framework/preemption/preemption.go#L363-L364
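For readers not familiar with that code, the linked logic is roughly the following (10% and 100 are the current defaults; this is a simplification of the real function):

```go
// Roughly how kube-scheduler bounds the number of candidate nodes considered
// for preemption: about max(numNodes * 10%, 100), never more than numNodes.
func numCandidateNodes(numNodes, minPercentage, minAbsolute int32) int32 {
	n := numNodes * minPercentage / 100
	if n < minAbsolute {
		n = minAbsolute
	}
	if n > numNodes {
		n = numNodes
	}
	return n
}
```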
Good catch - updated (although I don't think it changes anything for this particular proposal).
Probably not for the initial implementation, but it's worth keeping in mind once we look into the scalability of workload preemption.
> - all running workloads with (preemption) priority lower than the new workload W
> - all individual pods (not being part of workloads) with priority lower than the new workload W
>
> 1. If removing all the potential victims would not make the new workload W schedulable,
I think we should point out that this depends on workload aware scheduling which is not yet implemented and is planned for 1.36.
> 1. If removing all the potential victims would not make the new workload W schedulable,
>    the workload is unschedulable even with preemption.
>
> ```
Nit: you need to indent this "code block" to keep the numbering continuous.
> 1. Identify the list of potential victims:
>    - all running workloads with (preemption) priority lower than the new workload W
>    - all individual pods (not being part of workloads) with priority lower than the new workload W
What if there is a workload and an individual pod, where only one is needed to make the new workload schedulable? Which one will be chosen?
I think we should choose the pod, but I don't have a super strong preference. I added a point about sorting to reflect that, but I'm happy to take any suggestions there.
I guess if they have the same priority then: single pod > pod from a workload with gang-preemptable = false > workload with gang-preemptable = true?
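That ordering could be expressed as a tie-breaking comparator, roughly like this (struct and helper names are made up for the sketch):

```go
// Illustrative victim ordering: lower priority first; among equal priorities,
// prefer single pods, then pods from non-gang-preemptable workloads, then
// whole gang-preemptable workloads.
type victim struct {
	priority        int32
	partOfWorkload  bool
	gangPreemptable bool
}

func rank(v victim) int {
	switch {
	case !v.partOfWorkload:
		return 0 // individual pod
	case !v.gangPreemptable:
		return 1 // pod from a workload with gang-preemptable = false
	default:
		return 2 // whole workload with gang-preemptable = true
	}
}

// lessVictim reports whether a should be preempted before b.
func lessVictim(a, b victim) bool {
	if a.priority != b.priority {
		return a.priority < b.priority
	}
	return rank(a) < rank(b)
}
```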
> 1. Extend `SchedulingFramework` with two new steps: `RunGetResourcesPlugins` and
>    `WaitForGetResources`. These will be called immediately after `WaitOnPermit` phase and
>    before running `RunPreBindPlugins`. The `RunGetResourcesPlugins` will simply be calling
>    `GetResources` methods from all plugins implementing it. And `WaitForGetResources` will
>    work similarly to `WaitOnPermit`, serving as a barrier to ensure all the resources are
>    already available to use. The implementation will work similarly to `WaitOnPermit` to
>    ensure that `GetResources` was executed for all pods from within a `PodGroup`.
How will the preemption targets be released when we end up not running the RunGetResourcesPlugins? For example, when a gang turns out to be unschedulable.
That's a very good question. I think we want something conceptually similar to the "Reserve/Unreserve" pattern from DRA.
So the scheduling phase will effectively serve as the "reserve" phase, and we will have a sibling "unschedule" method that will be able to re-assume the victims.
It requires some description though.
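Very roughly, the plugin extension plus such an "unschedule"-style counterpart could look as follows (only `GetResources` appears in the quoted KEP text; the release hook and its signature are assumptions for this sketch):

```go
package plugins

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// GetResourcesPlugin would run after WaitOnPermit and before PreBind,
// acquiring the resources the pod needs (e.g. actuating previously computed
// preemptions).
type GetResourcesPlugin interface {
	framework.Plugin
	GetResources(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status

	// ReleaseResources is a hypothetical counterpart, invoked when the gang
	// turns out to be unschedulable, re-assuming the nominated victims.
	ReleaseResources(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string)
}
```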
> We need to look at the cluster as a whole. With that in mind, keeping the algorithm efficient
> becomes a challenge, thus we modify to the approach below.
>
> To check if a workload W can be scheduled on a given cluster with preemption we:
Shouldn't we talk about a "gang pod group" rather than a "workload"?
I don't have a strong opinion here - let me change it.
Do we want to add as a part of this KEP a description of how the preemption fits into the workload-aware scheduling (codewise)? Or do we want it the other way around, and have the KEP for workload-aware scheduling reference this one when talking about preemption? In the gang scheduling KEP we talk about adding a "Workload" phase where we will end up with pods from a Gang with nominated node names. I assume that this preemption will be a part of this phase. The open question is what will actually be the outcome of the preemption:
> As part of minimizing preemptions goal, arguably the most important thing to do is to avoid unnecessary
> preemptions. However, this is not true for the current gang scheduling implementation.
> In the current implementation, preemption is triggered in the `PostFilter`. However, it's entirely
So the reasoning here is that we want delayed preemption because it helps with the current gang scheduling implementation. But I believe that in this doc we could actually describe why we need it in terms of workload preemption, and IIUC this is to have an option to run workload preemption as part of the workload scheduling without immediately actuating the preemptions.
I added this also in a PR discussion: I think it would be beneficial to have a section on what the outcome of workload preemption will be and, if it does not actuate the preemptions, what actually will do that.
> So the reasoning here is that we want delayed preemption because it helps with the current gang scheduling implementation. But I believe that in this doc we could actually describe why we need it in terms of workload preemption, and IIUC this is to have an option to run workload preemption as part of the workload scheduling without immediately actuating the preemptions.
Great point - I updated this paragraph to reflect that.
> I added this also in a PR discussion: I think it would be beneficial to have a section on what the outcome of workload preemption will be and, if it does not actuate the preemptions, what actually will do that.
I hope that an updated KEP for gang scheduling describing the workload scheduling phase will be opened pretty soon, and I will be able to just link to it here :)
@macsko ^^
> 1. New field in the workload object (delayed preemption will not bring much value in
>    case of scheduling individual pods, though there would be significant benefit from
>    unification, so probably this isn't ideal option).
> 1. Storing it in private kube-scheduler' structures (PodInfo for individual pods and
This does not allow external schedulers to use the same concept for victims nomination.
I would like to keep external schedulers out of scope for now - added explicitly to the non-goals section.
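For reference, the second quoted option (private kube-scheduler structures) could be as small as something like the following (the names are made up for this sketch):

```go
package internal

import "k8s.io/apimachinery/pkg/types"

// NominatedVictims is an illustrative, scheduler-internal record of which
// pods a preemptor (an individual pod or a whole workload) has nominated
// for preemption before the preemption is actuated.
type NominatedVictims struct {
	// Victims maps a node name to the UIDs of pods nominated for deletion
	// on that node.
	Victims map[string][]types.UID
}
```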
wojtek-t
left a comment
I tried to address most of the comments; I will try to respond to/address the remaining ones later today/tomorrow.
> - workload C has scheduling priority `low` but preemption cost `high`
> In such case, the preemption cost would result in choosing workload B for preemption. But
> if it gets recreated, it will preempt workload C causing unnecessary cascading preemption.
> This is the reason why a cost-based model was discarded.
@erictune - I thought a bit more about the idea of "preemption priority" vs "preemption cost" that we chatted about offline.
I acknowledge the deficiencies of the currently proposed model, but I think that switching to preemption cost and a purely scoring-based approach will not prevent cascading preemptions, which we should really try to avoid.
I tried to update the KEP to reflect that - PTAL and I'm happy to chat more about it.
dom4ha
left a comment
Overall this is what I was thinking of as well. The major change that this approach brings is that we can no longer say which pod determines victims, but rather which Workload/PodGroup determines them.
> <<[/UNRESOLVED]>>
> ```
>
> 1. For remaining potential victims, using binary search across priorities find the minimal priority P
I think we need to add one more step before we could consider this feature beta.
Once we've identified the minimum priority P, we should try to reschedule all victims again and keep those that were in fact not affected. I can't imagine preempting workloads that are in fact not affected. So we need to introduce a concept of "workload rescheduling", with a basic implementation that just checks whether a workload fits in its current place or not.
There is another step which we could indeed consider an optimization (not a beta blocker). We can do a reversed binary search over workloads at the same priority (we need some secondary workload importance order) and try to schedule the new workload while leaving as many existing workloads as possible in place.
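To make those steps a bit more concrete, here is a heavily simplified sketch (all helpers are placeholders; the real algorithm operates on whole workloads and a cluster snapshot):

```go
package sketch

import "sort"

// minimalVictimPriority binary-searches for the minimal priority P such that
// removing all potential victims with priority <= P makes workload W fit.
// fitsWithout(p) must be monotone: if W fits after removing victims with
// priority <= p, it also fits for any larger p.
func minimalVictimPriority(priorities []int32, fitsWithout func(p int32) bool) (int32, bool) {
	sort.Slice(priorities, func(i, j int) bool { return priorities[i] < priorities[j] })
	i := sort.Search(len(priorities), func(i int) bool { return fitsWithout(priorities[i]) })
	if i == len(priorities) {
		return 0, false // W does not fit even if all potential victims are removed
	}
	// A follow-up pass (the "workload rescheduling" suggested above) would then
	// walk the victims below the cut, highest priority first, and reprieve the
	// ones whose removal turns out not to be needed.
	return priorities[i], true
}
```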
One note: I think in the scheduler codebase and this KEP we use the term reprieve to say that we want to keep a pod nominated for preemption in its original place. In my mind, rescheduling would mean trying to find another place for this pod.
> The following algorithm is by far not optimal, but is simple to reason about and I would suggest it as
> a starting point:
> - assume that all potential victims on the list are removed and schedule the new workload W
> - go over the remaining potential victims starting from the highest priority and check if these can
I'd do it only once we have determined the minimum priority P, and then try to reschedule existing workloads, but not in every iteration. See my separate comment about it.
I believe Wojtek and I agreed that this was not an alternative but an additional enhancement to the original algorithm. After the updates to the KEP, it's in the algorithm.
wojtek-t
left a comment
I have restructured the preemption algorithm here - I think that unifies and simplifies a lot of things.
erictune
left a comment
LGTM
> preemption groups, but we leave that usecase as a future extension (it can be addressed when we
> decide to extend Workload API with PodSubGroup concept - for more details see
> [API Design For Gang and Workload-Aware Scheduling]). However, we never expect preemption unit to
> be larger than scheduling unit.
Perhaps define "Workload Portion" as one of {all pods in a Workload, all pods in a PodGroup replica, a single Pod}.
Then define schedulingUnit and preemptionUnit as being Workload Portions.
This makes it clear that, for this KEP, we don't intend to support arbitrary lists of pods as scheduling or preemption units.
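Something along these lines could capture that (naming is purely illustrative, not proposed API):

```go
// Illustrative only: a closed set of "Workload Portions" that may act as a
// scheduling or preemption unit in this KEP, rather than arbitrary pod lists.
type WorkloadPortion string

const (
	WholeWorkload   WorkloadPortion = "Workload"        // all pods in the Workload
	PodGroupReplica WorkloadPortion = "PodGroupReplica" // all pods in one PodGroup replica
	SinglePod       WorkloadPortion = "Pod"             // a single Pod
)

// Both units are Workload Portions; the preemption unit is never expected to
// be larger than the scheduling unit.
type SchedulingUnit = WorkloadPortion
type PreemptionUnit = WorkloadPortion
```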
> We will address it with what we call `delayed preemption` mechanism as following:
>
> 1. We will modify the `DefaultPreemption` plugin to just compute preemptions, without actuating those.
... and we advise maintainers of custom PostFilter implementations to do the same.
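A condensed sketch of that compute-vs-actuate split (the storage map and the eviction callback are placeholders, not the actual plugin wiring):

```go
package sketch

import "fmt"

// nomination is the result of the "compute only" preemption pass.
type nomination struct {
	node    string
	victims []string // pod keys nominated for preemption, not yet deleted
}

// postFilter computes victims but does not delete them (delayed preemption).
func postFilter(pod string, compute func(pod string) (nomination, bool), store map[string]nomination) bool {
	n, ok := compute(pod)
	if !ok {
		return false // unschedulable even with preemption
	}
	store[pod] = n // remember the nomination; actuation happens later
	return true
}

// getResources actuates the previously computed preemptions once the whole
// gang is known to be schedulable; deletePod stands in for the eviction call.
func getResources(pod string, store map[string]nomination, deletePod func(victim string) error) error {
	n, ok := store[pod]
	if !ok {
		return nil // nothing to actuate for this pod
	}
	for _, v := range n.victims {
		if err := deletePod(v); err != nil {
			return fmt.Errorf("actuating preemption for %s: %w", pod, err)
		}
	}
	delete(store, pod)
	return nil
}
```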
andreyvelich
left a comment
Thanks @wojtek-t, overall looks great!
I left a few questions.
> - Define the scheduler changes needed to implement workload-aware preemption
> - Provide full backward compatibility for all existing scheduling features
>
> ### Non-Goals
What about partial preemption of a Workload?
I would imagine that, with the DependsOn API in JobSet, that is something we should talk about at some point.
E.g. supporting Argo Workflows in Kueue: kubernetes-sigs/kueue#74
> When running an AI Training job, I want to ensure that it will not be partially preempted.
> If at least one of my pods is not running, the others are not making progress anyway and are
> just wasting the resources in the cluster.
In elastic training scenarios things become more complex: kubeflow/trainer#2903
But I guess we can re-iterate on this after the initial implementation, right?
> type GangSchedulingPolicy struct {
>     // Existing field(s).
>
>     // IsGangPreemptable defines whether all pods from this group should
>     // be preempted in all-or-nothing fashion.
>     IsGangPreemtable *bool
> }
Shall we try to design an API that is future proof?
What if in the future we allow partially preempting a group of pods from a gang for elastic training?
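For discussion, one future-proof alternative to a bare bool would be an enum-style field (these names are made up, not proposed API):

```go
// Illustrative alternative to `IsGangPreemtable *bool` that leaves room for
// partial preemption of a gang (e.g. elastic training) in the future.
type GangPreemptionMode string

const (
	// GangPreemptionAllOrNothing preempts the whole gang or nothing.
	GangPreemptionAllOrNothing GangPreemptionMode = "AllOrNothing"
	// GangPreemptionPartial would allow preempting individual pods of the
	// gang (a possible future mode, not part of this KEP).
	GangPreemptionPartial GangPreemptionMode = "Partial"
)

type GangSchedulingPolicy struct {
	// Existing field(s).

	// PreemptionMode defines how pods from this group may be preempted.
	// A nil value could keep the semantics of the current bool proposal.
	PreemptionMode *GangPreemptionMode
}
```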
> This consistency will allow us to properly handle the case when users set neither pod
> nor workload priorities.
> Similarly, we will ensure that `PriorityClass.preemptionPolicy` works exactly the same way for
> workloads as for pods. Such level of consistency would make adoption of the Workload API much easier.
Does it mean that all limitations apply to Workload preemption as well?
https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#limitations-of-preemption
> // IsGangPreemptable defines whether all pods from this group should
> // be preempted in all-or-nothing fashion.
> IsGangPreemtable *bool
> }
Have we considered whether something similar to preemptionPolicy: Never makes sense for Workloads? Do we know whether there are use cases for a workload that should just wait for space in the cluster without preempting other pods/workloads, while still requiring the whole gang to start at once?
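If that use case exists, it could plausibly reuse the existing PriorityClass mechanism, given the consistency goal quoted above; for example (sketch only, assuming workloads inherit the class's preemptionPolicy):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// A priority class whose pods - and, assuming the consistency described in
// the quoted KEP text, whole workloads - wait for free capacity instead of
// preempting anything, while gang scheduling still requires the whole gang
// to start at once.
func nonPreemptingClass() *schedulingv1.PriorityClass {
	never := corev1.PreemptNever
	return &schedulingv1.PriorityClass{
		ObjectMeta:       metav1.ObjectMeta{Name: "gang-non-preempting"},
		Value:            100000,
		PreemptionPolicy: &never,
	}
}
```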
One-line PR description: First draft of Workload-aware preemption KEP
Issue link: Workload-aware preemption #5710