@@ -143,6 +143,10 @@ and many others) and bring the true value for every Kubernetes user.
143143 (e.g. caused by hardware failures)
144144- Design rescheduling for workloads that will be preempted (rescheduling will
145145 be addressed in a separate dedicated KEP)
146+ - Change the principle of avoiding preemption whenever a workload/pod can be scheduled without it.
147+   If we decide to change that, it will be addressed in a dedicated KEP.
148+ - Propose any tradeoff between preemption and cluster scale-up.
149+ - Design workload-level preemption triggered by external schedulers
146150
147151## Proposal
148152
@@ -317,30 +321,78 @@ object (Workload, PodGroup, PodSubGroup, ...) corresponding to this pod.
317321
318322
319323There is one direct implication of the above - the ` pod.Spec.PriorityClassName ` and ` pod.Spec.Priority `
320- may no longer reflect the actual pod priority. This can be misleading to users.
324+ may no longer reflect the actual pod priority, which could be misleading to users.
321325
322326```
323327<<[UNRESOLVED priority divergence]>>
324328There are several options we can approach it (from least to most invasive):
325- - Explain via documentation
326- - Validating that if a pod is referencing a workload, `pod.Spec.PriorityClassName` equals
327- `workload.Spec.PriorityClassName`. However, `Workload` object potentially may not exist
328- yet on pod creation.
329- - Making `pod.Spec.PriorityClassName` and `pod.Spec.Priority` mutable fields and having a
330- controller responsible for reconciling these. However, that doesn't fully address the
331- problems as divergence between the pod and PodTemplate in true workload object could also
332- be misleading.
333-
334- The validation option seems like the best option, if we can address the problem of not-yet
335- existing `Workload` object (reversed validation?).
329+ - Describe the possible divergence via documentation
330+ - Expose the information about divergence in the API.
331+ This would require introducing a new `Conditions` field in `workload.Status`, together with
332+ a dedicated condition like `PodsNotMatchingPriority` that will be set by either kube-scheduler
333+ or a new workload-controller whenever it observes pods referencing a given `Workload` object
334+ whose priority doesn't match the priority of the workload object.
335+ - Introducing admission validation that if a pod is referencing a workload object, its
336+ `pod.Spec.PriorityClassName` equals `workload.Spec.PriorityClassName`. However, we allow creating
337+ pods before the workload object, and there doesn't seem to be an easy way to avoid races.
338+ - Making `pod.Spec.PriorityClassName` and `pod.Spec.Priority` mutable fields and having a new
339+ workload controller responsible for reconciling these. However, that could introduce another
340+ divergence between the priority of pods and the priority defined in the PodTemplate in true
341+ workload objects, which would introduce a similar level of confusion for users.
342+
343+ If we could address the race in the validation, that would seem like the preferred option. However,
344+ I don't see an easy way to do it.
345+ Given that, we suggest proceeding with just exposing the information about the divergence in the
346+ Workload status (second option) and potentially improving it later.
336347<<[/UNRESOLVED]>>
337348```
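To make the second option more concrete, below is a minimal sketch (in Go) of how a workload controller or kube-scheduler could surface the divergence as a status condition. The `PodsNotMatchingPriority` condition name comes from the option above; the function shape, the `sketch` package, and the assumption that the caller already has the workload's priority class name and the list of its pods are illustrative only, not part of the proposal.

```go
// Sketch only: illustrates the "expose the divergence in the API" option.
package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// priorityDivergenceCondition computes the condition that could be written to
// workload.Status.Conditions when any pod referencing the workload uses a
// priority class different from workload.Spec.PriorityClassName.
func priorityDivergenceCondition(workloadPriorityClassName string, pods []corev1.Pod) metav1.Condition {
	for _, pod := range pods {
		if pod.Spec.PriorityClassName != workloadPriorityClassName {
			return metav1.Condition{
				Type:   "PodsNotMatchingPriority",
				Status: metav1.ConditionTrue,
				Reason: "PriorityDivergence",
				Message: fmt.Sprintf("pod %s/%s uses priority class %q, workload uses %q",
					pod.Namespace, pod.Name, pod.Spec.PriorityClassName, workloadPriorityClassName),
			}
		}
	}
	return metav1.Condition{
		Type:   "PodsNotMatchingPriority",
		Status: metav1.ConditionFalse,
		Reason: "AllPodsMatch",
	}
}
```

A real controller would additionally maintain `LastTransitionTime` and `ObservedGeneration`; the point here is only the shape of the signal exposed to users.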
338349
339- The similar argument holds for preemption priority, but we argue that its mutable nature
340- makes it infeasible for reconciling this information back to pod for scalability reasons
341- (we can absolutely handle frequent updates to ` Workload.Spec.PreemptionPriorityClassName `
342- but we can't handle updating potentially hundreds of thousands of pods within that workload
343- that frequently). In this case, we limit ourselves to documentation.
350+ It's worth mentioning here that we want to introduce the same defaulting rules for
351+ `workload.Spec.PriorityClassName` that we have for pods. Namely, if `PriorityClassName` is unset
352+ and there exists a PriorityClass marked as `globalDefault`, we default it to that value.
353+ This consistency will allow us to properly handle the case when users set neither pod
354+ nor workload priorities.
355+ Similarly, we will ensure that `PriorityClass.preemptionPolicy` works exactly the same way for
356+ workloads as for pods. Such a level of consistency would make adoption of the Workload API much easier.
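As a rough illustration of that defaulting rule, here is a sketch assuming a helper that receives the currently requested priority class name of the workload together with the list of existing PriorityClasses; this is a sketch of the intent, not the proposed implementation:

```go
package sketch

import (
	schedulingv1 "k8s.io/api/scheduling/v1"
)

// defaultPriorityClassName mirrors for workloads the defaulting that pods
// already have: if no priority class is set and a PriorityClass marked as
// globalDefault exists, default to it; otherwise keep the value unchanged.
func defaultPriorityClassName(requested string, priorityClasses []schedulingv1.PriorityClass) string {
	if requested != "" {
		return requested // explicitly set by the user, nothing to default
	}
	for _, pc := range priorityClasses {
		if pc.GlobalDefault {
			return pc.Name
		}
	}
	// No global default exists; the workload keeps the implicit zero priority,
	// just like a pod without a priority class.
	return ""
}
```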
357+
358+ Moving to `PreemptionPriorityClassName`, the same issue of confusion holds (the actual priority
359+ set at the pod level may not reflect the priority used for preemption). We argue that its mutable
360+ nature makes it infeasible to reconcile this information back to pods for scalability reasons
361+ (we can absolutely handle frequent updates to `Workload.Spec.PreemptionPriorityClassName`,
362+ but we can't handle updating potentially hundreds of thousands of pods within that workload
363+ that frequently). So in this case, we limit ourselves to documentation.
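For orientation, here is how the two priority-related fields discussed above could sit next to each other on the workload spec. The `WorkloadSpec` shape and JSON tags are assumptions made only for this illustration, not the final API:

```go
package sketch

// WorkloadSpec is an assumed shape, for illustration only.
type WorkloadSpec struct {
	// PriorityClassName defines the scheduling priority of the workload and
	// follows the same defaulting rules (globalDefault) as pod priority.
	PriorityClassName string `json:"priorityClassName,omitempty"`

	// PreemptionPriorityClassName defines the priority used when deciding
	// whether this workload may be preempted. It is expected to be mutable
	// and is intentionally never reconciled back to individual pods.
	PreemptionPriorityClassName string `json:"preemptionPriorityClassName,omitempty"`
}
```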
364+
365+ ```
366+ <<[UNRESOLVED preemption cycles]>>
367+ If we allowed an arbitrary relation between scheduling priority and preemption priority,
368+ we could hit an infinite cycle of preemption. Consider an example where:
369+ - workload A has scheduling priority `high` and preemption priority `low`
370+ - workload B has scheduling priority `high` and preemption priority `low`
371+ In such a case, workload A can preempt workload B (`high` > `low`), but then workload B can
372+ also preempt workload A. This is definitely not desired.
373+ We can avoid the infinite cycle by ensuring that `scheduling priority <= preemption priority`.
374+
375+ However, this also opens a question of whether we should allow setting an arbitrarily high preemption
376+ priority for low scheduling priority workloads. Arguably we can claim that scheduling priority
377+ should be the ultimate truth, and a workload with higher scheduling priority should be able to
378+ preempt one with lower scheduling priority.
379+ So the alternative model we can consider is, instead of adding the concept of preemption
380+ priority, to introduce a concept of "preemption cost". In such a model, a workload with
381+ higher priority can always preempt lower priority ones, but if we need to choose between
382+ two workloads to preempt, the preemption cost may result in choosing the one with higher
383+ priority amongst these two. Consider the following example:
384+ - we want to schedule workload A with scheduling priority `high`
385+ - it needs to preempt one of the already running workloads
386+ - workload B has scheduling priority `med` but preemption cost `low`
387+ - workload C has scheduling priority `low` but preemption cost `high`
388+ In such a case, the preemption cost would result in choosing workload B for preemption. But
389+ if workload B then gets recreated, it will preempt workload C, causing unnecessary cascading preemption.
390+ This is the reason why the cost-based model was discarded.
391+
392+ So for now, we suggest only introducing additional validation that the scheduling priority is
393+ not higher than the preemption priority.
394+ <<[/UNRESOLVED]>>
395+ ```
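A minimal sketch of the validation suggested above, assuming the integer priority values have already been resolved from the respective priority class names; the helper name and signature are illustrative:

```go
package sketch

import "fmt"

// validatePreemptionPriority enforces the rule discussed above: the scheduling
// priority must not be higher than the preemption priority, otherwise two
// workloads could keep preempting each other in a cycle.
func validatePreemptionPriority(schedulingPriority, preemptionPriority int32) error {
	if schedulingPriority > preemptionPriority {
		return fmt.Errorf("scheduling priority (%d) must not be higher than preemption priority (%d)",
			schedulingPriority, preemptionPriority)
	}
	return nil
}
```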
344396
345397```
346398<<[UNRESOLVED priority status]>>
@@ -356,6 +408,7 @@ We should introduce/describe `workload.status` to reflect:
356408We start with describing at the high-level how existing pod-level preemption algorithm works.
357409Below, we will show how to generalize it to workloads.
358410
411+ If a pod P can be scheduled without triggering preemption, we don't consider preemption at all.
359412To check if a pod P can be scheduled on a given node with preemption we:
360413
3614141 . Identify the list of potential victims - all running pods with priority lower than the new pod P.
@@ -368,8 +421,8 @@ To check if a pod P can be scheduled on a given node with preemption we:
3684211 . From remaining potential victims, we start to reprieve pods starting from the highest priority
369422 and working down until the set of remaining victims still keeps the node feasible.
370423
371- Once we compute the feasibility and list of victims for all nodes , we score that and choose the
372- best options.
424+ Once we find enough nodes feasible for preemption, together with the list of victims for them, we
425+ score them and choose the best option.
373426
374427The above algorithm achieves our principles, as by eliminating highest priority pods first, it
375428effectively tries to minimize the cascading preemptions later.
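A compressed sketch of this per-node check, using simplified types. The `feasibilityFn` callback stands in for running the scheduler's Filter plugins with the given victims removed; names and types are illustrative, and the PodDisruptionBudget-related reprieval is omitted for brevity:

```go
package sketch

import "sort"

// podInfo is a simplified stand-in for the scheduler's pod representation.
type podInfo struct {
	Name     string
	Priority int32
}

// feasibilityFn is assumed to answer: "would pod p fit on this node if the
// given victims were removed?" (i.e. running the usual Filter plugins).
type feasibilityFn func(p podInfo, victims []podInfo) bool

// victimsOnNode mirrors the per-node check described above: collect lower
// priority pods as potential victims, bail out if removing all of them does
// not help, then reprieve victims starting from the highest priority.
func victimsOnNode(p podInfo, running []podInfo, feasible feasibilityFn) ([]podInfo, bool) {
	// 1. Potential victims: all running pods with priority lower than p.
	var victims []podInfo
	for _, r := range running {
		if r.Priority < p.Priority {
			victims = append(victims, r)
		}
	}
	// 2. If removing all of them is not enough, preemption cannot help on this node.
	if !feasible(p, victims) {
		return nil, false
	}
	// 3. Reprieve pods from the highest priority down, as long as the node stays feasible.
	sort.Slice(victims, func(i, j int) bool { return victims[i].Priority > victims[j].Priority })
	final := append([]podInfo{}, victims...)
	for _, v := range victims {
		candidate := without(final, v.Name)
		if feasible(p, candidate) {
			final = candidate // v can stay; it is no longer a victim
		}
	}
	return final, true
}

// without returns pods with the named pod removed (pod names assumed unique here).
func without(pods []podInfo, name string) []podInfo {
	out := make([]podInfo, 0, len(pods))
	for _, pod := range pods {
		if pod.Name != name {
			out = append(out, pod)
		}
	}
	return out
}
```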
@@ -380,61 +433,86 @@ moving to the level of `Workload`, but also no longer operating at the level of
380433We need to look at the cluster as a whole. With that in mind, keeping the algorithm efficient
381434becomes a challenge, thus we modify to the approach below.
382435
383- To check if a workload W can be scheduled on a given cluster with preemption we:
384-
385- 1 . Identify the list of potential victims:
386- - all running workloads with (preemption) priority lower than the new workload W
387- - all individual pods (not being part of workloads) with priority lower than the new workload W
388-
389- 1 . If removing all the potential victims would not make the new workload W schedulable,
390- the workload is unschedulable even with preemption.
391-
392- ```
393- <<[UNRESOLVED PodDisruptionBudget violations]>>
394- How critical is reprieving workloads and pods violating PodDisruptionBudgets? We no longer can
395- afford full workload scheduling trying to reprieve every individual pod and workload.
396-
397- We could consider finding the first one that can't be reprieved using binary search, but if we
398- can't reprieve any of those, learning about that would require O(N) full workload schedulings
399- with N being number of workload/pods violating PDB.
400- <<[/UNRESOLVED]>>
401- ```
402-
403- 1 . For remaining potential victims, using binary search across priorities find the minimal priority P
404- for which scheduling the new workload W doesn't require preempting any workloads and/or pods with
405- priority higher than P. This allows to reduce the potential cascading preemptions later.
406-
407- ```
408- <<[UNRESOLVED minimizing preemptions]>>
409- The following algorithm is by far no optimal, but is simple to reason about and I would suggest it as
410- a starting point:
411- - assume that all potential victims on the list are removed and schedule the new workload W
412- - go over the remaining potential victims starting from the highest priority and check if these can
413- be placed in the place they are currently running; if so remove from the potential victims
414-
415- As a bonus we may consider few potential placements of the new workload W here and choose the one that
416- somehow optimizes the number of victims. But that will become more critical once we get to
417- Topology-Aware-Scheduling and I would leave that optimization until then.
418- <<[/UNRESOLVED]>>
419- ```
420-
421- ```
422- <<[UNRESOLVED sharing algorithms]>>
423- The remaining question is to what extent we want to unify the preemption mechanism across
424- pod-triggerred (existing algorithm) and workload-triggerred preemption.
425-
426- It might be tempting to start with a dedicated new implementation to reduce the risk. But the above
427- proposal was structured such way to facilitate sharing:
428- - once the new workload W is placed, going over the remaining potential victims and trying to
429- place them where they are currently running, is pretty much exactly what the current algorithm is
430- doing
431- - considering "few potential placements" in the pod-triggerred case can be used as "try every node"
432- so effectively it's also the existing algorithm (just viewed from a slightly different angle
433-
434- So I would actually argue to we should refactor the existing preemption code and use that in both
435- cases.
436- <<[/UNRESOLVED]
437- ```
436+ At the same time, we need to support four cases:
437+ - individual pod as preemptor, individual pod(s) as victim(s)
438+ - individual pod as preemptor, pod group(s) (and individual pod(s)) as victim(s)
439+ - pod group as preemptor, individual pod(s) as victim(s)
440+ - pod group as preemptor, pod group(s) (and individual pod(s)) as victim(s)
441+
442+ To achieve that, we don't want to multiply preemption algorithms; rather, we want to have a
443+ unified high-level approach (with potential minor tweaks per case).
444+
445+ To check if a given preemptor (either (gang) PodGroup G or an individual pod P) can be scheduled
446+ with preemption:
447+
448+ 1. Split the cluster into mutually-exclusive domains where the preemptor may be placed:
449+ - for pod P, these will always be individual nodes
450+ - for pod group G, we will start with just one domain, the "whole cluster"; eventually, once we
451+ have topology-aware scheduling, we will most probably inject some domain-based split here
452+
453+ 1. For every domain D computed above, run the following steps:
454+
455+ 1. Identify the list of all potential victims in that domain:
456+ - all running workloads with (preemption) priority lower than the preemptor priority; note that
457+ some pods from such a workload may be running outside of the currently considered domain D - they
458+ need to contribute to scoring, but they won't contribute to the feasibility of domain D.
459+ - all individual pods with priority lower than the preemptor priority
460+
461+ 1. If removing all potential victims would not make the preemptor schedulable, the preemptor
462+ is unschedulable with preemption in the currently considered domain D.
463+
464+ 1. Sort all the potential victims to reflect their "importance" (from the most important to the
465+ least important). Tentatively, the function will sort first by priority, and within a single
466+ priority it will prioritize workloads over individual pods.
467+
468+ 1. Perform best-effort reprieval of workloads and pods violating PodDisruptionBudgets. We achieve
469+ it by scheduling and assuming the preemptor (assuming that all potential victims are removed),
470+ and then iterating over the potential victims that would violate a PodDisruptionBudget to check if
471+ they can be placed in the exact same place they are running now. If they can, we simply leave
472+ them where they are running now and remove them from the potential victims list.
473+
474+ ```
475+ <<[UNRESOLVED PodDisruptionBudget violations]>>
476+ The above reprieval works identically to the current algorithm if the domain D is a single node.
477+ For larger domains, different placements of the preemptor are possible and may allow
478+ different sets of victims violating PodDisruptionBudgets to be reprieved.
479+ This means that the above algorithm does not minimize the number of victims that
480+ would violate their PodDisruptionBudgets.
481+ However, we claim that an algorithm optimizing for it would be computationally extremely expensive,
482+ and we propose to stick with this simple version at least for the foreseeable future.
483+ <<[/UNRESOLVED]>>
484+ ```
485+
486+ 1. For the remaining potential victims, using binary search across priorities, find the minimal
487+ priority N for which scheduling the preemptor can be achieved without preempting any victims with
488+ priority higher than N. This reduces potential cascading preemptions later (see the sketch after this algorithm).
489+
490+ 1. Eliminate all victims from the potential victims list that have priority higher than N.
491+
492+ 1. Schedule and assume the preemptor (assuming that all remaining potential victims are removed).
493+
494+ 1. Iterate over the list of potential victims (in the order established by the sorting above), checking
495+ if they can be placed where they are currently running. If so, assume them back and remove them
496+ from the potential victims list.
497+
498+ ```
499+ <<[UNRESOLVED minimizing preemptions]>>
500+ The above algorithm is definitely not optimal, but it is (a) compatible with the current pod-based
501+ algorithm, (b) computationally feasible, and (c) simple to reason about.
502+ As a result, I suggest that we proceed with it at least as a starting point.
503+
504+ As a bonus, we may consider a few potential placements of the preemptor and choose the one that
505+ somehow optimizes the number of victims. However, that will become more critical once we
506+ get to Topology-Aware-Scheduling, and I would leave that improvement until then.
507+ <<[/UNRESOLVED]>>
508+ ```
509+
510+ 1. We score scheduling decisions for each of the domains and choose the best one. The exact criteria
511+ for that will be figured out during the implementation phase.
512+
513+ It's worth noting that as structured, this algorithm addresses all four cases mentioned above that
514+ we want to support and is compatible with the current pod-based preemption algorithm. This means
515+ we will be able to achieve in-place replacement with relatively localized changes.
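As referenced in the binary-search step above, here is a sketch of finding the minimal priority N. It assumes a hypothetical `schedulable` callback answering whether the preemptor fits in the domain when only victims with priority at most the given value may be preempted; since allowing more victims can only help, the predicate is monotone and binary search applies, costing O(log V) feasibility checks for V distinct victim priorities:

```go
package sketch

import "sort"

// schedulableFn is assumed to answer: "can the preemptor be scheduled in the
// domain if only victims with priority <= maxVictimPriority may be preempted?"
type schedulableFn func(maxVictimPriority int32) bool

// minimalVictimPriority finds the minimal priority N such that the preemptor
// can be scheduled without preempting any victim with priority higher than N.
// It relies on the earlier step having verified that preempting all potential
// victims makes the preemptor schedulable.
func minimalVictimPriority(victimPriorities []int32, schedulable schedulableFn) int32 {
	if len(victimPriorities) == 0 {
		return 0 // no potential victims at all; nothing to bound
	}
	// Sort the candidate priorities in ascending order.
	sort.Slice(victimPriorities, func(i, j int) bool { return victimPriorities[i] < victimPriorities[j] })
	lo, hi := 0, len(victimPriorities)-1
	for lo < hi {
		mid := (lo + hi) / 2
		if schedulable(victimPriorities[mid]) {
			hi = mid // allowing victims up to this priority is enough; try lower
		} else {
			lo = mid + 1
		}
	}
	return victimPriorities[lo]
}
```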
438516
439517### Delayed preemption
440518
@@ -445,10 +523,15 @@ Should we leave it as part of this KEP or should this be moved to the Gang-Sched
445523```
446524
447525As part of minimizing preemptions goal, arguably the most important thing to do is to avoid unnecessary
448- preemptions. However, this is not true for the current gang scheduling implementation.
449- In the current implementation, preemption is triggered in the ` PostFiler ` . However, it's entirely
450- possible that a given pod may actually not even proceed to binding, because we can't schedule the
451- whole gang. In such case, the preemption ended up being a completely unnecessary disruption.
526+ preemptions. However, the current model of preemption, where preemption is triggered immediately
527+ after the victims are decided (in `PostFilter`), doesn't achieve this goal. The reason is that
528+ the proposed placement (nomination) can actually turn out to be invalid and not be proceeded with.
529+ In such a case we will not even proceed to binding, and the preemption will be a completely
530+ unnecessary disruption.
531+ Note that this problem already exists in the current gang scheduling implementation. A given gang may
532+ not proceed with binding if the `minCount` pods from it can't be scheduled. But the preemptions are
533+ currently triggered immediately after choosing a place for individual pods. So, similarly as above,
534+ we may end up with completely unnecessary disruptions.
452535
453536We will address it with what we call `delayed preemption` mechanism as following:
454537