Commit 5758be0

Expand on review comments
1 parent 0ff3958 commit 5758be0

keps/sig-scheduling/5710-workload-aware-preemption/README.md

Lines changed: 89 additions & 29 deletions
@@ -143,6 +143,10 @@ and many others) and bring the true value for every Kubernetes user.
 (e.g. caused by hardware failures)
 - Design rescheduling for workloads that will be preempted (rescheduling will
   be addressed in a separate dedicated KEP)
+- Change the preemption principle of avoiding preemption if a workload/pod can be scheduled without it.
+  If we decide to change that, it will be addressed in a dedicated KEP.
+- Propose any tradeoff between preemption and cluster scale-up.
+- Design workload-level preemption triggered by external schedulers

 ## Proposal

@@ -317,30 +321,78 @@ object (Workload, PodGroup, PodSubGroup, ...) corresponding to this pod.


 There is one direct implication of the above - the `pod.Spec.PriorityClassName` and `pod.Spec.Priority`
-may no longer reflect the actual pod priority. This can be misleading to users.
+may no longer reflect the actual pod priority, which could be misleading to users.

 ```
 <<[UNRESOLVED priority divergence]>>
 There are several options we can approach it (from least to most invasive):
-- Explain via documentation
-- Validating that if a pod is referencing a workload, `pod.Spec.PriorityClassName` equals
-  `workload.Spec.PriorityClassName`. However, `Workload` object potentially may not exist
-  yet on pod creation.
-- Making `pod.Spec.PriorityClassName` and `pod.Spec.Priority` mutable fields and having a
-  controller responsible for reconciling these. However, that doesn't fully address the
-  problems as divergence between the pod and PodTemplate in true workload object could also
-  be misleading.
-
-The validation option seems like the best option, if we can address the problem of not-yet
-existing `Workload` object (reversed validation?).
+- Describe the possible divergence via documentation
+- Expose the information about divergence in the API.
+  This would require introducing a new `Conditions` field in `workload.Status` and introducing
+  a dedicated condition like `PodsNotMatchingPriority` that will be set by either kube-scheduler
+  or a new workload-controller whenever it observes pods referencing a given `Workload` object
+  whose priority doesn't match the priority of the workload object.
+- Introducing an admission to validate that if a pod is referencing a workload object, its
+  `pod.Spec.PriorityClassName` equals `workload.Spec.PriorityClassName`. However, we allow creating
+  pods before the workload object, and there doesn't seem to be an easy way to avoid races.
+- Making `pod.Spec.PriorityClassName` and `pod.Spec.Priority` mutable fields and having a new
+  workload controller responsible for reconciling these. However, that could introduce another
+  divergence between the priority of pods and the priority defined in the PodTemplate in true
+  workload objects, which would introduce a similar level of confusion to users.
+
+If we could address the race in validations, that would seem like the desired option. However,
+I don't see an easy way to achieve it.
+Given that, we suggest proceeding with just exposing the information about divergence in the
+Workload status (second option) and potentially improving it later.
 <<[/UNRESOLVED]>>
 ```
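
For concreteness, here is a minimal Go sketch of the status-based option preferred above, assuming `workload.Status` gains a standard `Conditions` field. The function shape and the mismatch count are illustrative assumptions; `PodsNotMatchingPriority` is the condition name proposed above, not a final API.

```go
// Hypothetical sketch: surface priority divergence on the Workload status.
// Assumes workload.Status.Conditions is a standard []metav1.Condition slice.
package workloadstatus

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setPriorityDivergenceCondition records how many pods referencing the
// Workload carry a priority that differs from workload.Spec.PriorityClassName.
func setPriorityDivergenceCondition(conditions *[]metav1.Condition, mismatched int, generation int64) {
	cond := metav1.Condition{
		Type:               "PodsNotMatchingPriority", // condition name proposed above
		Status:             metav1.ConditionFalse,
		Reason:             "AllPodsMatchPriority",
		Message:            "all pods referencing this Workload use its PriorityClass",
		ObservedGeneration: generation,
	}
	if mismatched > 0 {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "PodPriorityDivergence"
		cond.Message = fmt.Sprintf("%d pod(s) reference this Workload but use a different PriorityClass", mismatched)
	}
	meta.SetStatusCondition(conditions, cond)
}
```
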

-The similar argument holds for preemption priority, but we argue that its mutable nature
-makes it infeasible for reconciling this information back to pod for scalability reasons
-(we can absolutely handle frequent updates to `Workload.Spec.PreemptionPriorityClassName`
-but we can't handle updating potentially hundreds of thousands of pods within that workload
-that frequently). In this case, we limit ourselves to documentation.
+It's worth mentioning here that we want to introduce the same defaulting rules for
+`workload.Spec.PriorityClassName` that we have for pods. Namely, if `PriorityClassName` is unset
+and there exists a PriorityClass marked as `globalDefault`, we default it to that value.
+This consistency will allow us to properly handle the case when users set neither pod
+nor workload priorities.
+Similarly, we will ensure that `PriorityClass.preemptionPolicy` works exactly the same way for
+workloads as for pods. Such a level of consistency would make adoption of the Workload API much easier.
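
As a rough illustration of the defaulting rule described above, the logic mirrors how pod defaulting treats a `globalDefault` PriorityClass. The helper shape below is an assumption, not the KEP's actual admission code.

```go
// Illustrative sketch only: default workload.Spec.PriorityClassName the same
// way pod priority defaulting treats a globalDefault PriorityClass.
package workloaddefaults

import schedulingv1 "k8s.io/api/scheduling/v1"

// defaultPriorityClassName returns the existing name if set; otherwise the
// name of the PriorityClass marked as globalDefault, if one exists.
func defaultPriorityClassName(current string, classes []schedulingv1.PriorityClass) string {
	if current != "" {
		return current
	}
	for _, pc := range classes {
		if pc.GlobalDefault {
			return pc.Name
		}
	}
	return "" // no global default: leave unset
}
```
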
+
+Moving to `PreemptionPriorityClassName`, the same issue of confusion holds (the actual priority
+set at the pod level may not reflect the priority used for preemption). We argue that its mutable
+nature makes it infeasible to reconcile this information back to pods for scalability reasons
+(we can absolutely handle frequent updates to `Workload.Spec.PreemptionPriorityClassName`,
+but we can't handle updating potentially hundreds or thousands of pods within that workload
+that frequently). So in this case, we limit ourselves to documentation.
+
+```
+<<[UNRESOLVED preemption cycles]>>
+If we allowed an arbitrary relation between scheduling priority and preemption priority,
+we could hit an infinite cycle of preemption. Consider an example where:
+- workload A has scheduling priority `high` and preemption priority `low`
+- workload B has scheduling priority `high` and preemption priority `low`
+In such a case, workload A can preempt workload B (`high` > `low`), but then workload B can
+also preempt workload A. This is definitely not desired.
+We can avoid the infinite cycle by ensuring that `scheduling priority <= preemption priority`.
+
+However, this also opens a question of whether we should allow setting an arbitrarily high preemption
+priority for low scheduling priority workloads. Arguably, we can claim that scheduling priority
+should be the ultimate truth, and if there is a workload with higher priority it should be
+able to preempt it.
+So the alternative model that we can consider is, instead of adding the concept of preemption
+priority, to introduce a concept of "preemption cost". In such a model, the workload with
+higher priority can always preempt lower priority ones, but if we need to choose between
+two workloads to preempt, the preemption cost may result in choosing the one with higher
+priority amongst these two. Consider the following example:
+- we want to schedule workload A with scheduling priority `high`
+- it needs to preempt one of the already running workloads
+- workload B has scheduling priority `med` but preemption cost `low`
+- workload C has scheduling priority `low` but preemption cost `high`
+In such a case, the preemption cost would result in choosing workload B for preemption. But
+if it gets recreated, it will preempt workload C, causing unnecessary cascading preemption.
+This is the reason why a cost-based model was discarded.
+
+So for now, we suggest introducing only additional validation that the scheduling priority is
+not higher than the preemption priority.
+<<[/UNRESOLVED]>>
+```
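
A hedged sketch of the validation suggested above, assuming both priorities have already been resolved from their PriorityClass names to numeric values; the names and function shape are assumptions, not this KEP's API.

```go
// Sketch: reject a Workload whose resolved scheduling priority is higher than
// its resolved preemption priority, which would otherwise permit the
// preemption cycle in the example above.
package workloadvalidation

import "fmt"

func validatePreemptionPriority(schedulingPriority, preemptionPriority int32) error {
	if schedulingPriority > preemptionPriority {
		return fmt.Errorf("scheduling priority (%d) must not be higher than preemption priority (%d)",
			schedulingPriority, preemptionPriority)
	}
	return nil
}
```
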

 ```
 <<[UNRESOLVED priority status]>>
@@ -356,6 +408,7 @@ We should introduce/describe `workload.status` to reflect:
 We start with describing at the high-level how existing pod-level preemption algorithm works.
 Below, we will show how to generalize it to workloads.

+If a pod P can be scheduled without triggering preemption, we don't consider preemption at all.
 To check if a pod P can be scheduled on a given node with preemption we:

 1. Identify the list of potential victims - all running pods with priority lower than the new pod P.
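
As a toy illustration of this step (the types and names are made up for the sketch, not scheduler code), identifying potential victims is just a priority filter over the pods running on the node:

```go
// Illustrative only: potential victims are the running pods whose priority is
// lower than the priority of the pending pod P.
package preemptionsketch

type podInfo struct {
	name     string
	priority int32
}

func potentialVictims(running []podInfo, p podInfo) []podInfo {
	var victims []podInfo
	for _, v := range running {
		if v.priority < p.priority {
			victims = append(victims, v)
		}
	}
	return victims
}
```
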
@@ -368,8 +421,8 @@ To check if a pod P can be scheduled on a given node with preemption we:
 1. From remaining potential victims, we start to reprieve pods starting from the highest priority
    and working down until the set of remaining victims still keeps the node feasible.

-Once we compute the feasibility and list of victims for all nodes, we score that and choose the
-best options.
+Once we find enough nodes feasible for preemption and the lists of victims for them, we score them and
+choose the best option.

 The above algorithm achieves our principles, as by eliminating highest priority pods first, it
 effectively tries to minimize the cascading preemptions later.
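
To make the reprieve step concrete, here is a self-contained toy sketch, simplified to a single scalar resource. All types and helper names are illustrative assumptions rather than the scheduler's actual code.

```go
// Toy sketch of the reprieve step: walk potential victims from highest to
// lowest priority and keep (reprieve) any victim whose presence still leaves
// the node feasible for the new pod.
package main

import (
	"fmt"
	"sort"
)

type pod struct {
	name     string
	priority int32
	request  int64
}

// reprieve returns the victims that still must be preempted so that newPod
// fits within capacity, preferring to keep high-priority victims.
// used is the amount currently consumed on the node, including the victims.
func reprieve(capacity, used int64, newPod pod, victims []pod) []pod {
	// Start from the state with all potential victims removed and newPod placed.
	free := capacity - used
	for _, v := range victims {
		free += v.request
	}
	free -= newPod.request

	sort.Slice(victims, func(i, j int) bool { return victims[i].priority > victims[j].priority })
	var remaining []pod
	for _, v := range victims {
		if v.request <= free {
			free -= v.request // node stays feasible: reprieve this victim
		} else {
			remaining = append(remaining, v) // still has to be preempted
		}
	}
	return remaining
}

func main() {
	victims := []pod{{"a", 100, 2}, {"b", 50, 3}, {"c", 10, 2}}
	fmt.Println(reprieve(8, 7, pod{"new", 1000, 4}, victims))
}
```

Running this preempts only the mid-priority pod `b`: the higher-priority `a` and then `c` are reprieved because the node stays feasible with them kept.
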
@@ -380,13 +433,13 @@ moving to the level of `Workload`, but also no longer operating at the level of
 We need to look at the cluster as a whole. With that in mind, keeping the algorithm efficient
 becomes a challenge, thus we modify to the approach below.

-To check if a workload W can be scheduled on a given cluster with preemption we:
+To check if a (gang) PodGroup G can be scheduled on a given cluster with preemption we:

 1. Identify the list of potential victims:
-   - all running workloads with (preemption) priority lower than the new workload W
-   - all individual pods (not being part of workloads) with priority lower than the new workload W
+   - all running workloads with (preemption) priority lower than the new pod group G
+   - all individual pods (not being part of workloads) with priority lower than the new pod group G

-1. If removing all the potential victims would not make the new workload W schedulable,
+1. If removing all the potential victims would not make the new pod group G schedulable,
    the workload is unschedulable even with preemption.

 ```
@@ -402,15 +455,22 @@ with N being number of workload/pods violating PDB.

 1. For remaining potential victims, using binary search across priorities find the minimal priority P
    for which scheduling the new workload W doesn't require preempting any workloads and/or pods with
-   priority higher than P. This allows to reduce the potential cascading preemptions later.
+   priority higher than P. This allows us to reduce the potential cascading preemptions later (see the sketch after this list).
+
+1. After eliminating all workloads and pods with priority higher than P (computed above) from the
+   potential victims list:
+
+   1. assume that all those potential victims are removed from the cluster and schedule the new pod group
+      G with that assumption
+   1. sort the potential victims to reflect their "importance" (tentative proposal - sort first by
+      their priority, within a single priority prefer workloads)
+   1. go over the list of potential victims in the above order checking if they can be placed
+      where they are currently running. If so, assume it back and remove it from the potential victims list.
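
This is the sketch referenced in the binary-search step above. It is hypothetical (`fits` is an assumed callback, not this KEP's code) and relies on feasibility being monotone in P: allowing preemption of victims up to a higher priority can only make pod group G easier to place, which is what makes binary search valid.

```go
// Illustrative sketch of the binary search across victim priorities.
// fits(p) reports whether pod group G becomes schedulable after removing
// every potential victim with priority <= p.
package preemptionsketch

import "sort"

// minimalPreemptionPriority returns the smallest priority P (from the sorted,
// ascending list of distinct victim priorities) for which fits(P) holds, and
// false if G does not fit even when all potential victims are removed.
func minimalPreemptionPriority(priorities []int32, fits func(p int32) bool) (int32, bool) {
	idx := sort.Search(len(priorities), func(i int) bool { return fits(priorities[i]) })
	if idx == len(priorities) {
		return 0, false
	}
	return priorities[idx], true
}
```
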

 ```
 <<[UNRESOLVED minimizing preemptions]>>
-The following algorithm is by far no optimal, but is simple to reason about and I would suggest it as
-a starting point:
-- assume that all potential victims on the list are removed and schedule the new workload W
-- go over the remaining potential victims starting from the highest priority and check if these can
-  be placed in the place they are currently running; if so remove from the potential victims
+The above algorithm is far from optimal, but it is simple to reason about and I would suggest it as
+a starting point.

 As a bonus we may consider few potential placements of the new workload W here and choose the one that
 somehow optimizes the number of victims. But that will become more critical once we get to
