Commit 5758be0

Expand on review comments
1 parent 0ff3958 commit 5758be0

keps/sig-scheduling/5710-workload-aware-preemption/README.md

Lines changed: 89 additions & 29 deletions
@@ -143,6 +143,10 @@ and many others) and bring the true value for every Kubernetes user.
 (e.g. caused by hardware failures)
 - Design rescheduling for workloads that will be preempted (rescheduling will
   be addressed in a separate dedicated KEP)
+- Change the preemption principle of avoiding preemption if a workload/pod can be scheduled without it.
+  If we decide to change that, it will be addressed in a dedicated KEP.
+- Propose any tradeoff between preemption and cluster scale-up.
+- Design workload-level preemption triggered by external schedulers

 ## Proposal

@@ -317,30 +321,78 @@ object (Workload, PodGroup, PodSubGroup, ...) corresponding to this pod.


 There is one direct implication of the above - the `pod.Spec.PriorityClassName` and `pod.Spec.Priority`
-may no longer reflect the actual pod priority. This can be misleading to users.
+may no longer reflect the actual pod priority, which could be misleading to users.

 ```
 <<[UNRESOLVED priority divergence]>>
 There are several options we can approach it (from least to most invasive):
-- Explain via documentation
-- Validating that if a pod is referencing a workload, `pod.Spec.PriorityClassName` equals
-  `workload.Spec.PriorityClassName`. However, `Workload` object potentially may not exist
-  yet on pod creation.
-- Making `pod.Spec.PriorityClassName` and `pod.Spec.Priority` mutable fields and having a
-  controller responsible for reconciling these. However, that doesn't fully address the
-  problems as divergence between the pod and PodTemplate in true workload object could also
-  be misleading.
-
-The validation option seems like the best option, if we can address the problem of not-yet
-existing `Workload` object (reversed validation?).
+- Describe the possible divergence via documentation
+- Expose the information about divergence in the API.
+  This would require introducing a new `Conditions` field in `workload.Status` and introducing
+  a dedicated condition like `PodsNotMatchingPriority` that will be set by either kube-scheduler
+  or a new workload-controller whenever it observes pods referencing a given `Workload` object
+  whose priority doesn't match the priority of the workload object.
+- Introducing an admission to validate that if a pod is referencing a workload object, its
+  `pod.Spec.PriorityClassName` equals `workload.Spec.PriorityClassName`. However, we allow creating
+  pods before the workload object, and there doesn't seem to be an easy way to avoid races.
+- Making `pod.Spec.PriorityClassName` and `pod.Spec.Priority` mutable fields and having a new
+  workload controller responsible for reconciling these. However, that could introduce another
+  divergence between the priority of pods and the priority defined in the PodTemplate in true
+  workload objects, which would introduce a similar level of confusion to users.
+
+If we could address the race in validations, that would seem like the desired option. However,
+I don't see an easy way to achieve it.
+Given that, we suggest proceeding with just exposing the information about divergence in the
+Workload status (second option) and potentially improving it later.
 <<[/UNRESOLVED]>>
 ```
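
For concreteness, here is a minimal Go sketch of the status-based option preferred above, assuming `workload.Status` gains a standard `Conditions` field. The function shape and the mismatch count are illustrative assumptions; `PodsNotMatchingPriority` is the condition name proposed above, not a final API.

```go
// Hypothetical sketch: surface priority divergence on the Workload status.
// Assumes workload.Status.Conditions is a standard []metav1.Condition slice.
package workloadstatus

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setPriorityDivergenceCondition records how many pods referencing the
// Workload carry a priority that differs from workload.Spec.PriorityClassName.
func setPriorityDivergenceCondition(conditions *[]metav1.Condition, mismatched int, generation int64) {
	cond := metav1.Condition{
		Type:               "PodsNotMatchingPriority", // condition name proposed above
		Status:             metav1.ConditionFalse,
		Reason:             "AllPodsMatchPriority",
		Message:            "all pods referencing this Workload use its PriorityClass",
		ObservedGeneration: generation,
	}
	if mismatched > 0 {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "PodPriorityDivergence"
		cond.Message = fmt.Sprintf("%d pod(s) reference this Workload but use a different PriorityClass", mismatched)
	}
	meta.SetStatusCondition(conditions, cond)
}
```
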

-The similar argument holds for preemption priority, but we argue that its mutable nature
-makes it infeasible for reconciling this information back to pod for scalability reasons
-(we can absolutely handle frequent updates to `Workload.Spec.PreemptionPriorityClassName`
-but we can't handle updating potentially hundreds of thousands of pods within that workload
-that frequently). In this case, we limit ourselves to documentation.
+It's worth mentioning here that we want to introduce the same defaulting rules for
+`workload.Spec.PriorityClassName` that we have for pods. Namely, if `PriorityClassName` is unset
+and there exists a PriorityClass marked as `globalDefault`, we default it to that value.
+This consistency will allow us to properly handle the case when users set neither pod
+nor workload priorities.
+Similarly, we will ensure that `PriorityClass.preemptionPolicy` works exactly the same way for
+workloads as for pods. Such a level of consistency would make adoption of the Workload API much easier.
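
As a rough illustration of the defaulting rule described above, the logic mirrors how pod defaulting treats a `globalDefault` PriorityClass. The helper shape below is an assumption, not the KEP's actual admission code.

```go
// Illustrative sketch only: default workload.Spec.PriorityClassName the same
// way pod priority defaulting treats a globalDefault PriorityClass.
package workloaddefaults

import schedulingv1 "k8s.io/api/scheduling/v1"

// defaultPriorityClassName returns the existing name if set; otherwise the
// name of the PriorityClass marked as globalDefault, if one exists.
func defaultPriorityClassName(current string, classes []schedulingv1.PriorityClass) string {
	if current != "" {
		return current
	}
	for _, pc := range classes {
		if pc.GlobalDefault {
			return pc.Name
		}
	}
	return "" // no global default: leave unset
}
```
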
+
+Moving to `PreemptionPriorityClassName`, the same issue of confusion holds (the actual priority
+set at the pod level may not reflect the priority used for preemption). We argue that its mutable
+nature makes it infeasible to reconcile this information back to pods for scalability reasons
+(we can absolutely handle frequent updates to `Workload.Spec.PreemptionPriorityClassName`,
+but we can't handle updating potentially hundreds or thousands of pods within that workload
+that frequently). So in this case, we limit ourselves to documentation.
+
+```
+<<[UNRESOLVED preemption cycles]>>
+If we allowed an arbitrary relation between scheduling priority and preemption priority,
+we could hit an infinite cycle of preemption. Consider an example where:
+- workload A has scheduling priority `high` and preemption priority `low`
+- workload B has scheduling priority `high` and preemption priority `low`
+In such a case, workload A can preempt workload B (`high` > `low`), but then workload B can
+also preempt workload A. This is definitely not desired.
+We can avoid the infinite cycle by ensuring that `scheduling priority <= preemption priority`.
+
+However, this also opens a question of whether we should allow setting an arbitrarily high preemption
+priority for low scheduling priority workloads. Arguably, we can claim that scheduling priority
+should be the ultimate truth, and if there is a workload with higher priority it should be
+able to preempt it.
+So the alternative model that we can consider is, instead of adding the concept of preemption
+priority, to introduce a concept of "preemption cost". In such a model, the workload with
+higher priority can always preempt lower priority ones, but if we need to choose between
+two workloads to preempt, the preemption cost may result in choosing the one with higher
+priority amongst these two. Consider the following example:
+- we want to schedule workload A with scheduling priority `high`
+- it needs to preempt one of the already running workloads
+- workload B has scheduling priority `med` but preemption cost `low`
+- workload C has scheduling priority `low` but preemption cost `high`
+In such a case, the preemption cost would result in choosing workload B for preemption. But
+if it gets recreated, it will preempt workload C, causing unnecessary cascading preemption.
+This is the reason why a cost-based model was discarded.
+
+So for now, we suggest introducing only additional validation that the scheduling priority is
+not higher than the preemption priority.
+<<[/UNRESOLVED]>>
+```
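
A hedged sketch of the validation suggested above, assuming both priorities have already been resolved from their PriorityClass names to numeric values; the names and function shape are assumptions, not this KEP's API.

```go
// Sketch: reject a Workload whose resolved scheduling priority is higher than
// its resolved preemption priority, which would otherwise permit the
// preemption cycle in the example above.
package workloadvalidation

import "fmt"

func validatePreemptionPriority(schedulingPriority, preemptionPriority int32) error {
	if schedulingPriority > preemptionPriority {
		return fmt.Errorf("scheduling priority (%d) must not be higher than preemption priority (%d)",
			schedulingPriority, preemptionPriority)
	}
	return nil
}
```
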

 ```
 <<[UNRESOLVED priority status]>>
@@ -356,6 +408,7 @@ We should introduce/describe `workload.status` to reflect:
 We start with describing at the high-level how existing pod-level preemption algorithm works.
 Below, we will show how to generalize it to workloads.

+If a pod P can be scheduled without triggering preemption, we don't consider preemption at all.
 To check if a pod P can be scheduled on a given node with preemption we:

 1. Identify the list of potential victims - all running pods with priority lower than the new pod P.
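
As a toy illustration of this step (the types and names are made up for the sketch, not scheduler code), identifying potential victims is just a priority filter over the pods running on the node:

```go
// Illustrative only: potential victims are the running pods whose priority is
// lower than the priority of the pending pod P.
package preemptionsketch

type podInfo struct {
	name     string
	priority int32
}

func potentialVictims(running []podInfo, p podInfo) []podInfo {
	var victims []podInfo
	for _, v := range running {
		if v.priority < p.priority {
			victims = append(victims, v)
		}
	}
	return victims
}
```
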
@@ -368,8 +421,8 @@ To check if a pod P can be scheduled on a given node with preemption we:
 1. From remaining potential victims, we start to reprieve pods starting from the highest priority
    and working down until the set of remaining victims still keeps the node feasible.

-Once we compute the feasibility and list of victims for all nodes, we score that and choose the
-best options.
+Once we find enough nodes feasible for preemption and the lists of victims for them, we score them and
+choose the best option.

 The above algorithm achieves our principles, as by eliminating highest priority pods first, it
 effectively tries to minimize the cascading preemptions later.
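
To make the reprieve step concrete, here is a self-contained toy sketch, simplified to a single scalar resource. All types and helper names are illustrative assumptions rather than the scheduler's actual code.

```go
// Toy sketch of the reprieve step: walk potential victims from highest to
// lowest priority and keep (reprieve) any victim whose presence still leaves
// the node feasible for the new pod.
package main

import (
	"fmt"
	"sort"
)

type pod struct {
	name     string
	priority int32
	request  int64
}

// reprieve returns the victims that still must be preempted so that newPod
// fits within capacity, preferring to keep high-priority victims.
// used is the amount currently consumed on the node, including the victims.
func reprieve(capacity, used int64, newPod pod, victims []pod) []pod {
	// Start from the state with all potential victims removed and newPod placed.
	free := capacity - used
	for _, v := range victims {
		free += v.request
	}
	free -= newPod.request

	sort.Slice(victims, func(i, j int) bool { return victims[i].priority > victims[j].priority })
	var remaining []pod
	for _, v := range victims {
		if v.request <= free {
			free -= v.request // node stays feasible: reprieve this victim
		} else {
			remaining = append(remaining, v) // still has to be preempted
		}
	}
	return remaining
}

func main() {
	victims := []pod{{"a", 100, 2}, {"b", 50, 3}, {"c", 10, 2}}
	fmt.Println(reprieve(8, 7, pod{"new", 1000, 4}, victims))
}
```

Running this preempts only the mid-priority pod `b`: the higher-priority `a` and then `c` are reprieved because the node stays feasible with them kept.
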
@@ -380,13 +433,13 @@ moving to the level of `Workload`, but also no longer operating at the level of
 We need to look at the cluster as a whole. With that in mind, keeping the algorithm efficient
 becomes a challenge, thus we modify to the approach below.

-To check if a workload W can be scheduled on a given cluster with preemption we:
+To check if a (gang) PodGroup G can be scheduled on a given cluster with preemption we:

 1. Identify the list of potential victims:
-   - all running workloads with (preemption) priority lower than the new workload W
-   - all individual pods (not being part of workloads) with priority lower than the new workload W
+   - all running workloads with (preemption) priority lower than the new pod group G
+   - all individual pods (not being part of workloads) with priority lower than the new pod group G

-1. If removing all the potential victims would not make the new workload W schedulable,
+1. If removing all the potential victims would not make the new pod group G schedulable,
    the workload is unschedulable even with preemption.

 ```
@@ -402,15 +455,22 @@ with N being number of workload/pods violating PDB.

 1. For remaining potential victims, using binary search across priorities find the minimal priority P
    for which scheduling the new workload W doesn't require preempting any workloads and/or pods with
-   priority higher than P. This allows to reduce the potential cascading preemptions later.
+   priority higher than P. This allows us to reduce the potential cascading preemptions later (see the sketch after this list).
+
+1. After eliminating all workloads and pods with priority higher than P (computed above) from the
+   potential victims list:
+
+   1. assume that all those potential victims are removed from the cluster and schedule the new pod group
+      G with that assumption
+   1. sort the potential victims to reflect their "importance" (tentative proposal - sort first by
+      their priority, within a single priority prefer workloads)
+   1. go over the list of potential victims in the above order checking if they can be placed
+      where they are currently running. If so, assume it back and remove it from the potential victims list.
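
This is the sketch referenced in the binary-search step above. It is hypothetical (`fits` is an assumed callback, not this KEP's code) and relies on feasibility being monotone in P: allowing preemption of victims up to a higher priority can only make pod group G easier to place, which is what makes binary search valid.

```go
// Illustrative sketch of the binary search across victim priorities.
// fits(p) reports whether pod group G becomes schedulable after removing
// every potential victim with priority <= p.
package preemptionsketch

import "sort"

// minimalPreemptionPriority returns the smallest priority P (from the sorted,
// ascending list of distinct victim priorities) for which fits(P) holds, and
// false if G does not fit even when all potential victims are removed.
func minimalPreemptionPriority(priorities []int32, fits func(p int32) bool) (int32, bool) {
	idx := sort.Search(len(priorities), func(i int) bool { return fits(priorities[i]) })
	if idx == len(priorities) {
		return 0, false
	}
	return priorities[idx], true
}
```
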

 ```
 <<[UNRESOLVED minimizing preemptions]>>
-The following algorithm is by far no optimal, but is simple to reason about and I would suggest it as
-a starting point:
-- assume that all potential victims on the list are removed and schedule the new workload W
-- go over the remaining potential victims starting from the highest priority and check if these can
-  be placed in the place they are currently running; if so remove from the potential victims
+The above algorithm is far from optimal, but it is simple to reason about and I would suggest it as
+a starting point.

 As a bonus we may consider few potential placements of the new workload W here and choose the one that
 somehow optimizes the number of victims. But that will become more critical once we get to
