Addressed review feedback
ahg-g committed Feb 5, 2023
1 parent 3e476d0 commit 36b2919
Showing 2 changed files with 53 additions and 39 deletions.

The first changed file is the KEP document (`README.md`):

## Summary

In [#3521](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3521-pod-scheduling-readiness)
we introduced PodSchedulingReadiness. The main use case for PodSchedulingReadiness is to
enable building external resource controllers (like extended schedulers and dynamic quota
managers) that decide when a pod should be eligible for scheduling by kube-scheduler.

This enhancement proposes to make the pod scheduling directives node selector and node
affinity mutable as long as the pod is gated and the update to node selector/affinity
further limits the existing constraints.

This allows external resource controllers to not only decide when the pod should be eligible
for scheduling, but also where it should land. This is similar to what we did for a suspended
Job in [#2926](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/2926-job-mutable-scheduling-directives).

## Motivation

Adding the ability to mutate a Pod's scheduling directives before it is allowed to schedule
gives an external resource controller (like a network topology-aware scheduler or a dynamic
quota manager) the ability to influence pod placement while at the same time offloading actual
pod-to-node assignment to kube-scheduler.

In general, this opens the door for a new pattern of adding scheduling features to Kubernetes:
building lightweight schedulers that implement features not supported by kube-scheduler, while
relying on the existing kube-scheduler to support all upstream features and handle the pod-to-node
binding. This pattern should be the preferred one when the custom feature doesn't require implementing
a scheduler plugin, which entails re-building and maintaining a custom kube-scheduler binary.


### Goals

- Allow mutating a pod that is blocked on a scheduling readiness gate with
a more constrained node affinity/selector.


### Non-Goals

- Allow mutating node affinity/selector of pods unconditionally.
- Restrict updating tolerations to gated pods. We already allow adding tolerations
unconditionally, and so restricting that to gated pods would not be a backward compatible change.
- Allow mutating pod (anti)affinity constraints. It is not clear how this can be done without
potentially violating a policy check that was previously done at pod CREATE.


## Proposal

Node affinity and selector are currently immutable for Pods. The proposal is to relax this validation
for Pods that are blocked on a scheduling readiness gate (i.e., not yet considered for scheduling by
kube-scheduler). Note that adding tolerations and mutating labels and annotations are
already allowed.

This has no impact on application controllers (like the ReplicaSet controller). Application controllers
generally have no dependency on the scheduling directives expressed in the application's pod template, and
they don't reconcile existing pods when the template changes.
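
To make the allowed mutation concrete, here is a minimal sketch using the Go types from
`k8s.io/api/core/v1`. The gate name `example.com/capacity`, the image, and the zone label value are
illustrative assumptions, not part of this KEP:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A pod created by a workload controller. A webhook injected a scheduling
	// gate, so kube-scheduler will not consider the pod yet.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "sample", Namespace: "default"},
		Spec: corev1.PodSpec{
			SchedulingGates: []corev1.PodSchedulingGate{{Name: "example.com/capacity"}},
			Containers:      []corev1.Container{{Name: "app", Image: "registry.k8s.io/pause:3.9"}},
		},
	}

	// While the pod is still gated, this KEP would allow tightening where it
	// can run, e.g. by adding a node selector that was previously empty.
	pod.Spec.NodeSelector = map[string]string{"topology.kubernetes.io/zone": "us-central1-a"}

	fmt.Println(pod.Spec.NodeSelector)
}
```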

#### Story 1

I want to build a controller that implements workload queueing and influences when and where a
workload should run. The workload controller creates the pods of a workload instance. To control
when and where the workload should run, I have a webhook that injects a scheduling readiness gate,
and a controller that removes the gate when capacity becomes available.

At the time of creating the workload, it may not be known on which subset of the nodes (e.g., a zone or
a VM type) a workload should run. To control where a workload should run, the controller updates
the node selector/affinity of the workload's pods; by doing so, the queue controller is able to influence the
scheduler and autoscaler decisions.
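
The following is a rough sketch (not part of this KEP) of what the queue controller's admission step
could look like with client-go: it narrows the node selection and removes the gate in a single `Update`
call. The `example.com/capacity` gate name and the zone label are hypothetical, and a real controller
would add conflict retries and richer error handling:

```go
package queue

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// admit narrows where the pod may run and ungates it in a single update.
func admit(ctx context.Context, cs kubernetes.Interface, namespace, name, zone string) error {
	pod, err := cs.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Further constrain node selection; allowed because the pod is still gated.
	if pod.Spec.NodeSelector == nil {
		pod.Spec.NodeSelector = map[string]string{}
	}
	pod.Spec.NodeSelector["topology.kubernetes.io/zone"] = zone

	// Remove the (hypothetical) gate injected by the webhook so that
	// kube-scheduler starts considering the pod.
	kept := make([]corev1.PodSchedulingGate, 0, len(pod.Spec.SchedulingGates))
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != "example.com/capacity" {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept

	_, err = cs.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```

Doing both the affinity/selector update and the gate removal in one call also matches the mitigation
discussed under Risks and Mitigations below.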

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

- New calls from a queue controller to the API server to update node affinity/selector.
  The mitigation is for such controllers to make a single API call that both updates node
  affinity/selector and removes the scheduling gate.

- App controllers may find it surprising that the pod they created is different from their
  template. Typically, app controllers don't continuously reconcile existing pods
  to match the template; they react only to changes to the pod template in the app's
  spec. Moreover, the side effects of this proposal are no different from those of a
  mutating webhook that injects affinity into the pod.

## Design Details

The pod update validation logic in the API server already allows updating labels and
annotations; it also allows adding new tolerations. What we need to do is to relax the update validation
on node affinity/selector.

There are two main problems we need to address:

1) Race conditions between updating the node affinity/selector constraints and kube-scheduler scheduling
the pod. As mentioned before, the solution is to restrict such updates to pods with scheduling gates, which
prevents the race condition.

2) Invalidating a decision made by a policy admission webhook. The concern here is that some of those policy
webhooks validate node affinity/selector on CREATE only and don't trigger that validation on UPDATE because
they assume those fields are immutable. To address this, the update will be restricted to only
further constraining the node selection: `nodeSelectorTerms` in `nodeAffinity`
are ORed, while the `nodeSelectorTerms[].matchExpressions` list and
the `nodeSelectorTerms[].matchFields` list are ANDed, and so for pods with existing
`nodeAffinity` terms, only additions of `NodeSelectorRequirements` to `matchExpressions`
or `matchFields` are allowed; for pods with no `nodeAffinity` terms,
adding anything is allowed.

No changes to existing terms will be allowed, since that makes it harder to reason about the change and is
not necessary to support the semantics of "further constraining the node selection".
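
As an illustration of these semantics (a sketch of the intended behavior, not the actual API server
validation code), consider a gated pod that already has one required node affinity term; the label keys
and values below are examples only:

```go
package example

import corev1 "k8s.io/api/core/v1"

// Existing required node affinity on the gated pod: a single term that pins the
// pod to two zones.
var existing = corev1.NodeSelector{
	NodeSelectorTerms: []corev1.NodeSelectorTerm{{
		MatchExpressions: []corev1.NodeSelectorRequirement{{
			Key:      "topology.kubernetes.io/zone",
			Operator: corev1.NodeSelectorOpIn,
			Values:   []string{"us-central1-a", "us-central1-b"},
		}},
	}},
}

// Allowed update: append another requirement to the existing term's
// matchExpressions. Requirements within a term are ANDed, so this can only
// narrow the set of eligible nodes.
var allowed = corev1.NodeSelector{
	NodeSelectorTerms: []corev1.NodeSelectorTerm{{
		MatchExpressions: []corev1.NodeSelectorRequirement{
			existing.NodeSelectorTerms[0].MatchExpressions[0],
			{
				Key:      "node.kubernetes.io/instance-type",
				Operator: corev1.NodeSelectorOpIn,
				Values:   []string{"n2-standard-8"},
			},
		},
	}},
}

// Rejected update: adding a second NodeSelectorTerm. Terms are ORed, so a new
// term can expand the set of eligible nodes, and it changes the shape of the
// existing selector rather than only appending requirements.
var rejected = corev1.NodeSelector{
	NodeSelectorTerms: append(existing.NodeSelectorTerms, corev1.NodeSelectorTerm{
		MatchExpressions: []corev1.NodeSelectorRequirement{{
			Key:      "topology.kubernetes.io/zone",
			Operator: corev1.NodeSelectorOpIn,
			Values:   []string{"us-central1-c"},
		}},
	}),
}
```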

##### Integration tests

- Test that adding node affinity to a pod with no scheduling gates should be rejected.
- Test that adding node affinity/selector that relaxes the existing constraints should always be rejected, even for pods with scheduling gates (a sketch of this case follows the list).
- Test that adding node affinity/selector that further constrains the node selection of a pod with scheduling gates should be allowed. Also, test that the
scheduler observes the update and acts on it when the scheduling gate is lifted.
- Test feature enablement/disablement.
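
Below is a rough sketch of how the rejection case could be exercised with client-go against a test API
server. The `newGatedPod` helper, the namespace, and the zone value are assumptions for illustration; the
real tests would live in the integration test suite:

```go
package integration

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// testRelaxingAffinityRejected assumes a running test API server reachable
// through cs, and a helper newGatedPod that returns a pod with a scheduling
// gate and an existing zone-based required node affinity.
func testRelaxingAffinityRejected(t *testing.T, cs kubernetes.Interface, newGatedPod func() *corev1.Pod) {
	ctx := context.Background()

	pod, err := cs.CoreV1().Pods("default").Create(ctx, newGatedPod(), metav1.CreateOptions{})
	if err != nil {
		t.Fatal(err)
	}

	// Attempt to widen the node selection by appending a new (ORed) term.
	terms := &pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
	*terms = append(*terms, corev1.NodeSelectorTerm{
		MatchExpressions: []corev1.NodeSelectorRequirement{{
			Key:      "topology.kubernetes.io/zone",
			Operator: corev1.NodeSelectorOpIn,
			Values:   []string{"another-zone"},
		}},
	})

	if _, err := cs.CoreV1().Pods("default").Update(ctx, pod, metav1.UpdateOptions{}); !apierrors.IsInvalid(err) {
		t.Errorf("expected Invalid error for an update that relaxes node affinity, got: %v", err)
	}
}
```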

### Graduation Criteria

We will release the feature directly in Beta state. Because the feature is opt-in through the use of
a new field, there is no benefit in having an alpha release.

#### Beta
- Feature implemented behind the same feature flag used for the Pod Scheduling Readiness feature (`PodSchedulingReadiness`)
- Unit and integration tests passing

#### GA

The second changed file is the KEP metadata (`kep.yaml`):

see-also:
- "/keps/sig-scheduling/2926-job-mutable-scheduling-directives"
- "/keps/sig-scheduling/3521-pod-scheduling-readiness"

# The target maturity stage in the current dev cycle for this KEP.
stage: beta
milestone:
stable: "v1.28"

feature-gates:
- name: PodSchedulingReadiness
components:
- kube-apiserver
- kube-scheduler
disable-supported: true

# The following PRR answers are required at beta release
