Addressed review feedback
ahg-g committed Feb 5, 2023
1 parent 3e476d0 commit 36b2919
Showing 2 changed files with 53 additions and 39 deletions.

The first changed file is the KEP document (`README.md`):

## Summary

In [#3521](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3521-pod-scheduling-readiness)
we introduced PodSchedulingReadiness. The main use case for PodSchedulingReadiness is to
enable building external resource controllers (like extended schedulers and dynamic quota
managers) that decide when a pod should be eligible for scheduling by kube-scheduler.

This enhancement proposes to make the pod scheduling directives node selector and node
affinity mutable as long as the pod is gated and the update to node selector/affinity
further limits the existing constraints.

This allows external resource controllers to not only decide when the pod should be eligible
for scheduling, but also where it should land. This is similar to what we did for a suspended
Job in [#2926](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/2926-job-mutable-scheduling-directives).

## Motivation

Adding the ability to mutate a Pod's scheduling directives before it is allowed to schedule
gives an external resource controller (like a network topology-aware scheduler or a dynamic
quota manager) the ability to influence pod placement while at the same time offloading actual
pod-to-node assignment to kube-scheduler.

In general, this opens the door for a new pattern of adding scheduling features to Kubernetes:
building lightweight schedulers that implement features not supported by kube-scheduler, while
relying on the existing kube-scheduler to support all upstream features and handle the pod-to-node
binding. This pattern should be the preferred one when the custom feature doesn't require implementing
a scheduler plugin, which entails re-building and maintaining a custom kube-scheduler binary.


### Goals

- Allow mutating a pod that is blocked on a scheduling readiness gate with
a more constrained node affinity/selector.


### Non-Goals

- Allow mutating node affinity/selector of pods unconditionally.
- Restrict updating tolerations to gated pods. We already allow adding tolerations
unconditionally, and so restricting that to gated pods would not be a backward compatible change.
- Allow mutating pod (anti)affinity constraints. It is not clear how this can be done without
potentially violating a policy check that was previously done at pod CREATE.


## Proposal

Node affinity and selector are currently immutable for Pods. The proposal is to relax this validation
for Pods that are blocked on a scheduling readiness gate (i.e., not yet considered for scheduling by
kube-scheduler). Note that adding tolerations and mutating labels and annotations are
already allowed.

This has no impact on application controllers (like the ReplicaSet controller). Application controllers
generally have no dependency on the scheduling directives expressed in the application's pod template, and
they don't reconcile existing pods when the template changes.
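
To make the allowed mutation concrete, here is a minimal sketch using the Go types from
`k8s.io/api/core/v1`. The gate name `example.com/capacity`, the image, and the zone label value are
illustrative assumptions, not part of this KEP:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A pod created by a workload controller. A webhook injected a scheduling
	// gate, so kube-scheduler will not consider the pod yet.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "sample", Namespace: "default"},
		Spec: corev1.PodSpec{
			SchedulingGates: []corev1.PodSchedulingGate{{Name: "example.com/capacity"}},
			Containers:      []corev1.Container{{Name: "app", Image: "registry.k8s.io/pause:3.9"}},
		},
	}

	// While the pod is still gated, this KEP would allow tightening where it
	// can run, e.g. by adding a node selector that was previously empty.
	pod.Spec.NodeSelector = map[string]string{"topology.kubernetes.io/zone": "us-central1-a"}

	fmt.Println(pod.Spec.NodeSelector)
}
```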

#### Story 1

I want to build a controller that implements workload queueing and influences when and where a
workload should run. The workload controller creates the pods of a workload instance. To control
when and where the workload should run, I have a webhook that injects a scheduling readiness gate,
and a controller that removes the gate when capacity becomes available.

At the time of creating the workload, it may not be known on which subset of the nodes (e.g., a zone or
a VM type) a workload should run. To control where a workload should run, the controller updates
the node selector/affinity of the workload's pods; by doing so, the queue controller is able to influence the
scheduler and autoscaler decisions.
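
The following is a rough sketch (not part of this KEP) of what the queue controller's admission step
could look like with client-go: it narrows the node selection and removes the gate in a single `Update`
call. The `example.com/capacity` gate name and the zone label are hypothetical, and a real controller
would add conflict retries and richer error handling:

```go
package queue

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// admit narrows where the pod may run and ungates it in a single update.
func admit(ctx context.Context, cs kubernetes.Interface, namespace, name, zone string) error {
	pod, err := cs.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Further constrain node selection; allowed because the pod is still gated.
	if pod.Spec.NodeSelector == nil {
		pod.Spec.NodeSelector = map[string]string{}
	}
	pod.Spec.NodeSelector["topology.kubernetes.io/zone"] = zone

	// Remove the (hypothetical) gate injected by the webhook so that
	// kube-scheduler starts considering the pod.
	kept := make([]corev1.PodSchedulingGate, 0, len(pod.Spec.SchedulingGates))
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != "example.com/capacity" {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept

	_, err = cs.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```

Doing both the affinity/selector update and the gate removal in one call also matches the mitigation
discussed under Risks and Mitigations below.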

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

- New calls from a queue controller to the API server to update node affinity/selector.
  The mitigation is for such controllers to make a single API call that both updates node
  affinity/selector and removes the scheduling gate.

- App controllers may find it surprising that the pod they created is different from their
  template. Typically, app controllers don't continuously reconcile existing pods
  to match the template; they react only to changes to the pod template in the app's
  spec. Moreover, the side effects of this proposal are no different from those of a
  mutating webhook that injects affinity into the pod.

## Design Details

The pod update validation logic in the API server already allows updating labels and
annotations; it also allows adding new tolerations. What we need to do is to relax the update validation
on node affinity/selector.

There are two main problems we need to address:

1) Race conditions between updating the node affinity/selector constraints and kube-scheduler scheduling
the pod. As mentioned before, the solution is to restrict such updates to pods with scheduling gates, which
prevents the race condition.

2) Invalidating a decision made by a policy admission webhook. The concern here is that some of those policy
webhooks validate node affinity/selector on CREATE only and don't trigger that validation on UPDATE because
they assume those fields are immutable. To address this, the update will be restricted to only
further constraining the node selection: `nodeSelectorTerms` in `nodeAffinity`
are ORed, while the `nodeSelectorTerms[].matchExpressions` list and
the `nodeSelectorTerms[].matchFields` list are ANDed, and so for pods with existing
`nodeAffinity` terms, only additions of `NodeSelectorRequirements` to `matchExpressions`
or `matchFields` are allowed; for pods with no `nodeAffinity` terms,
adding anything is allowed.

No changes to existing terms will be allowed, since that makes it harder to reason about the change and is
not necessary to support the semantics of "further constraining the node selection".
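
As an illustration of these semantics (a sketch of the intended behavior, not the actual API server
validation code), consider a gated pod that already has one required node affinity term; the label keys
and values below are examples only:

```go
package example

import corev1 "k8s.io/api/core/v1"

// Existing required node affinity on the gated pod: a single term that pins the
// pod to two zones.
var existing = corev1.NodeSelector{
	NodeSelectorTerms: []corev1.NodeSelectorTerm{{
		MatchExpressions: []corev1.NodeSelectorRequirement{{
			Key:      "topology.kubernetes.io/zone",
			Operator: corev1.NodeSelectorOpIn,
			Values:   []string{"us-central1-a", "us-central1-b"},
		}},
	}},
}

// Allowed update: append another requirement to the existing term's
// matchExpressions. Requirements within a term are ANDed, so this can only
// narrow the set of eligible nodes.
var allowed = corev1.NodeSelector{
	NodeSelectorTerms: []corev1.NodeSelectorTerm{{
		MatchExpressions: []corev1.NodeSelectorRequirement{
			existing.NodeSelectorTerms[0].MatchExpressions[0],
			{
				Key:      "node.kubernetes.io/instance-type",
				Operator: corev1.NodeSelectorOpIn,
				Values:   []string{"n2-standard-8"},
			},
		},
	}},
}

// Rejected update: adding a second NodeSelectorTerm. Terms are ORed, so a new
// term can expand the set of eligible nodes, and it changes the shape of the
// existing selector rather than only appending requirements.
var rejected = corev1.NodeSelector{
	NodeSelectorTerms: append(existing.NodeSelectorTerms, corev1.NodeSelectorTerm{
		MatchExpressions: []corev1.NodeSelectorRequirement{{
			Key:      "topology.kubernetes.io/zone",
			Operator: corev1.NodeSelectorOpIn,
			Values:   []string{"us-central1-c"},
		}},
	}),
}
```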

##### Integration tests

- Test that adding node affinity to a pod with no scheduling gates should be rejected.
- Test that adding node affinity/selector that relaxes the existing constraints should always be rejected, even for pods with scheduling gates (a sketch of this case follows the list).
- Test that adding node affinity/selector that further constrains the node selection of a pod with scheduling gates should be allowed. Also, test that the
scheduler observes the update and acts on it when the scheduling gate is lifted.
- Test feature enablement/disablement.
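
Below is a rough sketch of how the rejection case could be exercised with client-go against a test API
server. The `newGatedPod` helper, the namespace, and the zone value are assumptions for illustration; the
real tests would live in the integration test suite:

```go
package integration

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// testRelaxingAffinityRejected assumes a running test API server reachable
// through cs, and a helper newGatedPod that returns a pod with a scheduling
// gate and an existing zone-based required node affinity.
func testRelaxingAffinityRejected(t *testing.T, cs kubernetes.Interface, newGatedPod func() *corev1.Pod) {
	ctx := context.Background()

	pod, err := cs.CoreV1().Pods("default").Create(ctx, newGatedPod(), metav1.CreateOptions{})
	if err != nil {
		t.Fatal(err)
	}

	// Attempt to widen the node selection by appending a new (ORed) term.
	terms := &pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
	*terms = append(*terms, corev1.NodeSelectorTerm{
		MatchExpressions: []corev1.NodeSelectorRequirement{{
			Key:      "topology.kubernetes.io/zone",
			Operator: corev1.NodeSelectorOpIn,
			Values:   []string{"another-zone"},
		}},
	})

	if _, err := cs.CoreV1().Pods("default").Update(ctx, pod, metav1.UpdateOptions{}); !apierrors.IsInvalid(err) {
		t.Errorf("expected Invalid error for an update that relaxes node affinity, got: %v", err)
	}
}
```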

### Graduation Criteria

We will release the feature directly in Beta state. Because the feature is opt-in through the use of
a new field, there is no benefit in having an alpha release.

#### Beta
- Feature implemented behind the same feature flag used for the Pod Scheduling Readiness feature (`PodSchedulingReadiness`)
- Unit and integration tests passing

#### GA

The second changed file is the KEP metadata (`kep.yaml`):

see-also:
- "/keps/sig-scheduling/2926-job-mutable-scheduling-directives"
- "/keps/sig-scheduling/3521-pod-scheduling-readiness"

# The target maturity stage in the current dev cycle for this KEP.
stage: beta
milestone:
stable: "v1.28"

feature-gates:
- name: PodSchedulingReadiness
components:
- kube-apiserver
- kube-scheduler
disable-supported: true

# The following PRR answers are required at beta release
