KEP-5381: Mutable PersistentVolume Node Affinity #5382
Conversation
huww98 commented Jun 6, 2025
- One-line PR description: adding new KEP
- Issue link: Mutable PersistentVolume Node Affinity #5381
- Other comments:
| 1. Change APIServer validation to allow `PersistentVolume.spec.nodeAffinity` to be mutable.
| 2. Change CSI Specification to allow `ControllerModifyVolume` to return a new accessibility requirement.
Points 2 and 3, though not directly covered by the title, are connected as upstream and downstream changes. They're currently combined, but if needed we'll create a new KEP to separate them later.
Is this a new field, or can it fit into the mutable parameters map? Actually, never mind, there is the proto change below.
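As a reference for point 1, here is a minimal sketch of what the relaxed update validation could look like, assuming a hypothetical feature gate; the function name and gate are illustrative and not the actual kubernetes/kubernetes validation code:

```go
package validation

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// validateNodeAffinityUpdate is a hypothetical helper: when the (assumed)
// mutable-PV-node-affinity feature gate is on, spec.nodeAffinity may change on
// update; otherwise the existing immutability check stays in place.
func validateNodeAffinityUpdate(newPV, oldPV *corev1.PersistentVolume, mutableNodeAffinityEnabled bool) error {
	if mutableNodeAffinityEnabled {
		return nil // any update to spec.nodeAffinity is accepted
	}
	if !apiequality.Semantic.DeepEqual(newPV.Spec.NodeAffinity, oldPV.Spec.NodeAffinity) {
		return fmt.Errorf("spec.nodeAffinity: field is immutable")
	}
	return nil
}
```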
| These modifications can be expressed by `VolumeAttributeClass` in Kubernetes.
| But sometimes a modification to a volume comes with a change to its accessibility, such as:
| 1. migration of data from one zone to regional storage;
| 2. enabling features that are not supported by all the client nodes.
Though it deviates from nodeAffinity's original design (geography-related), this scenario is useful for users, hence we include it.
How do we prevent the new update to the nodeAffinity from being incompatible with the current node?
Or maybe we have some rules so that you can only widen the nodeAffinity but not shrink it.
I think we just don't enforce this. We ask SPs not to disrupt the running workload. With this promise, we allow the SP to return any new nodeAffinity, even if it is not compatible with currently published nodes. The affinity will be ignored for running pods, just like `Pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution`.
| * or being rejected from nodes that actually can access the volume, getting stuck in a `Pending` state.
| By making `PersistentVolume.spec.nodeAffinity` field mutable,
| we give storage providers a chance to propagate latest accessibility requirement to the scheduler,
How do we measure the consistency guarantee of the "latest" requirements?
That's a good question, but after some discussion we don't seem to have a suitable solution, especially since there is also an informer cache. We noticed that #4876 doesn't mention this problem either, so can we just ignore it?
No, we can't just ignore it. If I understood Sunny's question correctly, it's trying to poke at potential race conditions or issues which could lead to unexpected results. For example (in the current proposal), if two ModifyVolume calls are made in parallel with different accessibility requirements, how do we know what "latest" is? In other words, what are the mechanisms that ensure the actual state of the world reflects the desired state?
These sorts of questions were addressed in KEP 4876 - #4875 (comment).
I think this is fine. The latest requirement is the one corresponding to the mutable_parameters desired by the CO, passed in ControllerModifyVolumeRequest. In Kubernetes, external-resizer saves the VAC name to the PV status after ModifyVolume finishes; it can save the returned requirement to the PV just before that.
If anything goes wrong (race, crash, etc.), the CO can always issue another ControllerModifyVolume request to fetch the latest topology. This means we should require the SP to return accessible_topology if supported. I will update the KEP.
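A rough sketch of the conversion step implied above, assuming the SP returns CSI-style topology segments that external-resizer would persist to the PV before recording the applied VAC; all names here are illustrative:

```go
package resizer

import corev1 "k8s.io/api/core/v1"

// topologyToNodeAffinity converts SP-returned accessible_topology segments
// (each map is one alternative topology) into the PV's spec.nodeAffinity.
// Segments within one topology are ANDed; the topologies themselves are ORed,
// which maps naturally onto NodeSelectorTerms.
func topologyToNodeAffinity(topologies []map[string]string) *corev1.VolumeNodeAffinity {
	if len(topologies) == 0 {
		return nil // nothing returned: keep the existing affinity unchanged
	}
	terms := make([]corev1.NodeSelectorTerm, 0, len(topologies))
	for _, segments := range topologies {
		var term corev1.NodeSelectorTerm
		for key, value := range segments {
			term.MatchExpressions = append(term.MatchExpressions, corev1.NodeSelectorRequirement{
				Key:      key,
				Operator: corev1.NodeSelectorOpIn,
				Values:   []string{value},
			})
		}
		terms = append(terms, term)
	}
	return &corev1.VolumeNodeAffinity{
		Required: &corev1.NodeSelector{NodeSelectorTerms: terms},
	}
}
```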
Are we still planning to do ControllerModifyVolume with nodeAffinity changes? I was thinking that in CSI meetings there was a discussion about using a new CSI API specific to nodeAffinity.
Yes, but maybe delayed. We want to land this KEP as soon as possible. This should extend ControllerModifyVolume and should not block it from GA.
| | Condition | gRPC Code | Description | Recovery Behavior |
| |-----------|-----------|-------------|-------------------|
| | Topology conflict | 9 FAILED_PRECONDITION | Indicates that the CO has requested a modification that would make the volume inaccessible to some already attached nodes. | Caller MAY detach the volume from the nodes that are in conflict and retry. |
Is this an infeasible error? What is the retry/rollback story for the users?
Kubernetes does not perform any automatic correction for this; it is exposed as an Event, like other errors.
Currently, external-resizer should just retry with exponential backoff. In the future, maybe we can retry after any ControllerUnpublishVolume succeeds, perhaps once external-resizer and external-attacher are combined into the same process.
Users should just roll back the VAC to the previous value to cancel the request.
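On the SP side, the error row above could translate to something like the following hypothetical guard inside ControllerModifyVolume (the zone-based conflict check is an assumption for illustration):

```go
package driver

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkTopologyConflict is a hypothetical SP-side guard: if the requested
// change would cut off nodes the volume is still attached to, fail with
// FAILED_PRECONDITION so the CO can back off, detach, or let the user roll
// back the VAC.
func checkTopologyConflict(requestedZones, attachedNodeZones []string) error {
	allowed := make(map[string]bool, len(requestedZones))
	for _, z := range requestedZones {
		allowed[z] = true
	}
	for _, z := range attachedNodeZones {
		if !allowed[z] {
			return status.Errorf(codes.FailedPrecondition,
				"modification would make the volume inaccessible from attached node in zone %q", z)
		}
	}
	return nil
}
```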
| // from a given node when scheduling workloads.
| // This field is OPTIONAL. If it is not specified, the CO SHOULD assume
| // the topology is not changed by this modify volume request.
| repeated Topology accessible_topology = 5;
This should be an alpha_field.
| We will extend CSI specification to add:
| ```protobuf
| message ControllerModifyVolumeResponse {
|   option (alpha_message) = true;
We are moving VolumeAttributesClass to GA in 1.34. That means we need to move ControllerModifyVolume in CSI spec from alpha to GA and cut a CSI spec release before 1.34 release.
I'm wondering if we can move this message to GA while adding a new alpha_field.
I think we have two options to address this issue:
- Target the entire KEP at v1.35
- Isolate the VAC part, limiting mutable PV node affinity to v1.34; we still have one week before enhancement freeze
Maybe we can talk about this at the sig-storage meeting.
| If this happens, the pod will be stuck in a `ContainerCreating` state.
| Users will have to manually delete the pod, or use the Kubernetes [descheduler](https://github.com/kubernetes-sigs/descheduler) or similar.
| We propose to add an `in_progress` field in `ControllerModifyVolumeResponse` to allow a more timely affinity update,
Before altering the topology to an unfeasible state, consider adding clearer guidance on handling scheduled pods.
| // Indicates whether the CO allows the SP to update the topology
| // as a part of the modification.
| bool allow_topology_updates = 4;
We also need a new parameter in mutable_parameters, right? Let's include an example demonstrating the usage of ControllerModifyVolumeRequest and ControllerModifyVolumeResponse with allow_topology_updates.
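A possible shape for such an example, using local stand-in structs since the proposed fields are not in any released CSI spec; all names and values below are illustrative:

```go
package example

// Local stand-ins mirroring the proposed CSI messages; not the generated
// csi.proto types.
type Topology struct {
	Segments map[string]string
}

type ControllerModifyVolumeRequest struct {
	VolumeID             string
	MutableParameters    map[string]string // from the VolumeAttributesClass
	AllowTopologyUpdates bool              // proposed field: CO opts in to topology changes
}

type ControllerModifyVolumeResponse struct {
	AccessibleTopology []Topology // proposed field: new accessibility requirement, if changed
}

// exampleRoundTrip: the CO asks for cloud_essd and allows topology updates;
// the SP performs the change and reports where the volume is now accessible.
func exampleRoundTrip() ControllerModifyVolumeResponse {
	_ = ControllerModifyVolumeRequest{
		VolumeID:             "vol-0123456789",
		MutableParameters:    map[string]string{"type": "cloud_essd"},
		AllowTopologyUpdates: true,
	}
	return ControllerModifyVolumeResponse{
		AccessibleTopology: []Topology{
			{Segments: map[string]string{"topology.example.com/zone": "zone-a"}},
		},
	}
}
```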
According to the last Zoom meeting, I've made the following changes:
Please see the KEP for the reasons and details.
Consider a scenario where the scheduler selects "node-A" for a pod and annotates the corresponding PVC with volume.kubernetes.io/selected-node: node-A. If an administrator concurrently modifies the nodeAffinity of the underlying PV to only allow scheduling on "node-B", a conflict will arise. The pod, when it attempts to start on "node-A", may fail if the volume cannot be mounted there due to the updated affinity rules. How do we resolve this? Or do we have a list of scenarios that are considered against the guideline, so the admin or user should not be doing that?
@sunnylovestiramisu I think this is discussed in the Caveats section as "Possible race condition". I've just learnt we reject pods when the attachment limit is exceeded. I think we can use a similar method to deal with the mis-scheduled pods, i.e., when we detect a PV node affinity violation and a specific error code at ControllerPublishVolume, we mark the Pod as Failed and expect some controller to re-create it.
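A rough illustration of the kubelet-side check being discussed, as a simplified matcher that handles only the `In` operator; the real implementation would reuse the existing node-affinity helpers and then mark the Pod as Failed:

```go
package kubeletcheck

import corev1 "k8s.io/api/core/v1"

// nodeSatisfiesPVAffinity reports whether the node's labels satisfy the PV's
// spec.nodeAffinity.required. Terms are ORed; expressions within a term are
// ANDed. Only the In operator is handled in this sketch.
func nodeSatisfiesPVAffinity(nodeLabels map[string]string, affinity *corev1.VolumeNodeAffinity) bool {
	if affinity == nil || affinity.Required == nil {
		return true // no affinity: the volume is treated as accessible everywhere
	}
	for _, term := range affinity.Required.NodeSelectorTerms {
		if termMatches(term, nodeLabels) {
			return true
		}
	}
	return false
}

func termMatches(term corev1.NodeSelectorTerm, nodeLabels map[string]string) bool {
	if len(term.MatchExpressions) == 0 {
		return false // an empty term matches no nodes
	}
	for _, req := range term.MatchExpressions {
		if req.Operator != corev1.NodeSelectorOpIn {
			return false // other operators are out of scope for this sketch
		}
		value, ok := nodeLabels[req.Key]
		if !ok || !contains(req.Values, value) {
			return false
		}
	}
	return true
}

func contains(values []string, v string) bool {
	for _, s := range values {
		if s == v {
			return true
		}
	}
	return false
}
```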
| `PersistentVolume.spec.nodeAffinity.required` should work like
| `Pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution`.
| It is never re-evaluated when the pod is already running.
| It is the storage provider's responsibility to ensure that the running workload is not interrupted.
If the PVC is used by a Pod and the user changes PersistentVolume.spec.nodeAffinity, the nodeAffinity rule may be out of sync with the node where the pod is running?
Yeah, we think the modified PersistentVolume.spec.nodeAffinity shouldn't impact running pods; it will take effect during the next scheduling or when a new pod claims it.
Whoever modifies nodeAffinity should ensure this will not affect the running pods.
| ### Integrate with the CSI spec and VolumeAttributeClass
| We have proposed the plan to integrate in the previous version of this KEP.
| But we did not reach consensus due to lack of SPs wanting to implement this feature.
Can you summarize what the solution is?
Should we propose a separate KEP to discuss this, to avoid confusion in this one? (btw, we already wrote about this part in a previous version of this KEP)
For that previous version, please refer to the previous commit in this PR
| Type A node only supports cloud_ssd, while Type B node supports both cloud_ssd and cloud_essd.
| We will only allow the modification if the volume is attached to type B nodes.
| And I want to make sure the Pods using the new cloud_essd volume are not scheduled to type A nodes.
Who will be updating PersistentVolume.spec.nodeAffinity? Some downstream controller?
Yeah, we initially planned to implement this in VAC, but due to unresolved questions from the last few months' meetings, we've decided to focus on making PV nodeAffinity mutable for now.
Eventually we want external-resizer to update it. Currently, we intend to do this in our fork of external-resizer, or with a dedicated controller.
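For the cloud_essd example above, the affinity written by such a controller might look roughly like this; the node label key is a made-up placeholder for however the infrastructure advertises disk-category support:

```go
package example

import corev1 "k8s.io/api/core/v1"

// essdOnlyAffinity restricts scheduling of pods using the modified volume to
// nodes that advertise cloud_essd support. The label key is hypothetical.
func essdOnlyAffinity() *corev1.VolumeNodeAffinity {
	return &corev1.VolumeNodeAffinity{
		Required: &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{
				MatchExpressions: []corev1.NodeSelectorRequirement{{
					Key:      "node.example.com/disk-category",
					Operator: corev1.NodeSelectorOpIn,
					Values:   []string{"cloud_essd"},
				}},
			}},
		},
	}
}
```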
I've done a prototype implementation at kubernetes/kubernetes#134339. I also added kubelet behavior for when a pod with an incompatible PV nodeAffinity is running on the node, tested with the implementation mentioned above.
filled Production Readiness Review Questionnaire
| we fail the Pods that use them, since we are sure the Pods have never been running.
| (see [Handling race condition](#handling-race-condition) below)
| - For CSI drivers with `requiresRepublish` set to true, we will stop calling NodePublishVolume periodically, and an event is emitted.
| - For CSI drivers with `requiresRepublish` set to false, an event is emitted on kubelet restart. Otherwise the pod should continue to run.
I think we should periodically emit the error event since it doesn't cause pod failures, but it may block volume-related node operations like filesystem expansion. Alerting the user and guiding them to fix PV affinity is necessary.
| Whoever modifies the `PersistentVolume.spec.nodeAffinity` field should ensure that
| no running Pods on nodes with incompatible labels are using the PV.
| Kubernetes will not verify this. It is expensive and racy.
I'm probably missing something, so let me try to understand the proposal better.
Imagine that my PV is attached to nodes:
- node N1, where pod P1 using it is running
- node N2, where pod P2 using it is running
I want to update nodeAffinity in a way that N2 no longer satisfies the new affinity.
- What happens to pod P2 if I don't delete it manually?
- What happens to the PV itself after its nodeAffinity is updated? Will it be automatically detached from N2?
What happens to pod P2 if I don't delete it manually?
It will continue to run if the volume is actually accessible, with some known limitations listed in the next paragraph.
What happens to the PV itself after its nodeAffinity is updated? Will it be automatically detached from N2?
No. We will still detach it from N2 after P2 is deleted. Basically, the new affinity only affects new pods.
Can you add that explicitly to the doc to make it clear?
| This feature involves changes to the kubelet and APIServer. But they are not strongly coupled.
| An n-3 kubelet will not be able to fail the mis-scheduled pods. Users can still manually delete the pods. Otherwise it should be fine.
So what happens to those pods? In what state do they get stuck? Is it ContainerCreating?
Ensuring that kubelet is able to handle that correctly (i.e. all supported kubelets handle that) seems like at least a GA blocker to me.
So what happens to those pods? In what state do they get stuck? Is it ContainerCreating?
Yes, ContainerCreating.
seems like at least a GA blocker to me.
If the user does not actually modify the PV node affinity, there will be no such mis-scheduled pods. Users should upgrade and enable the feature gate for all components before using this new feature, right?
| If a pod is scheduled to a node that is incompatible with the PV's nodeAffinity, the pod will fail.
| Previously, it would be stuck in `ContainerCreating` status.
| This should be rare, since we don't allow PV nodeAffinity to be updated,
I don't understand this - the whole goal of this KEP is to allow updating it.
I mean this should be rare before enabling this feature, so that when the feature is enabled, it is rare that some pod existing in the cluster will be affected by the behavior change.
| PV `spec.nodeAffinity` becomes mutable.
| If a pod is scheduled to a node that is incompatible with the PV's nodeAffinity, the pod will fail.
| Previously, it would be stuck in `ContainerCreating` status.
I'm wondering if we shouldn't actually roll this out with two separate feature gates:
- the kubelet one that handles the mis-scheduled pods; ensure that this graduates to GA on all supported versions before graduating the apiserver FG
- the FG that allows for mutating nodeAffinity
That may be an option. But we can instruct users not to use this feature before the kubelet rollout has finished, to avoid the race condition.
Another KEP also embraces a similar mechanism to handle the race condition when the node volume count is exceeded. It also uses one feature gate.
If we do use two feature gates, should we track them in two KEPs?
The difference that I see between these KEPs is that:
- in the mutable allocatable one, the whole workflow is generally through system components; there is no user in the picture (the update is triggered by the CSI driver), so as an administrator I have full control over it
- in this particular KEP:
  (a) we additionally have a user in the picture [a user is the one that can trigger the nodeAffinity change]
  (b) they can do that in a setup where kubelet will keep the pod stuck in ContainerCreating
(b) is my primary concern here.
No - definitely not two KEPs. There are other KEPs with multiple FGs.
For that KEP, there is likely a human operation that triggers the allocatable count change. Someone may have re-configured the node, attached another device to the node, etc.
And note this should be rare; it is a race condition that only happens when a user updates the PV and creates a Pod at the same time. It is easily resolvable by upgrading kubelet afterwards or manually deleting the Pod.
If I understand this correctly, we need to wait for at least 3 releases between enabling these two feature gates. That is not a short time. Would that be overly cautious? Since the enhancement freeze is near, can we merge this PR and keep the discussion in its dedicated PR? I think this is not blocking for alpha.
For me, changing a PV object is a highly privileged operation, very similar to a Node change. I think we could assume that users who have such permissions know enough about what they can cause by such a change. They may choose to ensure that no Pod uses the PV, manually resolve the race, or postpone their PV modification until all kubelets have the feature gate enabled.
OK fair. I would like these thoughts to be better reflected in the KEP.
But let's work on updating the KEP to better reflect all those tradeoffs before Beta - I will let it go for Alpha to not block you for the next 4 months.
| 5. KCM/external-attacher attaches the volume to the node, and finds the affinity mismatch.
| If this happens, the pod will be stuck in a `ContainerCreating` state.
| Kubelet should detect this condition and reject the pod.
I would like to see feedback from someone from the SIG node about that.
It sounds fine to me, but I'm not an expert here.
We (sig-storage) are marking pods as Failed when kubelet detects the CSI driver can't attach more volumes to it here.
Similarly to this KEP, it solves a race condition and we expect the owner (Deployment/StatefulSet) to re-create the Pod.
| For Pods stuck in ContainerCreating due to the storage provider being unable to attach the volume to the scheduled node,
| the Pod will be rejected by kubelet and re-created at the correct node.
| For Pods stuck in Pending due to no suitable node available,
| the scheduler will retry scheduling the Pod according to the updated nodeAffinity.
I think this answer is misleading, as many of those are indirect consequences of this feature.
I think the answer really is:
- nodeAffinity can now be updated for existing volumes
- pods that cannot run due to a volume that can't be attached are now being failed by kubelet
Yes, you are right that many of those are indirect. I will also add the direct part as you describe.
But those are written to correspond to the "Motivation" part of this doc. I'm interpreting the subtitle "working for their instance" as meeting the initial motivation when proposing this change.
I think it's good enough for alpha. I actually like the extra thoughts about mis-scheduled pods and failing them in kubelet.
/lgtm
I think it's acceptable for Alpha from a PRR POV. That said, we should much better reflect all the tradeoffs we're making (e.g. PV update being a privileged operation, etc.) in the KEP before Beta - the way it is currently described doesn't meet the Beta bar, so I will be more conservative next release.
/approve PRR
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: huww98, jsafrane, wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.