Add KEP for volume scheduling limits #942
Conversation
##### Alpha -> Beta Graduation

N/A (`AttachVolumeLimit` feature is already beta).
Since CSI migration is targeting Beta in next quarter, are you saying attach limit migration will be Beta by next quarter too?
`VolumeAttachLimit` feature is already beta in 1.14. We're changing the implementation underneath, hoping that we can keep it still beta.
Force-pushed from c8eb762 to eeeb7b1: Passed initial review round, now the KEP is complete.
- `ResourceName` is limited to 63 characters. We must prefix `ResourceName` with a unique string (such as `attachable-volumes-csi-<driver name>`) so it cannot collide with existing resources like `cpu` or `memory`. But `<driver name>` itself can be up to 63 characters long, so we ended up using SHA sums of the driver name to keep the `ResourceName` unique, which is not user-readable.
- CSI driver cannot share its limits with an in-tree volume plugin, e.g. when running pods with AWS EBS in-tree volumes and the `ebs.csi.aws.com` CSI driver on the same node.
- `node.status` size increases with each installed CSI driver. Node objects are big enough already.
Unsure if this is an issue anymore once the Node heartbeat has been moved to its own object.
Space-wise, it's a similar amount of space since we have to store an equivalent amount in the CSINode object.
Removed the bullet.
### Non-Goals

- Heterogeneous clusters, i.e. clusters where access to storage is limited only to some nodes. Existing `PV.spec.nodeAffinity` handling, not modified by this KEP, will filter out nodes that don't have access to the storage, so predicates changed in this KEP don't need to worry about storage topology and can be simpler.
Another idea was to schedule pods to nodes based on `CSINode`, which indicates what drivers are installed on the node, and maybe even health status in the future.
Added as non-goal.
* Kubelet will create the `CSINode` instance during initial node registration together with the `Node` object.
* Limits of each in-tree volume plugin will be added to `CSINode.status.allocatable`.
* Limit for in-tree volumes will be added by kubelet during `CSINode` creation. The name of the corresponding CSI driver will be used as the key in `CSINode.status.allocatable` and it will be discovered using the [CSI translation library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib). If the library does not support migration of an in-tree volume plugin, the volume plugin has no limit.
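The in-tree-to-CSI name mapping the bullets above rely on can be sketched as follows. This is a minimal illustration, not the translation library's actual API: the map below is a hand-written subset, and `csiDriverName` is a hypothetical helper.

```go
package main

import "fmt"

// inTreeToCSI is an illustrative subset of the in-tree plugin -> CSI driver
// name mapping provided by the CSI translation library. Kubelet uses the CSI
// driver name as the key in CSINode allocatable even for in-tree volumes.
var inTreeToCSI = map[string]string{
	"kubernetes.io/aws-ebs": "ebs.csi.aws.com",
	"kubernetes.io/gce-pd":  "pd.csi.storage.gke.io",
}

// csiDriverName returns the CSI driver name for an in-tree plugin. The second
// result is false when the plugin is not migratable, in which case no limit
// is recorded for it in CSINode.
func csiDriverName(inTreePlugin string) (string, bool) {
	name, ok := inTreeToCSI[inTreePlugin]
	return name, ok
}

func main() {
	fmt.Println(csiDriverName("kubernetes.io/aws-ebs")) // ebs.csi.aws.com true
	_, migratable := csiDriverName("kubernetes.io/azure-disk")
	fmt.Println(migratable) // false: not in this illustrative subset
}
```

A plugin missing from the mapping simply gets no entry in `CSINode.status.allocatable`, which the scheduler interprets as "no limit".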
Do all in-tree plugins today that report limits also have a translation layer?
Azure is not listed as "migratable" in staging/src/k8s.io/csi-translation-lib/translate.go.
@andyzhangx, do you plan to add Azure migration support to 1.15 as alpha?
@jsafrane thanks for letting me know, I have filed the two issues below to add support in 1.15:
kubernetes/kubernetes#76684
kubernetes/kubernetes#76685
##### Removing a deprecated flag

- Announce deprecation and support policy of the existing flag
Can you be specific on what flag is being deprecated here? And when are we going to deprecate it?
I think this is a typo? There are no flags exposed by volume limit feature IIRC.
Whole section 'Removing a deprecated flag' is taken from KEP template. We don't add any flags, removed.
##### Beta -> GA Graduation

It must graduate together with CSI migration.
Do we want to consider adding a cache before GA to improve predicate performance?
IMO, we need to have the implementation fast enough even in beta. It may turn out to be impossible to speed up the code enough after beta to reach GA requirements.
I may have missed it, but what do we foresee being the slow part? The scheduler accessing the CSINode API object?
If I am not wrong, I think @msau42 meant caching of volumes in use on a node. Currently the number of volumes in use on a node is computed by iterating through running/in-flight pods and adding up the number of volumes they are using.
This can be cached, and the cache can be kept in sync with the decisions the scheduler makes. Caching will potentially save some CPU cycles but IMO should not be targeted for this release. It can get fairly tricky pretty quickly.
So for beta, our goal is same performance as the current implementation. Before we go to GA, do we want to have a goal of improving performance by caching these in-use counts per node?
added a line.
* CSI driver is used both for in-tree and CSI PVs (if the driver is installed).
* Kubelet creates `CSINode` during node registration with no limit for the volume plugin / CSI driver.
* When a CSI driver is registered, kubelet gets its volume limit through CSI and updates `CSINode` with the new limit.
* During the period when no CSI driver is registered and `CSINode` exists, there is "no limit" for the CSI driver and the scheduler can put pods there! See non-goals!
Or can we do something like set limit to 0 so that no pods can be scheduled there and wait for the CSI driver to be installed?
If we have no limit then more pods could get scheduled but will get stuck in attach and require user intervention to retry
Regardless when migration is on and driver is not installed yet, pods won't be able to come up, so maybe it's better to not schedule the pod to that node.
This was fixed. After the deprecation period, we will stop scheduling pods to a node that does not have the driver in `CSINode.Spec.Drivers`, or if the `CSINode` object itself is missing.
| `Volumes` | Description |
| --------- | ------------ |
| key is missing in `CSINode.status.allocatable` | there is no limit of volumes on the node* |
| 0 | plugin / CSI driver exists and has zero limit, i.e. can attach no volumes |
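The semantics in the table above can be sketched as a small helper. This is a hypothetical sketch, not the actual scheduler predicate (`volumeFits` and its parameters are illustrative names):

```go
package main

import "fmt"

// volumeFits reports whether one more volume of the given driver fits on a
// node, following the table above: a missing key means "no limit", and a
// zero limit means the driver can attach no volumes. allocatable maps
// driver name -> volume limit; inUse is the count of that driver's volumes
// already on the node.
func volumeFits(allocatable map[string]int32, driver string, inUse int32) bool {
	limit, ok := allocatable[driver]
	if !ok {
		return true // key missing: no limit on this node
	}
	return inUse < limit // limit 0: nothing ever fits
}

func main() {
	alloc := map[string]int32{"ebs.csi.aws.com": 39, "broken.example.com": 0}
	fmt.Println(volumeFits(alloc, "ebs.csi.aws.com", 38))        // true
	fmt.Println(volumeFits(alloc, "broken.example.com", 0))      // false
	fmt.Println(volumeFits(alloc, "pd.csi.storage.gke.io", 100)) // true: missing key
}
```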
Is it possible to set 0? CSI spec says that a 0 value is undefined and CO can interpret it however it wants. For backwards compatibility I would expect 0 value from CSI to mean unlimited
Similar question comes up if in the future we add capacity limits. How to tell the difference between 0 and undefined? Do we need the fields to be pointers?
While discussing the issue of stopping pods from getting scheduled on a node if the driver is not installed on it, I thought it might be best to handle this as a separate issue. We can indeed set the value to 0 when the driver is not installed, but this applies only to drivers that support migration.
Since we won't be creating a `CSINode` object for in-tree drivers which don't support migration, the solution of setting the volume limit to 0 might not be good enough. We may have to go one step further and prevent scheduling of volumes if the `CSINode` object for a driver does not exist on a node. If yes, that sounds like something that is out of scope for the volume limit work.
I'm only concerned about drivers that are being migrated. If we set the limit to "unlimited", then during this (hopefully) brief moment, Pods that are using in-tree drivers can fail attaching, and require manual intervention to fix. One of the goals of migration is that users shouldn't notice that we switched.
> Similar question comes up if in the future we add capacity limits. How to tell the difference between 0 and undefined?

If a driver is installed, it is in `CSINode.Spec.Drivers`. Then the driver either has its limit in `CSINode.Status` or it's missing there, and that means there is no limit.
IMO, scheduling based on availability of a driver on a node is a separate issue and a separate predicate. It should be fairly trivial to implement.
It could be a separate issue but we also need to decide if that should block migration ga or not. @davidz627
Scheduling based on availability of a driver on a node is a blocker for migration GA since not having that would be a regression in volume behavior.
This was addressed by @jsafrane in a recent commit. Scheduler will not schedule pods to a node if it does not have a driver entry in the `CSINode` object.
// allocatable is a list of volume limits for each volume plugin and CSI driver on the node.
// +patchMergeKey=name
// +patchStrategy=merge
Allocatable []VolumeLimits `json:"allocatable" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,1,rep,name=allocatable"`
Should this be a `CSINodeDriverStatus` to mirror the spec more closely?
Is it odd that a status for a driver may exist before the spec? Although in past design iterations for migration we had confusing behavior where the driver spec would be partially filled in before the driver was installed.
I think we want to avoid having a partially filled spec. That leaves the style in this design or having a status without a spec as the only options.
For the current design that means if we ever add a "real status field" we have to be ok with the pretty weird schema of:

```yaml
apiVersion: storage.k8s.io/v1beta1
kind: CSINode
metadata:
  name: ip-172-18-4-112.ec2.internal
spec:
status:
  drivers:
  - name: ebs.csi.aws.com
    fieldA: valueA
  allocatable:
  # AWS node can attach max. 40 volumes, 1 is reserved for the system
  - name: ebs.csi.aws.com
    volumes: 39
```

where drivers are duplicated between the allocatable and drivers fields.
What is the API guidance on having a `Status` without a corresponding `Spec`? Is that even weirder? @liggitt
I am not sure why having a `Status` field without a corresponding `Spec` field is weird. The `Node` object is a good example, and the `CSINode` object in fact mirrors properties of a node.
We may end up with two arrays with the same keys and that would indeed be weird. We could do:

```yaml
status:
  drivers:
  - name: ebs.csi.aws.com
    fieldXYZ: x
    allocatable:
      volumes: 39
      size: "64 TB"
```

But that would complicate sharing of limits between drivers, if we ever implement it.
In a meeting we decided to move `allocatable` into `CSINode.spec.drivers[xyz]`, because it does not change over time.
mostly looks good to me! One important comment on the CSINode Status design though
Thanks, @jsafrane!
- Users can use PVs both with in-tree volume plugins and CSI, and they will share their limits. There is only one scheduler predicate that handles both kinds of volumes.

- Existing predicates for in-tree volumes `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` are removed (with a deprecation period).
- When both a deprecated in-tree predicate and the CSI predicate are enabled, only one of them does useful work and the other is a NOOP to save CPU.
Shouldn't this be considered an invalid config? Do we need to support it in the transition phase for backward compatibility? Also, later in the doc it is mentioned that CSI predicate will be used in this case.
> Shouldn't this be considered an invalid config?

Ideally yes, but `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount`, `MaxCinderVolumeCount` and `MaxCSIVolumeCountPred` are all existing predicates and right now they can all be enabled together. I can make `Max<cloud>VolumeCount` and `MaxCSIVolumeCountPred` mutually exclusive and it would simplify the implementation a lot. Is that allowed? IMO it could break clusters that have all the predicates enabled.

> Also, later in the doc it is mentioned that CSI predicate will be used in this case.

Clarified a bit ("`MaxCSIVolumeCountPred` does useful work...").
| 0 | plugin / CSI driver exists and has zero limit, i.e. can attach no volumes |
| X>0 | plugin / CSI driver exists and can attach X volumes (where X > 0) |
| X<0 | negative values are blocked by validation |
| Driver is missing in `CSINode.status.allocatable` | there is no limit of volumes on the node* |
When does this happen, and what does it mean that a "driver" does not exist in `CSINode.status.allocatable`?
> what does it mean that "driver" does not exist in `CSINode.status.allocatable`

`CSINode.status.allocatable` is a `map[DriverName]VolumeLimits`. If a driver is missing there, it can mean two things:

- Driver is not installed on the node (or it is installed and informers are slow to propagate it to all components).
- Driver is installed on the node and has no limits.

In both cases, the scheduler expects that there is no limit.
Differentiation between these two cases is discussed here: #942 (comment)
// allocatable is a list of volume limits for each volume plugin and CSI driver on the node.
// +patchMergeKey=name
// +patchStrategy=merge
Allocatable []VolumeLimits `json:"allocatable" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,1,rep,name=allocatable"`
Can't this be a `map[string]int32` that maps a CSI driver name to its volume limit?
We want to add total volume size later. We can add either two maps: `map[string]int32` for the count of volumes and `map[string]Quantity` for the total size. Or we can have maps of structs as it is now. Both will work; not sure what the preference is here. Also see discussion at #942 (comment)
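The trade-off between a plain `map[string]int32` and a list of structs can be sketched in a few lines. This is an illustrative sketch only; the field names mirror the snippet quoted above but the helper is hypothetical:

```go
package main

import "fmt"

// VolumeLimits mirrors the struct form discussed above (field names are
// illustrative). Keeping a struct per driver lets new limits, such as a
// total capacity, be added later without introducing a second parallel map.
type VolumeLimits struct {
	Name    string // CSI driver name, used as the merge key
	Volumes int32  // max number of attachable volumes
	// A Size field for total capacity could be added here later.
}

// toMap indexes the API list by driver name, which is how a scheduler-side
// consumer might look up a driver's limits.
func toMap(limits []VolumeLimits) map[string]VolumeLimits {
	m := make(map[string]VolumeLimits, len(limits))
	for _, l := range limits {
		m[l.Name] = l
	}
	return m
}

func main() {
	limits := []VolumeLimits{{Name: "ebs.csi.aws.com", Volumes: 39}}
	fmt.Println(toMap(limits)["ebs.csi.aws.com"].Volumes) // 39
}
```

With two parallel maps, every new kind of limit would add another map keyed by the same driver names; the struct form keeps all of a driver's limits in one place.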
/label api-review
* Especially, `kubelet --kube-reserved` or `--system-reserved` cannot be used to "reserve" volumes for kubelet or the OS. It is not possible with the existing kubelet and this KEP does not change it. We expect that CSI drivers will have configuration options / cmdline arguments to reserve some volumes and they will report their limit already reduced by that reserved amount.
* Scheduler will respect `Node.status.allocatable` and `Node.status.capacity` for CSI volumes if the `CSINode` object is not available or is missing an entry in `CSINode.spec.drivers[xyz].allocatable` during a deprecation period, but kubelet will stop populating `Node.status.allocatable` and `Node.status.capacity` for CSI volumes.
* After the deprecation period for CSI volumes, limits coming from `Node.status.allocatable` and `Node.status.capacity` will be completely ignored by the scheduler.
* After the deprecation period, the scheduler won't schedule any pods that use CSI volumes to a node with a missing `CSINode` instance or a missing driver in `CSINode.Spec.Drivers`. It is expected that this happens only during node registration, when `Node` exists and `CSINode` doesn't, and it self-heals quickly.
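The deprecation-period fallback described above can be sketched as a small helper. This is a hypothetical sketch of the behavior, not the actual scheduler code; `effectiveLimit` and its parameters are illustrative names:

```go
package main

import "fmt"

// effectiveLimit returns the volume limit the scheduler should use for a CSI
// driver during the deprecation period: prefer the CSINode entry and fall
// back to the legacy Node.status.allocatable value. The bool result is false
// when neither source has an entry, meaning "no limit".
func effectiveLimit(csiNodeLimits, nodeLimits map[string]int32, driver string) (int32, bool) {
	if limit, ok := csiNodeLimits[driver]; ok {
		return limit, true
	}
	if limit, ok := nodeLimits[driver]; ok {
		return limit, true // legacy path, ignored after the deprecation period
	}
	return 0, false
}

func main() {
	csi := map[string]int32{"ebs.csi.aws.com": 39}
	legacy := map[string]int32{"ebs.csi.aws.com": 40, "pd.csi.storage.gke.io": 16}
	fmt.Println(effectiveLimit(csi, legacy, "ebs.csi.aws.com"))       // 39 true
	fmt.Println(effectiveLimit(csi, legacy, "pd.csi.storage.gke.io")) // 16 true
}
```

After the deprecation period, the second lookup would simply be removed, leaving `CSINode` as the only source of limits.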
IMO, the whole sentence "It is expected that it happens only during node registration when `Node` exists and `CSINode` doesn't and it self-heals quickly" is obsolete. There is no self-heal, because there are no limits in the `CSINode` created at kubelet startup. The scheduler needs to wait for a driver to get installed there.
removed.
Overall this seems OK. Is it "better" in CSINode or just simpler? E.g. if we added a volume-specific status block with volume-specific allocatable to node, would we prefer that for any reason?
- Maximum capacity per node. Some cloud environments limit both the number of attached volumes (covered in this KEP) and the total capacity of attached volumes (not covered in this KEP). For example, this KEP will ensure that the scheduler puts max. 128 GCE PD volumes on a [typical GCE node](https://cloud.google.com/compute/docs/machine-types#predefined_machine_types), but it won't ensure that the total capacity of the volumes is less than 64 TB.

- Volume limits do not yet integrate with the cluster autoscaler if all nodes in the cluster are running at maximum volume limits.
@mwielgus In case you have not seen this one
## Proposal

* Track volume limits for CSI drivers in `CSINode` objects instead of `Node`. The limit in `CSINode` is used instead of the limit coming from the `Node` object whenever it is available for the same in-tree volume type. This means the scheduler will always try to translate an in-tree driver name to a CSI driver name whenever the `CSINode` object has the same in-tree volume type (even if migration is off).
Does scheduler already consider CSINode anywhere? Would be worth saying whether this exists or is net-new
Scheduler does not consider the `CSINode` object anywhere currently. I can perhaps clarify it some more.
Fixed.
@thockin it is a bit simpler to use
@thockin okay, addressed open comments. PTAL. :-)
editor: TBD
creation-date: 2019-04-08
last-updated: 2019-04-08
status: provisional
This should be changed to implementable.
/hold for changing the status bit
update the spec to not create CSINode object by default for in-tree volume types
@msau42 updated as implementable. Also updated reviewers and approvers to be more accurate. Added myself as one of the authors. Can you please re-lgtm?
/lgtm
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED Approval requirements bypassed by manually added approval. This pull-request has been approved by: jsafrane, msau42, thockin The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This KEP changes existing scheduler predicates for volume limits from using `Node.status.allocatable` to `CSINode.spec.drivers["xyz"]` for CSI drivers and in-tree volumes (in some cases).

Feature: #554