Commit eeeb7b1

Add KEP for volume scheduling limits

jsafrane committed Apr 8, 2019 (1 parent dee70c1)

Showing 1 changed file with 242 additions and 0 deletions: keps/sig-storage/20190408-volume-scheduling-limits.md
---
title: Volume Scheduling Limits
authors:
- "@jsafrane"
owning-sig: sig-storage
participating-sigs:
- sig-scheduling
reviewers:
- "@bsalamat"
- "@gnufied"
- "@davidz627"
approvers:
- "@bsalamat"
- "@davidz627"
editor: TBD
creation-date: 2019-04-08
last-updated: 2019-04-08
status: provisional
see-also:
- https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190129-csi-migration.md
replaces: https://github.com/kubernetes/enhancements/pull/730
superseded-by:
---

# Volume Scheduling Limits

## Table of Contents

* [Volume Scheduling Limits](#volume-scheduling-limits)
* [Table of Contents](#table-of-contents)
* [Release Signoff Checklist](#release-signoff-checklist)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [New API](#new-api)
* [User Stories](#user-stories)
* [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
* [Risks and Mitigations](#risks-and-mitigations)
* [Design Details](#design-details)
* [Test Plan](#test-plan)
* [Graduation Criteria](#graduation-criteria)
* [Alpha -> Beta Graduation](#alpha---beta-graduation)
* [Beta -> GA Graduation](#beta---ga-graduation)
* [Removing a deprecated flag](#removing-a-deprecated-flag)
* [Upgrade / Downgrade / Version Skew Strategy](#upgrade--downgrade--version-skew-strategy)
* [Implementation History](#implementation-history)
* [Alternatives](#alternatives)

## Release Signoff Checklist

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

**Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).

## Summary

The number of volumes of a certain type that can be attached to a node should be easily configurable and should be based on the node type. This proposal implements dynamic attachable volume limits on a per-node basis rather than the cluster-wide defaults that exist today. It also implements a way of configuring volume limits for CSI volumes.

This proposal replaces [#730](https://github.com/kubernetes/enhancements/pull/730) and integrates volume limits for in-tree volumes (AWS EBS, GCE PD, Azure DD, OpenStack Cinder) and CSI into one predicate. As a result, an in-tree volume plugin and the corresponding CSI driver can share the same volume limit.

## Motivation

The current scheduler predicates for scheduling pods with volumes are based on `node.status.capacity` and `node.status.allocatable`. This works well for the hardcoded predicates for volume limits on AWS (`MaxEBSVolumeCount`), GCE (`MaxGCEPDVolumeCount`), Azure (`MaxAzureDiskVolumeCount`) and OpenStack (`MaxCinderVolumeCount`).

It is problematic for CSI (`MaxCSIVolumeCountPred`), as outlined in [#730](https://github.com/kubernetes/enhancements/pull/730):

- `ResourceName` is limited to 63 characters. We must prefix `ResourceName` with a unique string (such as `attachable-volumes-csi-<driver name>`) so it cannot collide with existing resources like `cpu` or `memory`. But `<driver name>` itself can be up to 63 characters long, so we ended up using SHA sums of the driver name to keep the `ResourceName` unique, which is not user readable.
- A CSI driver cannot share its limits with an in-tree volume plugin, e.g. when running pods with AWS EBS in-tree volumes and the `ebs.csi.aws.com` CSI driver on the same node.
- `node.status` grows with each installed CSI driver. Node objects are big enough already.

### Goals

- Users can use PVs with both in-tree volume plugins and CSI, and they will share their limits. There is only one scheduler predicate that handles both kinds of volumes.

- Existing predicates for in-tree volumes, `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount`, are removed (with a deprecation period).
- When both a deprecated in-tree predicate and the CSI predicate are enabled, only one of them does useful work and the other is a NOOP to save CPU.

- The scheduler does not increase its CPU consumption. Any regression must be approved by sig-scheduling.

### Non-Goals

- Heterogeneous clusters, i.e. clusters where access to storage is limited to only some nodes. The existing `PV.spec.nodeAffinity` handling, not modified by this KEP, will filter out nodes that don't have access to the storage, so the predicates changed in this KEP don't need to worry about storage topology and can be simpler.

## Proposal

* Track volume limits for both in-tree volume plugins and CSI drivers in `CSINode` objects instead of `Node`:
  * To get rid of the prefix + SHA in the `ResourceName` of CSI volumes.
  * So an in-tree volume plugin can share limits with a CSI driver that uses the same storage backend.

* Kubelet will create the `CSINode` instance during initial node registration, together with the `Node` object.
  * Limits of each in-tree volume plugin will be added to `CSINode.status.allocatable`.
  * Limits for in-tree volumes will be added by kubelet during `CSINode` creation. The name of the corresponding CSI driver will be used as the key in `CSINode.status.allocatable` and will be discovered using the [CSI translation library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib), as sketched below. If the library does not support migration of an in-tree volume plugin, the volume plugin has no limit.
  * If a CSI driver is registered for an in-tree volume plugin and reports a different volume limit than the in-tree volume plugin, the limit reported by the CSI driver is used and kubelet logs a warning.
  * Users may NOT change `CSINode.status.allocatable` to override volume plugin / CSI driver values, e.g. to "reserve" some attachments for the operating system. Kubelet will periodically reconcile `CSINode` and overwrite the value.
    * In particular, `kubelet --kube-reserved` or `--system-reserved` cannot be used to "reserve" volumes for kubelet or the OS. That is not possible with the current kubelet and this KEP does not change it.
    * We expect that CSI drivers will have configuration options / command line arguments to reserve some volumes and will report their limit already reduced by that reserved amount.
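
The following minimal Go sketch illustrates how kubelet could populate `CSINode.status.allocatable` during node registration. All names (`inTreePluginLimits`, `csiDriverNameFor`, `buildAllocatable`) and the concrete limit are illustrative stand-ins, not the real kubelet or csi-translation-lib API:

```go
package main

import "fmt"

// VolumeLimits mirrors the proposed CSINode.status.allocatable map:
// CSI driver name -> maximum number of volumes of that driver on the node.
type VolumeLimits map[string]int64

// inTreePluginLimits stands in for the limits kubelet discovers from the cloud
// provider / instance type; the value here is purely illustrative.
func inTreePluginLimits() map[string]int64 {
	return map[string]int64{
		"kubernetes.io/aws-ebs": 39,
	}
}

// csiDriverNameFor stands in for the CSI translation library lookup
// (in-tree plugin name -> CSI driver name). Plugins the library does not
// know are reported as non-migratable.
func csiDriverNameFor(inTreePlugin string) (string, bool) {
	translations := map[string]string{
		"kubernetes.io/aws-ebs": "ebs.csi.aws.com",
	}
	name, ok := translations[inTreePlugin]
	return name, ok
}

// buildAllocatable sketches how kubelet could fill CSINode.status.allocatable:
// in-tree limits are keyed by the corresponding CSI driver name; plugins the
// translation library cannot translate get no entry, i.e. no limit.
func buildAllocatable() VolumeLimits {
	allocatable := VolumeLimits{}
	for plugin, limit := range inTreePluginLimits() {
		if csiName, ok := csiDriverNameFor(plugin); ok {
			allocatable[csiName] = limit
		}
	}
	return allocatable
}

func main() {
	fmt.Println(buildAllocatable()) // map[ebs.csi.aws.com:39]
}
```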

* Kubelet will continue filling `Node.status.allocatable` and `Node.status.capacity` for both in-tree and CSI volumes during the deprecation period. After the deprecation period, it will stop filling them completely.
  * The scheduler (all its storage predicates) will ignore `Node.status.allocatable` and `Node.status.capacity` if `CSINode.status.allocatable` is present.
  * If `CSINode.status.allocatable` (or the whole `CSINode`) is missing, the scheduler falls back to `Node.status.allocatable`, as sketched below. This solves version skew between an old kubelet (using `Node.status`) and a new scheduler.
  * After the deprecation period, the scheduler won't schedule any pods that use volumes to a node with a missing `CSINode` instance. This is expected to happen only during node registration, when the `Node` exists and the `CSINode` doesn't, and it self-heals quickly.
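
A minimal sketch of this fallback, using plain maps instead of the real `CSINode` and `Node` objects; the legacy resource name `attachable-volumes-aws-ebs` is shown only as an example of the deprecated `Node.status` key:

```go
package main

import "fmt"

// volumeLimit sketches the fallback described above: prefer
// CSINode.status.allocatable when it is present, otherwise fall back to the
// deprecated per-plugin resource in Node.status.allocatable.
func volumeLimit(csiNodeAllocatable, nodeAllocatable map[string]int64, driver, legacyResource string) (int64, bool) {
	if csiNodeAllocatable != nil {
		limit, ok := csiNodeAllocatable[driver]
		return limit, ok // a missing key means "no limit"
	}
	limit, ok := nodeAllocatable[legacyResource]
	return limit, ok
}

func main() {
	// New kubelet: CSINode.status.allocatable is populated and wins.
	fmt.Println(volumeLimit(map[string]int64{"ebs.csi.aws.com": 39}, nil, "ebs.csi.aws.com", "attachable-volumes-aws-ebs"))
	// Old kubelet (version skew): only Node.status.allocatable is available.
	fmt.Println(volumeLimit(nil, map[string]int64{"attachable-volumes-aws-ebs": 39}, "ebs.csi.aws.com", "attachable-volumes-aws-ebs"))
}
```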

* `CSINode.status.allocatable` is a map of CSI driver name -> int64. The following combinations are possible (a short predicate sketch further below shows how they are interpreted):

| Driver name | Value | Description |
| ----------------- | ----- | ------------ |
| `ebs.csi.aws.com` | 0 | The plugin / CSI driver exists and has a zero limit, i.e. it can attach no volumes. |
| `ebs.csi.aws.com` | X>0 | The plugin / CSI driver exists and can attach X volumes (where X > 0). |
| `ebs.csi.aws.com` | X<0 | Negative values are blocked by validation. |
| key is missing in `CSINode.status.allocatable` | - | There is no limit on volumes on the node.* |

*) This way we are not able to distinguish between a volume plugin / CSI driver that is not installed on a node and one that is installed but has no limit.

* Predicates modified in this KEP assume that storage provided by an **in-tree** volume plugin is available on all nodes in the cluster. Other predicate(s) will evaluate `PV.spec.nodeAffinity` and filter out nodes that don't have access to the storage.
  * For CSI drivers, availability of a CSI driver on a node can be checked in `CSINode.spec`.

CSINode example:

```yaml
apiVersion: storage.k8s.io/v1beta1
kind: CSINode
metadata:
  name: ip-172-18-4-112.ec2.internal
spec:
status:
  allocatable:
    # AWS node can attach max. 40 volumes, 1 is reserved for the system
    ebs.csi.aws.com: 39
```
* Existing scheduler predicates `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` are already deprecated.
  * If any of them is enabled together with `MaxCSIVolumeCountPred`, the deprecated predicate will do nothing (`MaxCSIVolumeCountPred` does the job of counting both in-tree and CSI volumes).
  * The deprecated predicates will do useful work only when the `MaxCSIVolumeCountPred` predicate is disabled.
  * This way, we save CPU by running only one volume limit predicate during the deprecation period.
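
To make the semantics of the table above concrete, here is a minimal Go sketch of how a combined volume-limit predicate could interpret the `CSINode.status.allocatable` map. The function name and the plain-map inputs are illustrative; the real `MaxCSIVolumeCountPred` implementation differs:

```go
package main

import "fmt"

// fitsVolumeLimits sketches how a single volume-limit predicate could
// interpret CSINode.status.allocatable: a missing key means "no limit",
// 0 means no volumes of that driver fit, and a positive value is the limit.
// requestedByDriver is the count of unique volumes (existing on the node plus
// the pod's new ones) per CSI driver name; how it is computed is omitted here.
func fitsVolumeLimits(allocatable, requestedByDriver map[string]int64) bool {
	for driver, requested := range requestedByDriver {
		limit, limited := allocatable[driver]
		if !limited {
			continue // no limit for this driver on this node
		}
		if requested > limit {
			return false // also covers limit == 0
		}
	}
	return true
}

func main() {
	allocatable := map[string]int64{"ebs.csi.aws.com": 39}
	fmt.Println(fitsVolumeLimits(allocatable, map[string]int64{"ebs.csi.aws.com": 40})) // false
	fmt.Println(fitsVolumeLimits(allocatable, map[string]int64{"ebs.csi.aws.com": 10})) // true
}
```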


### New API

CSINode gets a `Status` struct with `Allocatable`, holding the volume limit for each volume plugin and CSI driver that can be scheduled to the node.

```go
type CSINode struct {
	...
	// status is the status of CSINode
	Status CSINodeStatus `json:"status" protobuf:"bytes,3,opt,name=status"`
}

// VolumeLimits is a map of CSI driver name -> maximum count of volumes for the driver on the node.
// For in-tree volume plugins, the name of the corresponding CSI driver is used.
// The value can be either:
// - a positive integer: that's the volume limit;
// - zero: such volumes cannot be used on the node;
// - missing key in VolumeLimits: there is no volume limit, i.e. any number of volumes can be used on the node.
type VolumeLimits map[string]int64

// CSINodeStatus holds information about the status of all CSI drivers installed on a node.
type CSINodeStatus struct {
	// allocatable is a map of volume limits for each volume plugin and CSI driver on the node,
	// keyed by CSI driver name.
	Allocatable VolumeLimits `json:"allocatable" protobuf:"bytes,1,rep,name=allocatable"`
}
```

### User Stories


### Implementation Details/Notes/Constraints

The [CSI translation library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib) is used to find the CSI driver name and `VolumeHandle` for in-tree volume plugins. The CSI driver name is used as the key in the `CSINode.status.allocatable` map. The `VolumeHandle` is unique for each volume and will be used to de-duplicate volumes used by multiple pods on the same node.
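
A minimal sketch of this de-duplication, keyed by (CSI driver name, `VolumeHandle`); the type and function names, as well as the volume handles, are illustrative only:

```go
package main

import "fmt"

// volumeKey identifies a volume by CSI driver name and VolumeHandle; in-tree
// volumes are translated to this form by the CSI translation library.
type volumeKey struct {
	driver string
	handle string
}

// countUniqueVolumes sketches the de-duplication described above: a volume
// used by several pods on the same node counts only once toward its driver's
// limit.
func countUniqueVolumes(volumes []volumeKey) map[string]int64 {
	seen := map[volumeKey]bool{}
	counts := map[string]int64{}
	for _, v := range volumes {
		if seen[v] {
			continue
		}
		seen[v] = true
		counts[v.driver]++
	}
	return counts
}

func main() {
	volumes := []volumeKey{
		{"ebs.csi.aws.com", "vol-0a12"}, // in-tree EBS PV translated to CSI form
		{"ebs.csi.aws.com", "vol-0a12"}, // the same volume used by a second pod
		{"ebs.csi.aws.com", "vol-0b34"},
	}
	fmt.Println(countUniqueVolumes(volumes)) // map[ebs.csi.aws.com:2]
}
```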

### Risks and Mitigations

* This KEP depends on the [CSI translation library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib). It can happen that CSI migration is redesigned or cancelled.
  * Countermeasure: [CSI migration](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190129-csi-migration.md) and this KEP should graduate together.

* This KEP depends on the [CSI translation library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib)'s ability to handle in-line in-tree volumes. The scheduler will need to get the CSI driver name + `VolumeHandle` from them to count them towards the limit.

## Design Details

The existing feature gate `AttachVolumeLimit` will be re-used for the implementation of this KEP. The feature is already beta and enabled by default.
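
For illustration, gating new code paths on this feature gate would look roughly like the snippet below, assuming the standard Kubernetes feature-gate helpers (`k8s.io/apiserver/pkg/util/feature`, `k8s.io/kubernetes/pkg/features`); the package name is arbitrary:

```go
package volumescheduling

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// useCSINodeLimits reports whether the CSINode-based limits described in this
// KEP should be consulted; it simply checks the existing AttachVolumeLimit gate.
func useCSINodeLimits() bool {
	return utilfeature.DefaultFeatureGate.Enabled(features.AttachVolumeLimit)
}
```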

### Test Plan

* Run [scheduler benchmark](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scheduling/scheduler_benchmarking.md) with matrix composed of:
* Predicates:
* All volume predicates enabled.
* Only deprecated `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` predicates enabled.
* Only `MaxCSIVolumeCountPred` predicate enabled.
* API objects:
* Both CSINode and Node containing `status.allocatable` for a volume plugin (to simulate kubelet during deprecation period).
* Only CSINode containing `status.allocatable` for a volume plugin (to simulate kubelet after deprecation period).
* Only Node containing `status.allocatable` for a volume plugin (to simulate old kubelet).
* Test results should ideally be the same as before this KEP.
* Any deviation needs to be approved by sig-scheduling.

* Run e2e tests and kubelet version skew tests to check that scheduler picks the right values from CSINode or Node.

* Add e2e test that runs pods with both in-tree volumes and CSI driver for the same storage backend and check that they share the same volume limits.

### Graduation Criteria

#### Alpha -> Beta Graduation

N/A (`AttachVolumeLimit` feature is already beta).

#### Beta -> GA Graduation

It must graduate together with CSI migration.

#### Removing a deprecated flag

- Announce deprecation and support policy of the existing flag
- Two versions passed since introducing the functionality which deprecates the flag (to address version skew)
- Address feedback on usage/changed behavior, provided on GitHub issues
- Deprecate the flag

### Upgrade / Downgrade / Version Skew Strategy


During upgrade, downgrade or version skew, kubelet may be older than the scheduler. Such a kubelet will not fill `CSINode.status` with volume limits and will instead fill volume limits into `Node.status`. The scheduler must fall back to `Node.status` when `CSINode` is not available or its `status` does not contain a volume plugin / CSI driver.

## Implementation History


## Alternatives

In https://github.com/kubernetes/enhancements/pull/730 we tried to merge volume limits into `Node.status.capacity` and `Node.status.allocatable`. We discovered these issues:

* We cannot use a plain CSI driver name as a resource name in `Node.status.allocatable`, as it could collide with other resources (e.g. "memory"), so we added a volume-specific prefix.
* Since a CSI driver name can be [up to 63 characters long](https://github.com/container-storage-interface/spec/blob/master/spec.md#getplugininfo), the prefix + driver name cannot fit into the 63-character resource name limit. We ended up hashing the driver name to save space.

By moving volume limits to CSINode we fix both issues.
