Add documentation for per pod cgroups #2841

Merged 1 commit on Mar 21, 2017
216 changes: 148 additions & 68 deletions docs/admin/node-allocatable.md
title: Reserving Compute Resources for System Daemons
* TOC
{:toc}

Kubernetes nodes can be scheduled up to `Capacity`, and by default pods can
consume all the available capacity on a node. This is an issue because nodes
typically run quite a few system daemons that power the OS and Kubernetes
itself. Unless resources are set aside for these system daemons, pods and system
daemons compete for resources and lead to resource starvation issues on the
node.

The `kubelet` exposes a feature named `Node Allocatable` that helps to reserve
compute resources for system daemons. Kubernetes recommends that cluster
administrators configure `Node Allocatable` based on the workload density on
each node.

## Node Allocatable

```text
      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available for pods)  |
|                         |
|                         |
---------------------------
```

`Allocatable` on a Kubernetes node is defined as the amount of compute resources
that are available for pods. The scheduler does not over-subscribe
`Allocatable`. `CPU` and `memory` are supported as of now. Support for `storage`
is expected to be added in the future.

Node Allocatable is exposed as part of the `v1.Node` object in the API and as
part of `kubectl describe node` in the CLI.
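
For example, a node's `Capacity` and `Allocatable` can be inspected from the
CLI. The output below is an illustrative, trimmed sketch; actual field names and
values depend on the node and the Kubernetes version:

```shell
# Illustrative, trimmed output; real values depend on the node.
kubectl describe node <node-name>
# ...
# Capacity:
#  cpu:     16
#  memory:  32Gi
#  pods:    110
# Allocatable:
#  cpu:     14500m
#  memory:  28Gi
#  pods:    110
# ...
```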

Resources can be reserved for two categories of system daemons in the `kubelet`.

### Enabling QoS and Pod level cgroups

To properly enforce node allocatable constraints on the node, you must
enable the new cgroup hierarchy via the `--cgroups-per-qos` flag. This flag is
enabled by default. When enabled, the `kubelet` will parent all end-user pods
under a cgroup hierarchy managed by the `kubelet`.
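
As a minimal sketch, the flag can be passed explicitly when starting the
`kubelet` (it is already on by default, so this is only for illustration):

```shell
# --cgroups-per-qos defaults to true; shown explicitly for clarity.
# With it enabled, all end-user pods are parented under a kubelet-managed
# cgroup hierarchy (e.g. /kubepods with the cgroupfs driver).
kubelet --cgroups-per-qos=true
```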

### Configuring a cgroup driver

The `kubelet` supports manipulation of the cgroup hierarchy on
the host using a cgroup driver. The driver is configured via the
`--cgroup-driver` flag.

The supported values are the following:

* `cgroupfs` is the default driver that performs direct manipulation of the
cgroup filesystem on the host in order to manage cgroup sandboxes.
* `systemd` is an alternative driver that manages cgroup sandboxes using
transient slices for resources that are supported by that init system.

Depending on the configuration of the associated container runtime,
operators may have to choose a particular cgroup driver to ensure
proper system behavior. For example, if operators use the `systemd`
cgroup driver provided by the `docker` runtime, the `kubelet` must
be configured to use the `systemd` cgroup driver.
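
A hedged sketch of keeping the two in agreement, assuming the `docker` runtime
(the `grep` pattern is illustrative):

```shell
# Check which cgroup driver the container runtime uses.
docker info 2>/dev/null | grep -i "cgroup driver"
# If it reports "systemd", configure the kubelet to match.
kubelet --cgroup-driver=systemd
```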

### Kube Reserved

- **Kubelet Flag**: `--kube-reserved=[cpu=100m][,][memory=100Mi]`
- **Kubelet Flag**: `--kube-reserved-cgroup=`

`kube-reserved` is meant to capture resource reservation for kubernetes system
daemons like the `kubelet`, `container runtime`, `node problem detector`, etc.
It is not meant to reserve resources for system daemons that are run as pods.
`kube-reserved` is typically a function of `pod density` on the nodes. [This
performance dashboard](http://node-perf-dash.k8s.io/#/builds) exposes `cpu` and
`memory` usage profiles of `kubelet` and `docker engine` at multiple levels of
pod density. [This blog
post](http://blog.kubernetes.io/2016/11/visualize-kubelet-performance-with-node-dashboard.html)
explains how the dashboard can be interpreted to come up with a suitable
`kube-reserved` reservation.

To optionally enforce `kube-reserved` on system daemons, specify the parent
control group for kube daemons as the value for the `--kube-reserved-cgroup`
kubelet flag.

It is recommended that the kubernetes system daemons are placed under a top
level control group (`runtime.slice` on systemd machines for example). Each
system daemon should ideally run within its own child control group. Refer to
[this
doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md#recommended-cgroups-setup)
for more details on recommended control group hierarchy.

Note that the `kubelet` **does not** create the cgroup specified by
`--kube-reserved-cgroup` if it does not already exist; the `kubelet` will fail
if an invalid cgroup is specified.
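
A sketch of a kube reservation, with illustrative values and an illustrative
cgroup path; size the reservation from the usage profiles discussed above:

```shell
# Values are examples only. The cgroup given to --kube-reserved-cgroup must
# already exist; the kubelet does not create it and fails otherwise.
kubelet --kube-reserved=cpu=100m,memory=256Mi \
        --kube-reserved-cgroup=/runtime.slice
```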

### System Reserved

- **Kubelet Flag**: `--system-reserved=[cpu=100m][,][memory=100Mi]`
- **Kubelet Flag**: `--system-reserved-cgroup=`


`system-reserved` is meant to capture resource reservation for OS system daemons
like `sshd`, `udev`, etc. `system-reserved` should reserve `memory` for the
`kernel` too since `kernel` memory is not accounted to pods in Kubernetes at this time.
Reserving resources for user login sessions is also recommended (`user.slice` in
systemd world).

To optionally enforce `system-reserved` on system daemons, specify the parent
control group for OS system daemons as the value for the
`--system-reserved-cgroup` kubelet flag.

It is recommended that the OS system daemons are placed under a top level
control group (`system.slice` on systemd machines for example).

Note that the `kubelet` **does not** create the cgroup specified by
`--system-reserved-cgroup` if it does not already exist; the `kubelet` will fail
if an invalid cgroup is specified.
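
A similar sketch for the OS reservation (values are illustrative;
`/system.slice` is the conventional top-level group on systemd machines):

```shell
# Values are examples only; the cgroup must already exist.
kubelet --system-reserved=cpu=100m,memory=256Mi \
        --system-reserved-cgroup=/system.slice
```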

### Eviction Thresholds

- **Kubelet Flag**: `--eviction-hard=[memory.available<500Mi]`

Memory pressure at the node level leads to system OOMs, which affect the entire
node and all pods running on it. Nodes can go offline temporarily until memory
has been reclaimed. To avoid (or reduce the probability of) system OOMs, the
kubelet provides [`Out of Resource`](./out-of-resource.md) management. Evictions
are supported for `memory` and `storage` only. By reserving some memory via the
`--eviction-hard` flag, the `kubelet` attempts to `evict` pods whenever memory
availability on the node drops below the reserved value. Hypothetically, if
system daemons did not exist on a node, pods could not use more than `capacity -
eviction-hard`. For this reason, resources reserved for evictions are not
available for pods.
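
A minimal sketch of the flag; the threshold matches the value used in the flag
summary above, and the quoting keeps the shell from treating `<` as a redirect:

```shell
# Evict pods once free memory on the node drops below 500Mi.
kubelet --eviction-hard="memory.available<500Mi"
```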

### Enforcing Node Allocatable

- **Kubelet Flag**: `--enforce-node-allocatable=pods[,][system-reserved][,][kube-reserved]`

The scheduler treats `Allocatable` as the available `capacity` for pods.

The `kubelet` enforces `Allocatable` across pods by default. Enforcement is
performed by evicting pods whenever the overall usage across all pods exceeds
`Allocatable`. More details on the eviction policy can be found
[here](./out-of-resource.md#eviction-policy). This enforcement is controlled by
specifying the `pods` value in the kubelet flag `--enforce-node-allocatable`.


Optionally, `kubelet` can be made to enforce `kube-reserved` and
`system-reserved` by specifying `kube-reserved` & `system-reserved` values in
the same flag. Note that to enforce `kube-reserved` or `system-reserved`,
`--kube-reserved-cgroup` or `--system-reserved-cgroup` needs to be specified
respectively.
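
Putting the pieces together, a sketch that enforces `Allocatable` on pods as
well as both reservations (all values are illustrative; each `-reserved`
enforcement requires its corresponding `-cgroup` flag):

```shell
# Illustrative configuration; the cgroups named here must already exist.
kubelet --enforce-node-allocatable=pods,kube-reserved,system-reserved \
        --kube-reserved=cpu=100m,memory=256Mi \
        --kube-reserved-cgroup=/runtime.slice \
        --system-reserved=cpu=100m,memory=256Mi \
        --system-reserved-cgroup=/system.slice \
        --eviction-hard="memory.available<500Mi"
```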

## General Guidelines

System daemons are expected to be treated similarly to `Guaranteed` pods. System
daemons can burst within their bounding control groups and this behavior needs
to be managed as part of kubernetes deployments. For example, `kubelet` should
have its own control group and share `kube-reserved` resources with the
container runtime. However, `kubelet` cannot burst and use up all available Node
resources if `kube-reserved` is enforced.

Be extra careful while enforcing `system-reserved` reservation since it can lead
to critical system services being CPU starved or OOM killed on the node. The
recommendation is to enforce `system-reserved` only if a user has profiled their
nodes exhaustively to come up with precise estimates and is confident in their
ability to recover if any process in that group is oom_killed.

* To begin with, enforce `Allocatable` on `pods`.
* Once adequate monitoring and alerting is in place to track kube system
daemons, attempt to enforce `kube-reserved` based on usage heuristics.
* If absolutely necessary, enforce `system-reserved` over time.

The resource requirements of kube system daemons may grow over time as more and
more features are added. Over time, the kubernetes project will attempt to bring
down utilization of node system daemons, but that is not a priority as of now.
So expect a drop in `Allocatable` capacity in future releases.

## Example Scenario
Here is an example to illustrate Node Allocatable computation:

[…]
* `--eviction-hard` is set to `memory.available<500Mi`

Under this scenario, `Allocatable` will be `14.5 CPUs` & `28.5Gi` of memory.
The scheduler ensures that the total `requests` across all pods on this node
does not exceed `28.5Gi`. The `kubelet` evicts pods whenever the overall memory
usage across pods exceeds `28.5Gi`. If all processes on the node consume as much
CPU as they can, pods together cannot consume more than `14.5 CPUs`.

If `kube-reserved` and/or `system-reserved` is not enforced and system daemons
exceed their reservation, `kubelet` evicts pods whenever the overall node memory
usage is higher than `31.5Gi`.
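
Only the memory arithmetic is spelled out below, and only what can be derived
from the figures in this section (the exact split between `kube-reserved` and
`system-reserved` is not shown here):

```text
Capacity (memory)                        32Gi   (= 31.5Gi eviction point + 0.5Gi eviction-hard)
- (kube-reserved + system-reserved)       3Gi
- eviction-hard                         0.5Gi
-----------------------------------------------
Allocatable (memory)                   28.5Gi
```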

## Feature Availability

As of Kubernetes version 1.2, it has been possible to **optionally** specify
`kube-reserved` and `system-reserved` reservations. The scheduler switched to
using `Allocatable` instead of `Capacity` when available in the same release.

As of Kubernetes version 1.6, `eviction-thresholds` are taken into account when
computing `Allocatable`. To revert to the old behavior, set the
`--experimental-allocatable-ignore-eviction` kubelet flag to `true`.

As of Kubernetes version 1.6, `kubelet` enforces `Allocatable` on pods using
control groups. To revert to the old behavior, unset the
`--enforce-node-allocatable` kubelet flag. Note that unless `--kube-reserved`,
`--system-reserved`, or `--eviction-hard` flags have non-default values,
`Allocatable` enforcement does not affect existing deployments.

As of Kubernetes version 1.6, `kubelet` launches pods in their own cgroup
sandbox in a dedicated part of the cgroup hierarchy it manages. Operators are
required to drain their nodes prior to upgrading the `kubelet` from prior
versions to ensure that pods and their associated containers are launched in the
proper part of the cgroup hierarchy.