adding docs for node allocatable #2649

Merged: 3 commits, merged Mar 15, 2017
Changes from 1 commit
1 change: 1 addition & 0 deletions _data/guides.yml
@@ -180,6 +180,7 @@ toc:
- docs/admin/cluster-management.md
- docs/admin/kubeadm.md
- docs/admin/addons.md
- docs/admin/node-allocatable.md
- docs/admin/audit.md
- docs/admin/ha-master-gce.md
- docs/admin/namespaces/index.md
144 changes: 144 additions & 0 deletions docs/admin/node-allocatable.md
@@ -0,0 +1,144 @@
---
assignees:
- vishh
- derekwaynecarr
- dashpole
title: Reserving Compute Resources for System Daemons
---

* TOC
{:toc}

Kubernetes nodes can be scheduled to `capacity`.
Pods can consume all the available capacity on a node by default.
This is an issue because nodes typically run quite a few system daemons that power the OS and Kubernetes itself.
Unless resources are set aside for these system daemons, pods and system daemons compete for resources, leading to resource starvation issues on the node.
The `kubelet` exposes a feature named `Node Allocatable` that helps reserve compute resources for system daemons.
Kubernetes recommends that cluster administrators configure `Node Allocatable` based on the workload density on each node.

## Node Allocatable

      Node Capacity
      ---------------------------
      |     kube-reserved       |
      |-------------------------|
      |     system-reserved     |
      |-------------------------|
      |    eviction-threshold   |
      |-------------------------|
      |                         |
      |      allocatable        |
      |      (available         |
      |       for pods)         |
      |                         |
      |                         |
      ---------------------------

`Allocatable` on a Kubernetes node is defined as the amount of compute resources that are available for pods.
The scheduler does not oversubscribe `allocatable`.
`CPU` and `memory` are supported as of now.
Support for `storage` will be added in the future.

Node Allocatable is exposed as part of the `v1.Node` object in the API and as part of the `kubectl describe node` output in the CLI.
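
For example, a quick way to inspect both values on a running cluster (the node name `my-node` is illustrative):

```shell
# Show Capacity and Allocatable as reported by kubectl describe.
kubectl describe node my-node | grep -A 6 -E 'Capacity|Allocatable'

# Or read the same field straight from the v1.Node object.
kubectl get node my-node -o jsonpath='{.status.allocatable}'
```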

Resources can be reserved for two categories of system daemons in the `kubelet`.

### Kube Reserved

**Kubelet Flag**: `--kube-reserved=[cpu=100m][,][memory=100Mi]`
> **Member:** 100m,100Mi

**Kubelet Flag**: `--kube-reserved-cgroup=/runtime.slice`
> **Member:** is this the name you are using in your images?

> **Member:** we should make clear that /runtime.slice is not the kubelet default value.

> **Contributor Author:** Good point on the defaults. I hope I have made the defaults clear this time around. PTAL


`kube-reserved` is meant to capture resource reservation for Kubernetes system daemons like the `kubelet`, `container runtime`, `node problem detector`, etc.
It is not meant to reserve resources for system daemons that are run as pods.
`kube-reserved` is typically a function of `pod density` on the nodes.
[This performance dashboard](http://node-perf-dash.k8s.io/#/builds) exposes `cpu` and `memory` usage profiles of `kubelet` and `docker engine` at multiple levels of pod density.
[This blog post](http://blog.kubernetes.io/2016/11/visualize-kubelet-performance-with-node-dashboard.html) explains how the dashboard can be interpreted to come up with a suitable `kube-reserved` reservation.

It is recommended that the Kubernetes system daemons are placed under a top-level control group (for example, `system.slice` on systemd machines).
> **Member:** i think this text should be in system reserved section.
> you should have text specific to kube daemons here..

Each system daemon should ideally run within its own child control group.
Refer to [this doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md#recommended-cgroups-setup) for more details on recommended control group hierarchy.

To optionally enforce `kube-reserved` on Kubernetes system daemons, specify the parent control group for kube daemons as the value of the `--kube-reserved-cgroup` kubelet flag.
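
A minimal sketch of the corresponding kubelet flags (the reservation sizes are illustrative, and `/runtime.slice` is an example rather than a kubelet default):

```shell
# Sketch: reserve resources for Kubernetes system daemons.
# The sizes are illustrative, /runtime.slice is not a kubelet default,
# and the kubelet does not create this cgroup for you.
kubelet \
  --kube-reserved=cpu=100m,memory=256Mi \
  --kube-reserved-cgroup=/runtime.slice
```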

### System Reserved

**Kubelet Flag**: `--system-reserved=[cpu=100m][,][memory=100Mi]`
**Kubelet Flag**: `--system-reserved-cgroup=/system.slice`
> **Member:** make clear this flag has no default.

> **Member:** we need to make clear that the kubelet doesnt create either of these two cgroups.

`system-reserved` is meant to capture resource reservation for OS system daemons like `sshd`, `udev`, etc.
`system-reserved` should reserve `memory` for the `kernel` too since `kernel` memory is not accounted to pods (yet) in Kubernetes.
Reserving resources for user login sessions is also recommended (`user.slice` in the systemd world).

To optionally enforce `system-reserved` on OS system daemons, specify the parent control group for OS system daemons as the value of the `--system-reserved-cgroup` kubelet flag.
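
A comparable sketch for OS daemons (the reservation sizes are again illustrative; `--system-reserved-cgroup` has no default):

```shell
# Sketch: reserve resources for OS daemons such as sshd and udev.
# The sizes are illustrative; --system-reserved-cgroup has no default,
# and the kubelet does not create this cgroup for you.
kubelet \
  --system-reserved=cpu=500m,memory=512Mi \
  --system-reserved-cgroup=/system.slice
```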

### Eviction Thresholds

**Kubelet Flag**: `--eviction-hard=[memory.available<500Mi]`

Memory pressure at the node level leads to system OOMs, which affect the entire node and all pods running on it.
Nodes can go offline temporarily until memory has been reclaimed.
To avoid (or reduce the probability of) system OOMs, the `kubelet` provides [`Out of Resource`](./out-of-resource.md) management.
> **Member:** typo: probability

Evictions are supported for `memory` and `storage` only.
By reserving some memory via the `--eviction-hard` flag, the `kubelet` attempts to evict pods whenever memory availability on the node drops below the reserved value.
Hypothetically, if system daemons did not exist on a node, pods could not use more than `capacity - eviction-hard`.
For this reason, resources reserved for evictions will not be available for pods.
> **Member:** to schedule against?

> **Contributor Author:** Scheduling is meant to be implicit since pods can be placed directly on nodes, bypassing the scheduler.
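
A minimal sketch of setting a hard eviction threshold from the shell (the value matches the flag shown above; quoting keeps `<` from being interpreted as a redirect):

```shell
# Sketch: evict pods once available memory on the node drops below 500Mi.
kubelet --eviction-hard='memory.available<500Mi'
```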


### Enforcing Node Allocatable

**Kubelet Flag**: `--enforce-node-allocatable=[pods][,][system-reserved][,][kube-reserved]`

The scheduler treats `Allocatable` as the available `capacity` for pods.
> **Member:** i would remove the use of will style phrasing in the document as we are describing the present in this doc. The scheduler treats 'Allocatable'...

> **Contributor Author:** ack


The `kubelet` enforces `Allocatable` across pods by default.
This enforcement is controlled by specifying the `pods` value for the `--enforce-node-allocatable` kubelet flag; this is also the flag's default value.
By enforcing at this level, pods in aggregate cannot consume more memory and CPU time than `Allocatable`.
> **Member:** note that this is the default value.

> **Contributor Author:** Ack

> **Member:** maybe explain what enforcement means? for example, by enforcing at this level, we ensure pods cannot consume more memory and cpu time than allocated?


Optionally, the `kubelet` can be made to enforce `kube-reserved` and `system-reserved` by including the `kube-reserved` and `system-reserved` values in the same flag.
Note that to enforce `kube-reserved` or `system-reserved`, `--kube-reserved-cgroup` or `--system-reserved-cgroup` needs to be specified respectively.
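
A sketch of turning on all three enforcement targets (the cgroup names reuse the examples from the earlier sections):

```shell
# Sketch: enforce Allocatable on pods (the default) as well as on the
# kube-reserved and system-reserved cgroups. Enforcing the latter two
# requires the corresponding *-cgroup flags, and those cgroups must
# already exist; the kubelet does not create them.
kubelet \
  --enforce-node-allocatable=pods,kube-reserved,system-reserved \
  --kube-reserved-cgroup=/runtime.slice \
  --system-reserved-cgroup=/system.slice
```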

## General Guidelines

System daemons are expected to be treated similarly to `Guaranteed` pods.
System daemons can burst within their bounding control groups, and this behavior needs to be managed as part of Kubernetes deployments.
For example, the `kubelet` should have its own control group and share `kube-reserved` resources with the container runtime.
However, the `kubelet` cannot burst and use up all available node resources if `kube-reserved` is enforced.

Be extra careful while enforcing the `system-reserved` reservation since it can lead to critical system services being CPU-starved or OOM-killed on the node.
The recommendation is to enforce `system-reserved` only if a user has profiled their nodes exhaustively to come up with precise estimates and is confident in their ability to recover if any process in that group is OOM-killed.
> **Member:** and is confident in their ability to recover if any item in that group is oom_killed.


* To begin with, enforce `Allocatable` on `pods`.
* Once adequate monitoring and alerting is in place to track kube system daemons, attempt to enforce `kube-reserved` based on usage heuristics.
* If absolutely necessary, enforce `system-reserved` over time.
> **Member:** typo on absolutely


The resource requirements of kube system daemons will grow over time as more and more features are added.
Over time, Kubernetes will attempt to bring down the utilization of node system daemons, but that is not a priority as of now.
So expect a drop in `Allocatable` capacity in future releases.

## Example Scenario

Here is an example to illustrate Node Allocatable computation:

* Node has `32Gi` of `memory` and `16 CPUs`
* `--kube-reserved` is set to `cpu=1,memory=2Gi`
* `--system-reserved` is set to `cpu=500m,memory=1Gi`
* `--eviction-hard` is set to `memory.available<500Mi`

Under this scenario, `Allocatable` is `14.5 CPUs` and `28.5Gi` of memory.
The scheduler ensures that the total memory `requests` across all pods on this node does not exceed `28.5Gi`.
The `kubelet` evicts pods whenever the overall memory usage across pods exceeds `28.5Gi`.
If all processes on the node consume as much CPU as they can, pods together cannot consume more than `14.5 CPUs`.
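
The corresponding kubelet flags and the resulting arithmetic, as a sketch:

```shell
# Flags matching the scenario above.
kubelet \
  --kube-reserved=cpu=1,memory=2Gi \
  --system-reserved=cpu=500m,memory=1Gi \
  --eviction-hard='memory.available<500Mi'

# Allocatable CPU    = 16   - 1   - 0.5          = 14.5 CPUs
# Allocatable memory = 32Gi - 2Gi - 1Gi - 0.5Gi  = 28.5Gi
```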

If `kube-reserved` and/or `system-reserved` is not enforced and system daemons exceed their reservation, the `kubelet` evicts pods whenever the overall node memory usage is higher than `31.5Gi` (`capacity - eviction-hard`).

## Feature Availability

Since `v1.2`, it has been possible to **optionally** specify `kube-reserved` and `system-reserved` reservations.
The scheduler switched to using `Allocatable` instead of `Capacity` when available in the same release.

Since `v1.6`, eviction thresholds are taken into account when computing `Allocatable`.
To revert to the old behavior, set the `--experimental-allocatable-ignore-eviction` kubelet flag to `true`.
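
A minimal sketch of that opt-out:

```shell
# Sketch: ignore eviction thresholds when computing Allocatable,
# reverting to the pre-v1.6 behavior.
kubelet --experimental-allocatable-ignore-eviction=true
```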

Since `v1.6`, the `kubelet` enforces `Allocatable` on pods using control groups.
To revert to the old behavior, unset the `--enforce-node-allocatable` kubelet flag.
Note that unless the `--kube-reserved`, `--system-reserved`, or `--eviction-hard` flags have non-default values, `Allocatable` enforcement does not affect existing deployments.