Add documentation for TaintNodesByCondition #5352

Merged
merged 7 commits on Sep 12, 2017
Changes from 1 commit
Add documentation for TaintNodesByCondition
Marek Grabowski committed Sep 11, 2017
commit 268411f6ab7d5ef01bb60ff922ff36894e8f0c31
20 changes: 19 additions & 1 deletion docs/concepts/architecture/nodes.md
@@ -65,7 +65,22 @@ The node condition is represented as a JSON object. For example, the following r

If the Status of the Ready condition is "Unknown" or "False" for longer than the `pod-eviction-timeout` (an argument passed to the [kube-controller-manager](/docs/admin/kube-controller-manager)), all of the Pods on the node are scheduled for deletion by the Node Controller. The default eviction timeout duration is **five minutes**. In some cases when the node is unreachable, the apiserver is unable to communicate with the kubelet on it. The decision to delete the pods cannot be communicated to the kubelet until it re-establishes communication with the apiserver. In the meantime, the pods which are scheduled for deletion may continue to run on the partitioned node.

In versions of Kubernetes prior to 1.5, the node controller would [force delete](/docs/concepts/workloads/pods/pod/#force-deletion-of-pods)
these unreachable pods from the apiserver. However, in 1.5 and higher, the node controller does not force delete pods until it is
confirmed that they have stopped running in the cluster. One can see these pods which may be running on an unreachable node as being in
the "Terminating" or "Unknown" states. In cases where Kubernetes cannot deduce from the underlying infrastructure if a node has
permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from
Kubernetes causes all the Pod objects running on it to be deleted from the apiserver, freeing up their names.

In version 1.8, the ability to automatically create [taints](/docs/concepts/configuration/taint-and-toleration) representing Conditions
was added as an alpha feature. When it is enabled, the scheduler ignores Conditions when considering a Node and instead looks at the
Node's taints and the Pod's tolerations. This lets users choose between the old behavior, where Pods are not scheduled onto Nodes with
certain Conditions (now represented as the corresponding taints), and adding a toleration so that such Pods can be scheduled anyway.
To enable this behavior, pass an additional feature gate flag `--feature-gates=...,TaintNodesByCondition=true` to the apiserver,
controller-manager, and scheduler.
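As a minimal sketch of setting the flag, assuming a kubeadm-style static Pod manifest for the scheduler (the image tag, kubeconfig
path, and other flags shown are illustrative, not taken from this PR); the same gate would be added to the kube-apiserver and
kube-controller-manager manifests as well:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-scheduler
    image: gcr.io/google_containers/kube-scheduler-amd64:v1.8.0   # illustrative tag
    command:
    - kube-scheduler
    - --kubeconfig=/etc/kubernetes/scheduler.conf                 # illustrative path
    # Enable condition-based taints (alpha in 1.8); repeat this gate on the
    # kube-apiserver and kube-controller-manager as well.
    - --feature-gates=TaintNodesByCondition=true
```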

Note that because of the small delay (usually less than one second) between the time a Condition is observed and the corresponding
Taint is created, enabling this feature may slightly increase the number of Pods that are successfully scheduled but then rejected
by the kubelet.

### Capacity

@@ -174,6 +189,9 @@ NodeController is responsible for adding taints corresponding to node problems like
node unreachable or not ready. See [this documentation](/docs/concepts/configuration/taint-and-toleration)
for details about `NoExecute` taints and the alpha feature.

Since Kubernetes 1.8, the NodeController can also be made responsible for creating taints representing
Node Conditions. This is an alpha feature as of 1.8.
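As a rough sketch of what this looks like, with the feature enabled a Node whose MemoryPressure condition is True might carry a
taint like the following (a fragment of the Node object; the taint key matches the list documented later in this PR):

```yaml
# Fragment of a Node object with TaintNodesByCondition enabled and the
# MemoryPressure condition set to True.
spec:
  taints:
  - key: node.kubernetes.io/memory-pressure
    effect: NoSchedule
```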

### Self-Registration of Nodes

When the kubelet flag `--register-node` is true (the default), the kubelet will attempt to
20 changes: 17 additions & 3 deletions docs/concepts/configuration/taint-and-toleration.md
@@ -249,9 +249,23 @@ admission controller](https://git.k8s.io/kubernetes/plugin/pkg/admission/default

* `node.alpha.kubernetes.io/unreachable`
* `node.alpha.kubernetes.io/notReady`
* `node.kubernetes.io/memoryPressure`
* `node.kubernetes.io/diskPressure`
* `node.kubernetes.io/outOfDisk` (*only for critical pods*)

This ensures that DaemonSet pods are never evicted due to these problems,
which matches the behavior when this feature is disabled.
@yastij (Member) commented on Sep 11, 2017:
Maybe add that the admin can always add other tolerations for DaemonSets if they want?

Author (Marek Grabowski) replied:
There's a line about adding tolerations to arbitrary Pods. I can add an explicit mention of DaemonSets, but I'm not sure where and how to phrase it.

@yastij (Member) replied on Sep 11, 2017:
I think making it explicit here should be just fine, maybe like #49784 (comment)?


## Taint Nodes by Condition

In Kubernetes 1.8 we added an alpha feature that makes the NodeController create taints corresponding to node conditions and disables
the condition check in the scheduler (the scheduler checks the taints instead). This ensures that Conditions don't affect what is
scheduled onto the node, and that the user can choose to ignore some of the node's problems (represented as Conditions) by adding
appropriate pod tolerations.
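For instance, a hedged sketch of a Pod that is willing to run on a Node reporting memory pressure: it adds a toleration for the
condition-derived taint (the pod name and image are placeholders; the taint key is taken from the list below):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-pressure-tolerant-pod   # placeholder name
spec:
  containers:
  - name: app
    image: nginx                       # placeholder image
  tolerations:
  # Tolerate the condition-derived taint so the scheduler will still place
  # this Pod on a Node reporting MemoryPressure.
  - key: "node.kubernetes.io/memory-pressure"
    operator: "Exists"
    effect: "NoSchedule"
```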

To make sure that turning on this feature doesn't break DaemonSets, starting in 1.8 the DaemonSet controller automatically adds the
following `NoSchedule` tolerations to all daemons:

* `node.kubernetes.io/memory-pressure`
* `node.kubernetes.io/disk-pressure`
* `node.kubernetes.io/out-of-disk` (*only for critical pods*)

The settings above preserve backward compatibility, but we understand they may not fit every use case, which is why the cluster
admin may choose to add arbitrary tolerations to a DaemonSet.
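For example, a minimal sketch of a DaemonSet whose pod template carries one extra, admin-chosen toleration on top of the
automatically added ones (the DaemonSet name, image, and the `example.com/special-condition` taint key are all hypothetical):

```yaml
apiVersion: apps/v1beta2        # DaemonSet API version in the 1.8 timeframe
kind: DaemonSet
metadata:
  name: node-agent              # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      tolerations:
      # Added by the cluster admin in addition to the NoSchedule tolerations
      # that the DaemonSet controller injects automatically.
      - key: "example.com/special-condition"   # hypothetical taint key
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: agent
        image: node-agent:1.0   # hypothetical image
```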
16 changes: 9 additions & 7 deletions docs/concepts/workloads/controllers/daemonset.md
@@ -103,19 +103,21 @@ but they are created with `NoExecute` tolerations for the following taints with

- `node.alpha.kubernetes.io/notReady`
- `node.alpha.kubernetes.io/unreachable`
- `node.alpha.kubernetes.io/memoryPressure`
- `node.alpha.kubernetes.io/diskPressure`

When the support for critical pods is enabled and the pods in a DaemonSet are
labelled as critical, the Daemon pods are created with an additional
`NoExecute` toleration for the `node.alpha.kubernetes.io/outOfDisk` taint with
no `tolerationSeconds`.

This ensures that when the `TaintBasedEvictions` alpha feature is enabled,
they will not be evicted when there are node problems such as a network partition. (When the
`TaintBasedEvictions` feature is not enabled, they are also not evicted in these scenarios, but
due to hard-coded behavior of the NodeController rather than due to tolerations).

They also tolerate the following `NoSchedule` taints:
- `node.kubernetes.io/memory-pressure`
- `node.kubernetes.io/disk-pressure`

When the support for critical pods is enabled and the pods in a DaemonSet are
labelled as critical, the Daemon pods are created with an additional
`NoSchedule` toleration for the `node.kubernetes.io/out-of-disk` taint.

Note that all of the `NoSchedule` taints above are created only in version 1.8 or later, and only if the alpha feature `TaintNodesByCondition` is enabled.

## Communicating with Daemon Pods
