When you perform node management operations, the CLI interacts with node objects that are representations of actual node hosts. The master uses the information from node objects to validate nodes with health checks.
To list all nodes that are known to the master:
$ oc get nodes

NAME                 STATUS    ROLES     AGE       VERSION
master.example.com   Ready     master    7h        v1.9.1+a0ce1bc657
node1.example.com    Ready     compute   7h        v1.9.1+a0ce1bc657
node2.example.com    Ready     compute   7h        v1.9.1+a0ce1bc657
To list information about only a single node, replace <node> with the full node name:
$ oc get node <node>
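For example, using the node1.example.com node from the listing above (output reconstructed from that listing):

$ oc get node node1.example.com

NAME                STATUS    ROLES     AGE       VERSION
node1.example.com   Ready     compute   7h        v1.9.1+a0ce1bc657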
The STATUS column in the output of these commands can show nodes with the following conditions:
Condition | Description
---|---
Ready | The node is passing the health checks performed from the master by returning StatusOK.
NotReady | The node is not passing the health checks performed from the master.
SchedulingDisabled | Pods cannot be scheduled for placement on the node.

Note: The STATUS column can also show Unknown for a node if the CLI cannot find any node condition.
To get more detailed information about a specific node, including the reason for the current condition:
$ oc describe node <node>
For example:
# oc describe node node1.example.com

Name:               node1.example.com (1)
Roles:              master (2)
Labels:             beta.kubernetes.io/arch=amd64 (3)
                    beta.kubernetes.io/os=linux
                    glusterfs=storage-host
                    kubernetes.io/hostname=node1.example.com
                    node-role.kubernetes.io/master=true
                    region=infra
                    zone=default
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach=true (4)
Taints:             <none> (5)
CreationTimestamp:  Sat, 07 Apr 2018 16:49:31 +0530
Conditions: (6)
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Wed, 30 May 2018 14:19:56 +0530   Sat, 07 Apr 2018 16:49:31 +0530   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Wed, 30 May 2018 14:19:56 +0530   Sat, 07 Apr 2018 16:49:31 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 30 May 2018 14:19:56 +0530   Sat, 07 Apr 2018 16:49:31 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            True    Wed, 30 May 2018 14:19:56 +0530   Fri, 11 May 2018 15:09:42 +0530   KubeletReady                 kubelet is posting ready status
Addresses: (7)
  InternalIP:  10.74.2.51
  Hostname:    node1.example.com
Capacity: (8)
  cpu:     2
  memory:  16208764Ki
  pods:    20
Allocatable:
  cpu:     2
  memory:  16106364Ki
  pods:    20
System Info: (9)
  Machine ID:                 edd6dd9a8530d2c8c442c2d4e7f
  System UUID:                EDD6DD9A-853BD2C-8C442C2D4E7F
  Boot ID:                    750641e2-1b5b02b-ee068821264e
  Kernel Version:             3.10.0-693.el7.x86_64
  OS Image:                   OpenShift Enterprise
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://1.13.1
  Kubelet Version:            v1.9.1+a0ce1bc657
  Kube-Proxy Version:         v1.9.1+a0ce1bc657
ExternalID:                   vm252-51.example.com
Non-terminated Pods:          (6 in total) (10)
  Namespace              Name                            CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------              ----                            ------------  ----------  ---------------  -------------
  default                docker-registry-1-prph7         100m (5%)     0 (0%)      256Mi (1%)       0 (0%)
  default                router-1-z625j                  100m (5%)     0 (0%)      256Mi (1%)       0 (0%)
  glusterfs              glusterfs-storage-8h59h         100m (5%)     0 (0%)      100Mi (0%)       0 (0%)
  openshift-metrics      prometheus-0                    1 (50%)       1 (50%)     1Gi (6%)         1Gi (6%)
  openshift-metrics      prometheus-node-exporter-j5hbp  100m (5%)     200m (10%)  30Mi (0%)        50Mi (0%)
  openshift-web-console  webconsole-6c488985dd-4txcz     100m (5%)     0 (0%)      100Mi (0%)       0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  1500m (75%)   1200m (60%)  1766Mi (11%)     1074Mi (6%)
Events:  <none>
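The numbered callouts mark the main sections of the output:

(1) The name of the node.
(2) The role of the node, either master or compute.
(3) The labels applied to the node.
(4) The annotations applied to the node.
(5) The taints applied to the node.
(6) The node conditions.
(7) The IP address and host name of the node.
(8) The node's capacity and allocatable resources.
(9) Information about the node host.
(10) The pods on the node.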
You can display usage statistics about nodes, which provide the runtime environments for containers. These usage statistics include CPU, memory, and storage consumption.
To view the usage statistics:
$ oc adm top nodes

NAME       CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%
node-1     297m         29%       4263Mi          55%
node-0     55m          5%        1201Mi          15%
infra-1    85m          8%        1319Mi          17%
infra-0    182m         18%       2524Mi          32%
master-0   178m         8%        2584Mi          16%
To view the usage statistics for nodes with labels:
$ oc adm top node --selector=''

You must specify the selector (label query) to filter on. The =, ==, and != operators are supported.
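For example, a sketch that shows statistics only for nodes carrying the region=infra label seen in the node description above (the label is illustrative; use whatever labels exist in your cluster):

$ oc adm top node --selector='region=infra'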
Note: You must have cluster-reader permission to view the usage statistics.

Note: Metrics must be installed to view the usage statistics.
When you delete a node using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node itself are not deleted. Any bare pods not backed by a replication controller become inaccessible to {product-title}, pods backed by replication controllers are rescheduled to other available nodes, and local manifest pods must be deleted manually.
To delete a node from the {product-title} cluster:
- Evacuate pods from the node you are preparing to delete.
- Delete the node object:
$ oc delete node <node>
- Check that the node has been removed from the node list:
$ oc get nodes
Pods should now be scheduled only on the remaining nodes that are in Ready state.
- If you want to uninstall all {product-title} content from the node host, including all pods and containers, continue to Uninstalling Nodes and follow the procedure using the uninstall.yml playbook. The procedure assumes a general understanding of the cluster installation process using Ansible.
To add or update labels on a node:
$ oc label node <node> <key_1>=<value_1> ... <key_n>=<value_n>
To see more detailed usage:
$ oc label -h
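For example, a sketch that updates the zone label already present on node1.example.com in the node description above (the new value is illustrative; --overwrite is required when the key already exists):

$ oc label node node1.example.com zone=east --overwrite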
To list all or selected pods on one or more nodes:
$ oc adm manage-node <node1> <node2> \
    --list-pods [--pod-selector=<pod_selector>] [-o json|yaml]
To list all or selected pods on selected nodes:
$ oc adm manage-node --selector=<node_selector> \
    --list-pods [--pod-selector=<pod_selector>] [-o json|yaml]
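For example, a sketch that lists the pods on node1.example.com, filtered by a hypothetical app=mysql pod label:

$ oc adm manage-node node1.example.com --list-pods --pod-selector=app=mysql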
By default, healthy nodes with a Ready status are marked as schedulable, meaning that new pods are allowed for placement on the node. Manually marking a node as unschedulable blocks any new pods from being scheduled on the node. Existing pods on the node are not affected.
To mark a node or nodes as unschedulable:
$ oc adm manage-node <node1> <node2> --schedulable=false
For example:
$ oc adm manage-node node1.example.com --schedulable=false

NAME                LABELS                                      STATUS
node1.example.com   kubernetes.io/hostname=node1.example.com   Ready,SchedulingDisabled
To mark a currently unschedulable node or nodes as schedulable:
$ oc adm manage-node <node1> <node2> --schedulable
Alternatively, instead of specifying specific node names (for example, <node1> <node2>), you can use the --selector=<node_selector> option to mark selected nodes as schedulable or unschedulable.
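For example, a sketch that marks every node carrying the region=infra label (a label shown in the node description above) as unschedulable:

$ oc adm manage-node --selector='region=infra' --schedulable=false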
Evacuating pods allows you to migrate all or selected pods from a given node or nodes. Nodes must first be marked unschedulable to perform pod evacuation.
Only pods backed by a replication controller can be evacuated; the replication controllers create new pods on other nodes and remove the existing pods from the specified node(s). Bare pods, meaning those not backed by a replication controller, are unaffected by default.
To evacuate all or selected pods on one or more nodes:
$ oc adm drain <node1> <node2> [--pod-selector=<pod_selector>]
You can force deletion of bare pods by using the --force option. When set to true, deletion continues even if there are pods not managed by a replication controller, ReplicaSet, job, DaemonSet, or StatefulSet:
$ oc adm drain <node1> <node2> --force=true
You can use --grace-period to set a period of time in seconds for each pod to terminate gracefully. If negative, the default value specified in the pod is used:
$ oc adm drain <node1> <node2> --grace-period=-1
You can use --ignore-daemonsets and set it to true to ignore DaemonSet-managed pods:
$ oc adm drain <node1> <node2> --ignore-daemonsets=true
You can use --timeout to set the length of time to wait before giving up. A value of 0 sets an infinite length of time:
$ oc adm drain <node1> <node2> --timeout=5s
You can use --delete-local-data and set it to true to continue deletion even if there are pods using emptyDir (local data that is deleted when the node is drained):
$ oc adm drain <node1> <node2> --delete-local-data=true
To list objects that will be migrated without actually performing the evacuation, use the --dry-run option and set it to true:
$ oc adm drain <node1> <node2> --dry-run=true
Instead of specifying specific node names (for example, <node1> <node2>), you can use the --selector=<node_selector> option to evacuate pods on selected nodes.
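Putting several of these options together, a sketch that drains node1.example.com, removing bare pods, ignoring DaemonSet pods, deleting emptyDir data, and giving each pod 30 seconds to terminate (the node name and values are illustrative):

$ oc adm drain node1.example.com --force=true --ignore-daemonsets=true \
    --delete-local-data=true --grace-period=30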
To reboot a node without causing an outage for applications running on the platform, it is important to first evacuate the pods. For pods that are made highly available by the routing tier, nothing else needs to be done. For other pods needing storage, typically databases, it is critical to ensure that they can remain in operation with one pod temporarily going offline. While implementing resiliency for stateful pods is different for each application, in all cases it is important to configure the scheduler to use node anti-affinity to ensure that the pods are properly spread across available nodes.
Another challenge is how to handle nodes that are running critical infrastructure such as the router or the registry. The same node evacuation process applies, though it is important to understand certain edge cases.
Infrastructure nodes are nodes that are labeled to run pieces of the {product-title} environment. Currently, the easiest way to manage node reboots is to ensure that there are at least three nodes available to run infrastructure. The scenario below demonstrates a common mistake that can lead to service interruptions for the applications running on {product-title} when only two nodes are available.
- Node A is marked unschedulable and all pods are evacuated.
- The registry pod running on that node is now redeployed on node B. This means node B is now running both registry pods.
- Node B is now marked unschedulable and is evacuated.
- The service exposing the two pod endpoints on node B, for a brief period of time, loses all endpoints until they are redeployed to node A.
The same process using three infrastructure nodes does not result in a service disruption. However, due to pod scheduling, the last node that is evacuated and brought back into rotation is left running zero registries. The other two nodes will run two and one registries, respectively. The best solution is to rely on pod anti-affinity. This is an alpha feature in Kubernetes that is available for testing now, but is not yet supported for production workloads.
Pod anti-affinity is slightly different from node anti-affinity. Node anti-affinity can be violated if there are no other suitable locations to deploy a pod. Pod anti-affinity can be set to either required or preferred.
Using the docker-registry pod as an example, the first step in enabling this feature is to set the scheduler.alpha.kubernetes.io/affinity annotation on the pod. Since this pod uses a deployment configuration, the most appropriate place to add the annotation is to the pod template's metadata.
$ oc edit dc/docker-registry -o yaml

...
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/affinity: |
          {
            "podAntiAffinity": {
              "requiredDuringSchedulingIgnoredDuringExecution": [{
                "labelSelector": {
                  "matchExpressions": [{
                    "key": "docker-registry",
                    "operator": "In",
                    "values": ["default"]
                  }]
                },
                "topologyKey": "kubernetes.io/hostname"
              }]
            }
          }
Important: This example assumes the Docker registry pod has a label of docker-registry=default. Pod anti-affinity can use any Kubernetes match expression.
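To confirm that the registry pods actually carry that label before relying on the rule, a quick check (the default project is assumed here):

$ oc get pods -n default -l docker-registry=default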
The last required step is to enable the MatchInterPodAffinity scheduler predicate in /etc/origin/master/scheduler.json. With this in place, if only two infrastructure nodes are available and one is rebooted, the Docker registry pod is prevented from running on the other node. oc get pods reports the pod as unready until a suitable node is available. Once a node is available and all pods are back in ready state, the next node can be restarted.
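A minimal sketch of what the predicates section of /etc/origin/master/scheduler.json might look like with MatchInterPodAffinity enabled; a real policy file lists the full default set of predicates and priorities, of which only a representative subset is shown here:

{
  "apiVersion": "v1",
  "kind": "Policy",
  "predicates": [
    {"name": "MatchInterPodAffinity"},
    {"name": "PodFitsPorts"},
    {"name": "PodFitsResources"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1}
  ]
}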
In most cases, a pod running an {product-title} router will expose a host port. The PodFitsPorts scheduler predicate ensures that no router pods using the same port can run on the same node, and pod anti-affinity is achieved. If the routers are relying on IP failover for high availability, there is nothing else that is needed. For router pods relying on an external service such as AWS Elastic Load Balancing for high availability, it is that service's responsibility to react to router pod restarts.
In rare cases, a router pod may not have a host port configured. In those cases, it is important to follow the recommended restart process for infrastructure nodes.
You can configure node resources by adding kubelet arguments to the node configuration file (/etc/origin/node/node-config.yaml). Add the kubeletArguments section and include any desired options:
kubeletArguments:
  max-pods: (1)
  - "40"
  resolv-conf: (2)
  - "/etc/resolv.conf"
  image-gc-high-threshold: (3)
  - "90"
  image-gc-low-threshold: (4)
  - "80"
(1) Maximum number of pods that can run on this kubelet.
(2) Resolver configuration file used as the basis for the container DNS resolution configuration.
(3) The percent of disk usage after which image garbage collection is always run. Default: 90%
(4) The percent of disk usage before which image garbage collection is never run. Lowest disk usage to garbage collect to. Default: 80%
To view all available kubelet options:
$ kubelet -h
This can also be set during a cluster installation using the openshift_node_kubelet_args variable. For example:
openshift_node_kubelet_args={'max-pods': ['40'], 'resolv-conf': ['/etc/resolv.conf'], 'image-gc-high-threshold': ['90'], 'image-gc-low-threshold': ['80']}
Note: See the Cluster Limits page for the maximum supported limits for each version of {product-title}.
In the /etc/origin/node/node-config.yaml file, two parameters control the maximum number of pods that can be scheduled to a node: pods-per-core and max-pods. When both options are in use, the lower of the two limits the number of pods on a node. Exceeding these values can result in:
- Increased CPU utilization on both {product-title} and Docker.
- Slow pod scheduling.
- Potential out-of-memory scenarios, depending on the amount of memory in the node.
- Exhausting the pool of IP addresses.
- Resource overcommitting, leading to poor user application performance.
Note: In Kubernetes, a pod that is holding a single container actually uses two containers. The second container is used to set up networking prior to the actual container starting. Therefore, a system running 10 pods will actually have 20 containers running.
pods-per-core sets the number of pods the node can run based on the number of processor cores on the node. For example, if pods-per-core is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node is 40.
kubeletArguments:
  pods-per-core:
  - "10"
Note: Setting pods-per-core to 0 disables this limit.
max-pods sets the number of pods the node can run to a fixed value, regardless of the properties of the node. Cluster Limits documents maximum supported values for max-pods.
kubeletArguments:
  max-pods:
  - "250"
Using the above example, the default value for pods-per-core is 10 and the default value for max-pods is 250. This means that unless the node has 25 cores or more, by default, pods-per-core will be the limiting factor.
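As a worked illustration with those defaults: a node with 4 cores can run min(4 × 10, 250) = 40 pods, so pods-per-core is the limiting factor, while a node with 32 cores can run min(32 × 10, 250) = 250 pods, so max-pods is the limiting factor.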
As you download Docker images and run and delete containers, Docker does not always free up mapped disk space. As a result, over time you can run out of space on a node, which might prevent {product-title} from being able to create new pods or cause pod creation to take several minutes.
For example, the following shows pods that are still in the ContainerCreating state after six minutes, and the events log shows a FailedSync event.
$ oc get pod
NAME READY STATUS RESTARTS AGE
cakephp-mysql-persistent-1-build 0/1 ContainerCreating 0 6m
mysql-1-9767d 0/1 ContainerCreating 0 2m
mysql-1-deploy 0/1 ContainerCreating 0 6m
$ oc get events
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
6m 6m 1 cakephp-mysql-persistent-1-build Pod Normal Scheduled default-scheduler Successfully assigned cakephp-mysql-persistent-1-build to ip-172-31-71-195.us-east-2.compute.internal
2m 5m 4 cakephp-mysql-persistent-1-build Pod Warning FailedSync kubelet, ip-172-31-71-195.us-east-2.compute.internal Error syncing pod
2m 4m 4 cakephp-mysql-persistent-1-build Pod Normal SandboxChanged kubelet, ip-172-31-71-195.us-east-2.compute.internal Pod sandbox changed, it will be killed and re-created.
One solution to this problem is to reset Docker storage to remove artifacts not needed by Docker.
On the node where you want to restart Docker storage:
- Run the following command to mark the node as unschedulable:
$ oc adm manage-node <node> --schedulable=false
- Run the following command to shut down Docker and the atomic-openshift-node service:
$ systemctl stop docker atomic-openshift-node
- Run the following command to remove the local volume directory:
$ rm -rf /var/lib/origin/openshift.local.volumes
This command clears the local image cache. As a result, images, including ose-* images, will need to be re-pulled. This might result in slower pod start times while the image store recovers.
- Remove the /var/lib/docker directory:
$ rm -rf /var/lib/docker
- Run the following command to reset the Docker storage:
$ docker-storage-setup --reset
- Run the following command to recreate the Docker storage:
$ docker-storage-setup
- Recreate the /var/lib/docker directory:
$ mkdir /var/lib/docker
- Run the following command to restart Docker and the atomic-openshift-node service:
$ systemctl start docker atomic-openshift-node
- Run the following command to mark the node as schedulable:
$ oc adm manage-node <node> --schedulable=true