The Node Problem Detector monitors the health of your nodes by detecting specific problems and reporting them to the API server. The detector runs as a daemonset on each node.
Important: The Node Problem Detector is a Technology Preview feature only.
The Node Problem Detector reads system logs, watches for specific entries, and makes these problems visible to the control plane. You can view them using {product-title} commands, such as oc get node and oc get event. You can then take action to correct these problems as appropriate or capture the messages using a tool of your choice, such as the {product-title} log monitoring.
Detected problems can be in one of the following categories:
- NodeCondition: A permanent problem that makes the node unavailable for pods. The node condition will not be cleared until the host is rebooted.
- Event: A temporary problem that has limited impact on a node, but is informative.
The Node Problem Detector can detect:
- container runtime issues:
  - unresponsive runtime daemons
- hardware issues:
  - bad CPU
  - bad memory
  - bad disk
- kernel issues:
  - kernel deadlock conditions
  - corrupted file systems
  - unresponsive runtime daemons
- infrastructure daemon issues:
  - NTP service outages
The following examples show output from the Node Problem Detector watching for the KernelDeadlock node condition on a specific node. The command uses oc get node to retrieve the node definition and filters the output for KernelDeadlock entries.
# oc get node <node> -o yaml | grep -B5 KernelDeadlock
Sample output when no deadlock is detected:
message: kernel has no deadlock
reason: KernelHasNoDeadlock
status: false
type: KernelDeadlock

Sample output when a deadlock is detected:
message: task docker:1234 blocked for more than 120 seconds
reason: DockerHung
status: true
type: KernelDeadlock
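The same check can be scripted instead of read from grep output. The following is a minimal Python sketch, assuming the node has been exported with oc get node <node> -o json and parsed into a dictionary. The helper names find_condition and has_kernel_deadlock and the sample data are hypothetical; they are not part of the Node Problem Detector or the oc client.

```python
import json


def find_condition(node, cond_type):
    """Return the first status.conditions entry with the given type, or None."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


def has_kernel_deadlock(node):
    """True only when a KernelDeadlock condition exists and its status is true."""
    cond = find_condition(node, "KernelDeadlock")
    # Condition status is normally a string such as "True" or "False";
    # compare case-insensitively so a lowercase "true" is handled as well.
    return cond is not None and str(cond.get("status", "")).lower() == "true"


# Hypothetical sample that mirrors the DockerHung output shown above.
SAMPLE_NODE = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True", "reason": "KubeletReady",
       "message": "kubelet is posting ready status"},
      {"type": "KernelDeadlock", "status": "True", "reason": "DockerHung",
       "message": "task docker:1234 blocked for more than 120 seconds"}
    ]
  }
}
""")

if __name__ == "__main__":
    cond = find_condition(SAMPLE_NODE, "KernelDeadlock")
    print(cond["reason"], "-", cond["message"])
    print("deadlock reported:", has_kernel_deadlock(SAMPLE_NODE))
```

Looking up the condition by its type field rather than by position keeps the check independent of the order in which conditions are listed.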
This example shows output from the Node Problem Detector watching for events on a node. The following command uses oc get event against the default project, watching for events listed in the kernel-monitor.json section of the Node Problem Detector configuration map.
# oc get event -n default --field-selector=source=kernel-monitor --watch
LAST SEEN                       FIRST SEEN                      COUNT   NAME       KIND   SUBOBJECT   TYPE      REASON       SOURCE                    MESSAGE
2018-06-27 09:08:27 -0400 EDT   2018-06-27 09:08:27 -0400 EDT   1       my-node1   node               Warning   TaskHung     kernel-monitor.my-node1   docker:1234 blocked for more than 300 seconds
2018-06-27 09:08:27 -0400 EDT   2018-06-27 09:08:27 -0400 EDT   3       my-node2   node               Warning   KernelOops   kernel-monitor.my-node2   BUG: unable to handle kernel NULL pointer dereference at nowhere
2018-06-27 09:08:27 -0400 EDT   2018-06-27 09:08:27 -0400 EDT   1       my-node1   node               Warning   KernelOops   kernel-monitor.my-node2   divide error: 0000 [#0] SMP
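When many events arrive, a per-reason summary can be easier to read than the raw table. The following is a minimal Python sketch, assuming the events were exported with oc get event -n default -o json and that the items list from that output is passed in. The function name count_by_reason and the sample events are hypothetical, and the sketch assumes that the monitor name appears in the source.component field of each event.

```python
from collections import Counter


def count_by_reason(events, source_prefix="kernel-monitor"):
    """Sum the count field per reason for events from the given source."""
    totals = Counter()
    for ev in events:
        component = ev.get("source", {}).get("component", "")
        if component.startswith(source_prefix):
            # Events without an explicit count are treated as one occurrence.
            totals[ev.get("reason", "Unknown")] += ev.get("count", 1)
    return dict(totals)


# Hypothetical events loosely mirroring the sample output above,
# plus one unrelated kubelet event that should be ignored.
SAMPLE_EVENTS = [
    {"reason": "TaskHung", "type": "Warning", "count": 1,
     "source": {"component": "kernel-monitor", "host": "my-node1"},
     "message": "docker:1234 blocked for more than 300 seconds"},
    {"reason": "KernelOops", "type": "Warning", "count": 3,
     "source": {"component": "kernel-monitor", "host": "my-node2"},
     "message": "BUG: unable to handle kernel NULL pointer dereference at nowhere"},
    {"reason": "KernelOops", "type": "Warning", "count": 1,
     "source": {"component": "kernel-monitor", "host": "my-node2"},
     "message": "divide error: 0000 [#0] SMP"},
    {"reason": "Started", "type": "Normal", "count": 5,
     "source": {"component": "kubelet", "host": "my-node1"},
     "message": "Started container"},
]

if __name__ == "__main__":
    for reason, total in count_by_reason(SAMPLE_EVENTS).items():
        print(reason, total)
```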
Note: The Node Problem Detector consumes resources. If you use the Node Problem Detector, make sure you have enough nodes to balance cluster performance.
If openshift_node_problem_detector_install was set to true in the /etc/ansible/hosts inventory file, the installation creates a Node Problem Detector daemonset by default and creates a project for the detector, called openshift-node-problem-detector.
Note
|
Because the Node Problem Detector is in Technology Preview, the |
If the Node Problem Detector is not installed, change to the playbook directory and run the openshift-node-problem-detector/config.yml playbook to install Node Problem Detector:
$ cd /usr/share/ansible/openshift-ansible
$ ansible-playbook playbooks/openshift-node-problem-detector/config.yml
You can configure the Node Problem Detector to watch for any log string by editing the Node Problem Detector configuration map.
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector
data:
  docker-monitor.json: | (1)
    {
      "plugin": "journald", (2)
      "pluginConfig": {
        "source": "docker"
      },
      "logPath": "/host/log/journal", (3)
      "lookback": "5m",
      "bufferSize": 10,
      "source": "docker-monitor",
      "conditions": [],
      "rules": [ (4)
        {
          "type": "temporary", (5)
          "reason": "CorruptDockerImage", (6)
          "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*" (7)
        }
      ]
    }
  kernel-monitor.json: | (8)
    {
      "plugin": "journald", (2)
      "pluginConfig": {
        "source": "kernel"
      },
      "logPath": "/host/log/journal", (3)
      "lookback": "5m",
      "bufferSize": 10,
      "source": "kernel-monitor",
      "conditions": [ (4)
        {
          "type": "KernelDeadlock", (5)
          "reason": "KernelHasNoDeadlock", (6)
          "message": "kernel has no deadlock" (7)
        }
      ],
      "rules": [
        {
          "type": "temporary",
          "reason": "OOMKilling",
          "pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB"
        },
        {
          "type": "temporary",
          "reason": "TaskHung",
          "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
        },
        {
          "type": "temporary",
          "reason": "UnregisterNetDevice",
          "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
        },
        {
          "type": "temporary",
          "reason": "KernelOops",
          "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
        },
        {
          "type": "temporary",
          "reason": "KernelOops",
          "pattern": "divide error: 0000 \\[#\\d+\\] SMP"
        },
        {
          "type": "permanent",
          "condition": "KernelDeadlock",
          "reason": "AUFSUmountHung",
          "pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
        },
        {
          "type": "permanent",
          "condition": "KernelDeadlock",
          "reason": "DockerHung",
          "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
        }
      ]
    }
(1) Rules and conditions that apply to container images.
(2) Monitoring services, in a comma-separated list.
(3) Path to the monitoring service log.
(4) List of events to be monitored.
(5) Label to indicate whether the error is an event (temporary) or a NodeCondition (permanent).
(6) Text message to describe the error.
(7) Error message that the Node Problem Detector watches for.
(8) Rules and conditions that apply to the kernel.
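To see how the rules map log lines to events and node conditions, the patterns can be tried against sample messages. The following is a minimal Python sketch that copies three rules from the kernel-monitor.json section and searches a line with each pattern. It is an approximation only: the helper names match_rules and describe are hypothetical, Python's re module is assumed to treat these simple patterns the same way as the detector's own regular expression engine, and the detector's exact matching semantics (for example, how it buffers lines or handles several matching rules) are not modeled here.

```python
import re

# A subset of the rules from the kernel-monitor.json section.
# The doubled backslashes in the JSON become single backslashes in the regex.
RULES = [
    {"type": "temporary", "reason": "TaskHung",
     "pattern": r"task \S+:\w+ blocked for more than \w+ seconds\."},
    {"type": "temporary", "reason": "KernelOops",
     "pattern": r"divide error: 0000 \[#\d+\] SMP"},
    {"type": "permanent", "condition": "KernelDeadlock", "reason": "DockerHung",
     "pattern": r"task docker:\w+ blocked for more than \w+ seconds\."},
]


def match_rules(line, rules=RULES):
    """Return every rule whose pattern is found anywhere in the log line."""
    return [rule for rule in rules if re.search(rule["pattern"], line)]


def describe(line):
    """Return (reason, category) pairs for the rules that match the line."""
    return [
        (rule["reason"],
         "NodeCondition" if rule["type"] == "permanent" else "Event")
        for rule in match_rules(line)
    ]


if __name__ == "__main__":
    for sample in (
        "kernel: divide error: 0000 [#0] SMP.",
        "kernel: task docker:7 blocked for more than 300 seconds.",
    ):
        print(sample, "->", describe(sample))
```

In this sketch the second sample line matches both the general TaskHung rule and the more specific DockerHung rule, which illustrates why overlapping patterns deserve attention when you add your own rules.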
To configure the Node Problem Detector, add or remove problem conditions and events.
- Edit the Node Problem Detector configuration map with a text editor.
oc edit configmap -n openshift-node-problem-detector node-problem-detector
- Remove, add, or edit any node conditions or events as needed.
{
  "type": <temporary or permanent>,
  "reason": <free-form text describing the error>,
  "pattern": <log message to watch for>
},
For example:
{
  "type": "temporary",
  "reason": "UnregisterNetDevice",
  "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
},
- Restart running pods to apply the changes. To restart pods, you can delete all existing pods:
# oc delete pods -n openshift-node-problem-detector -l name=node-problem-detector
- To display Node Problem Detector output to standard output (stdout) and standard error (stderr), add the following to the Node Problem Detector daemonset:
spec:
  template:
    spec:
      containers:
      - name: node-problem-detector
        command:
        - node-problem-detector
        - --alsologtostderr=true (1)
        - --log_dir="/tmp" (2)
        - --system-log-monitors=/etc/npd/kernel-monitor.json,/etc/npd/docker-monitor.json (3)
(1) Sends the output to standard output (stdout).
(2) Path to the error log.
(3) Comma-separated list of paths to the plug-in configuration files.
- To verify that the Node Problem Detector is active:
  - Run the following command to get the name of the Node Problem Detector pod:
# oc get pods -n openshift-node-problem-detector
NAME                          READY     STATUS    RESTARTS   AGE
node-problem-detector-8z8r8   1/1       Running   0          1h
node-problem-detector-nggjv   1/1       Running   0          1h
  - Run the following command to view log information on the Node Problem Detector pod:
# oc logs -n openshift-node-problem-detector <pod_name>
The output should be similar to the following:
# oc logs -n openshift-node-problem-detector node-problem-detector-c6kng
I0416 23:22:00.641354 1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]
- Test the Node Problem Detector by simulating an event on the node:
# echo "kernel: divide error: 0000 [#0] SMP." >> /dev/kmsg
- Test the Node Problem Detector by simulating a condition on the node:
# echo "kernel: task docker:7 blocked for more than 300 seconds." >> /dev/kmsg
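Before saving an edited rule in the configuration map as described in the procedure above, it can help to check the entry offline. The following is a minimal Python sketch; the function name validate_rule and the specific checks are assumptions made for illustration and do not reproduce the detector's own validation. In particular, the sketch flags a permanent rule without a condition only because every permanent rule in the sample configuration names one, and a pattern that compiles with Python's re module is not guaranteed to compile in the detector.

```python
import json
import re

ALLOWED_TYPES = {"temporary", "permanent"}


def validate_rule(rule):
    """Return a list of problems found in one rule entry; empty means none found."""
    problems = []
    if rule.get("type") not in ALLOWED_TYPES:
        problems.append("type must be 'temporary' or 'permanent'")
    if not rule.get("reason"):
        problems.append("reason is missing")
    # Assumption: permanent rules name the condition they set,
    # as the permanent rules in the sample configuration do.
    if rule.get("type") == "permanent" and not rule.get("condition"):
        problems.append("permanent rule does not name a condition")
    pattern = rule.get("pattern")
    if not pattern:
        problems.append("pattern is missing")
    else:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append("pattern does not compile: %s" % exc)
    return problems


# The example rule from the procedure, written exactly as it appears in the
# JSON; json.loads turns each doubled backslash into a single one.
RULE_JSON = r"""
{
  "type": "temporary",
  "reason": "UnregisterNetDevice",
  "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
}
"""

if __name__ == "__main__":
    rule = json.loads(RULE_JSON)
    print(validate_rule(rule) or "no problems found")
    print(validate_rule({"type": "sometimes", "pattern": "("}))
```

Parsing the rule with json.loads first also catches JSON syntax slips, such as a single backslash in a pattern, before the configuration map is changed.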
To uninstall the Node Problem Detector:
- Add the following option to the Ansible inventory file:
[OSEv3:vars]
openshift_node_problem_detector_state=absent
- Change to the playbook directory and run the config.yml Ansible playbook:
$ cd /usr/share/ansible/openshift-ansible
$ ansible-playbook playbooks/openshift-node-problem-detector/config.yml