The Node Problem Detector monitors the health of your nodes by finding certain problems and reporting these problems to the API server. The detector runs as a daemonset on each node.
The Node Problem Detector is a Technology Preview feature only.
The Node Problem Detector reads system logs, watches for specific entries, and makes these problems visible to the control plane. You can view them using {product-title} commands, such as oc get node and oc get event.
You can then take action to correct these problems as appropriate, or capture the messages using a tool of your choice, such as the {product-title} log monitoring.
Detected problems can be in one of the following categories:
NodeCondition: A permanent problem that makes the node unavailable for pods. The node condition will not be cleared until the host is rebooted.
Event: A temporary problem that has limited impact on a node, but is informative.
The Node Problem Detector can detect:
container runtime issues:
unresponsive runtime daemons
hardware issues:
bad CPU
bad memory
bad disk
kernel issues:
kernel deadlock conditions
corrupted file systems
unresponsive runtime daemons
infrastructure daemon issues:
NTP service outages
The following examples show output from the Node Problem Detector watching for the kernel deadlock node condition on a specific node. The command uses oc get node to retrieve the node object and filters the output for the KernelDeadlock condition.
# oc get node <node> -o yaml | grep -B5 KernelDeadlock
message: kernel has no deadlock
reason: KernelHasNoDeadlock
status: false
type: KernelDeadlock

message: task docker:1234 blocked for more than 120 seconds
reason: DockerHung
status: true
type: KernelDeadlock
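The same check can be scripted rather than done with grep. The following is a minimal Python sketch, assuming the node object has been fetched as JSON (for example with oc get node <node> -o json) and loaded into a dictionary. The SAMPLE_NODE data and the find_condition and has_problem helpers are hypothetical and only mirror the fields shown in the output above.

```python
import json

# Hypothetical, trimmed-down node object. A real one would come from
# `oc get node <node> -o json`.
SAMPLE_NODE = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True",
       "reason": "KubeletReady",
       "message": "kubelet is posting ready status"},
      {"type": "KernelDeadlock", "status": "True",
       "reason": "DockerHung",
       "message": "task docker:1234 blocked for more than 120 seconds"}
    ]
  }
}
""")


def find_condition(node, condition_type):
    """Return the condition dict with the given type, or None if absent."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == condition_type:
            return condition
    return None


def has_problem(node, condition_type):
    """True when the condition exists and its status reads as true.

    The comparison is case-insensitive so that both "True" and "true"
    are accepted.
    """
    condition = find_condition(node, condition_type)
    if condition is None:
        return False
    return str(condition.get("status", "")).lower() == "true"


if __name__ == "__main__":
    deadlock = find_condition(SAMPLE_NODE, "KernelDeadlock")
    print(has_problem(SAMPLE_NODE, "KernelDeadlock"), deadlock["reason"])
```

Looking up the condition by its type field avoids depending on how many lines of context grep prints around the match.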
This example shows output from the Node Problem Detector watching for events on a node. The following command uses oc get event against the default project to watch for events listed in the kernel-monitor.json section of the Node Problem Detector configuration map.
# oc get event -n default --field-selector=source=kernel-monitor --watch
LAST SEEN                       FIRST SEEN                      COUNT   NAME       KIND   SUBOBJECT   TYPE      REASON       SOURCE   MESSAGE
2018-06-27 09:08:27 -0400 EDT   2018-06-27 09:08:27 -0400 EDT   1       my-node1   node               Warning   TaskHung              docker:1234 blocked for more than 300 seconds
2018-06-27 09:08:27 -0400 EDT   2018-06-27 09:08:27 -0400 EDT   3       my-node2   node               Warning   KernelOops            BUG: unable to handle kernel NULL pointer dereference at nowhere
2018-06-27 09:08:27 -0400 EDT   2018-06-27 09:08:27 -0400 EDT   1       my-node1   node               Warning   KernelOops            divide error: 0000 [#0] SMP
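When many events arrive, a per-node tally can be easier to read than the raw stream. The following Python sketch assumes the events were exported as JSON (for example with oc get event -n default -o json) and that the items list was loaded into memory. The SAMPLE_EVENTS data and the count_by_node_and_reason helper are hypothetical; the field names follow the standard Kubernetes Event object.

```python
from collections import Counter

# Hypothetical sample shaped like the `items` list of `oc get event -o json`.
SAMPLE_EVENTS = [
    {"involvedObject": {"kind": "Node", "name": "my-node1"},
     "type": "Warning", "reason": "TaskHung", "count": 1,
     "source": {"component": "kernel-monitor"},
     "message": "docker:1234 blocked for more than 300 seconds"},
    {"involvedObject": {"kind": "Node", "name": "my-node2"},
     "type": "Warning", "reason": "KernelOops", "count": 3,
     "source": {"component": "kernel-monitor"},
     "message": "BUG: unable to handle kernel NULL pointer dereference at nowhere"},
    {"involvedObject": {"kind": "Node", "name": "my-node1"},
     "type": "Warning", "reason": "KernelOops", "count": 1,
     "source": {"component": "kernel-monitor"},
     "message": "divide error: 0000 [#0] SMP"},
    # An unrelated event from another source, which the helper skips.
    {"involvedObject": {"kind": "Node", "name": "my-node1"},
     "type": "Normal", "reason": "Starting", "count": 1,
     "source": {"component": "kubelet"},
     "message": "Starting kubelet."},
]


def count_by_node_and_reason(events, source="kernel-monitor"):
    """Sum event counts per (node name, reason) for one event source."""
    totals = Counter()
    for event in events:
        if event.get("source", {}).get("component") != source:
            continue
        key = (event["involvedObject"]["name"], event["reason"])
        totals[key] += event.get("count", 1)
    return totals


if __name__ == "__main__":
    for (node, reason), total in sorted(count_by_node_and_reason(SAMPLE_EVENTS).items()):
        print(node, reason, total)
```

Filtering on source.component in the script mirrors what the --field-selector option does on the command line.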
The Node Problem Detector consumes resources. If you use the Node Problem Detector, make sure you have enough nodes to balance cluster performance.
If openshift_node_problem_detector_install was set to true in the /etc/ansible/hosts inventory file, the installation creates a Node Problem Detector daemonset by default and creates a project for the detector, called openshift-node-problem-detector.
Because the Node Problem Detector is in Technology Preview, the |
If the Node Problem Detector is not installed, change to the playbook directory and run the openshift-node-problem-detector/config.yml playbook to install Node Problem Detector:
$ cd /usr/share/ansible/openshift-ansible
$ ansible-playbook playbooks/openshift-node-problem-detector/config.yml
You can configure the Node Problem Detector to watch for any log string by editing the Node Problem Detector configuration map.
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector
data:
  docker-monitor.json: | (1)
    {
      "plugin": "journald", (2)
      "pluginConfig": {
        "source": "docker"
      },
      "logPath": "/host/log/journal", (3)
      "lookback": "5m",
      "bufferSize": 10,
      "source": "docker-monitor",
      "conditions": [],
      "rules": [ (4)
        {
          "type": "temporary", (5)
          "reason": "CorruptDockerImage", (6)
          "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*" (7)
        }
      ]
    }
  kernel-monitor.json: | (8)
    {
      "plugin": "journald", (2)
      "pluginConfig": {
        "source": "kernel"
      },
      "logPath": "/host/log/journal", (3)
      "lookback": "5m",
      "bufferSize": 10,
      "source": "kernel-monitor",
      "conditions": [ (4)
        {
          "type": "KernelDeadlock", (5)
          "reason": "KernelHasNoDeadlock", (6)
          "message": "kernel has no deadlock" (7)
        }
      ],
      "rules": [
        {
          "type": "temporary",
          "reason": "OOMKilling",
          "pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB"
        },
        {
          "type": "temporary",
          "reason": "TaskHung",
          "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
        },
        {
          "type": "temporary",
          "reason": "UnregisterNetDevice",
          "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
        },
        {
          "type": "temporary",
          "reason": "KernelOops",
          "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
        },
        {
          "type": "temporary",
          "reason": "KernelOops",
          "pattern": "divide error: 0000 \\[#\\d+\\] SMP"
        },
        {
          "type": "permanent",
          "condition": "KernelDeadlock",
          "reason": "AUFSUmountHung",
          "pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
        },
        {
          "type": "permanent",
          "condition": "KernelDeadlock",
          "reason": "DockerHung",
          "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
        }
      ]
    }
(1) Rules and conditions that apply to container images.
(2) Monitoring services, in a comma-separated list.
(3) Path to the monitoring service log.
(4) List of events to be monitored.
(5) Label to indicate whether the error is an event (temporary) or a NodeCondition (permanent).
(6) Text message to describe the error.
(7) Error message that the Node Problem Detector watches for.
(8) Rules and conditions that apply to the kernel.
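To see how a rule relates a log line to a problem category, the matching step can be approximated in a few lines of Python. This is only a sketch: it uses Python's re module and substring search, which is an assumption, since the detector's own regular-expression engine and matching semantics are not described here. The RULES list is a subset of the kernel-monitor.json rules above with the JSON escaping removed, and match_rules is a hypothetical helper.

```python
import re

# Subset of the kernel-monitor.json rules, after JSON unescaping
# (for example "\\w" in the configuration map becomes \w here).
RULES = [
    {"type": "temporary", "reason": "TaskHung",
     "pattern": r"task \S+:\w+ blocked for more than \w+ seconds\."},
    {"type": "temporary", "reason": "KernelOops",
     "pattern": r"divide error: 0000 \[#\d+\] SMP"},
    {"type": "permanent", "condition": "KernelDeadlock", "reason": "DockerHung",
     "pattern": r"task docker:\w+ blocked for more than \w+ seconds\."},
]


def match_rules(log_line, rules=RULES):
    """Return (category, reason) for every rule whose pattern matches.

    Assumption: a rule matches when its pattern is found anywhere in
    the line. Temporary rules map to Event, permanent rules to
    NodeCondition.
    """
    results = []
    for rule in rules:
        if re.search(rule["pattern"], log_line):
            category = "NodeCondition" if rule["type"] == "permanent" else "Event"
            results.append((category, rule["reason"]))
    return results


if __name__ == "__main__":
    print(match_rules("kernel: divide error: 0000 [#0] SMP."))
    print(match_rules("kernel: task docker:7 blocked for more than 300 seconds."))
```

In this sketch, a hung docker task line satisfies both the generic TaskHung pattern and the more specific DockerHung pattern, so both are returned. Whether the detector itself reports both is not stated in this document.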
To configure the Node Problem Detector, add or remove problem conditions and events.
Edit the Node Problem Detector configuration map with a text editor.
oc edit configmap -n openshift-node-problem-detector node-problem-detector
Remove, add, or edit any node conditions or events as needed.
{ "type": <`temporary` or `permanent`>, "reason": <free-form text describing the error>, "pattern": <log message to watch for> },
For example:
{ "type": "temporary", "reason": "UnregisterNetDevice", "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+" },
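Before pasting a new rule into the configuration map, it can help to check that it is well formed. The following Python sketch is one way to do that, under stated assumptions: validate_rule is a hypothetical helper, the requirement that permanent rules carry a condition field is inferred from the sample configuration rather than documented, and the pattern is compiled with Python's re module, which may accept or reject expressions differently from the detector's own engine.

```python
import json
import re

ALLOWED_TYPES = {"temporary", "permanent"}


def validate_rule(rule_json):
    """Return a list of problems in one rule; an empty list means none found."""
    try:
        rule = json.loads(rule_json)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(rule, dict):
        return ["rule must be a JSON object"]

    problems = []
    if rule.get("type") not in ALLOWED_TYPES:
        problems.append("type must be 'temporary' or 'permanent'")
    # Assumption drawn from the sample configuration: permanent rules
    # name the node condition they set.
    if rule.get("type") == "permanent" and not rule.get("condition"):
        problems.append("permanent rules should name a condition")
    if not rule.get("reason"):
        problems.append("reason is missing")

    pattern = rule.get("pattern")
    if not isinstance(pattern, str) or not pattern:
        problems.append("pattern is missing")
    else:
        try:
            re.compile(pattern)
        except re.error as err:
            problems.append(f"pattern does not compile: {err}")
    return problems


# The example rule from above, as a raw string so the JSON escapes survive.
GOOD_RULE = r'''{ "type": "temporary", "reason": "UnregisterNetDevice",
  "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+" }'''

if __name__ == "__main__":
    print(validate_rule(GOOD_RULE))
```

The trailing comma that separates rules inside the rules array must be left out when a single rule is checked this way, because it is not part of the JSON object itself.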
Restart running pods to apply the changes. To restart pods, you can delete all existing pods:
# oc delete pods -n openshift-node-problem-detector -l name=node-problem-detector
To display Node Problem Detector output on standard output (stdout) and standard error (stderr), add the following to the container specification of the Node Problem Detector daemonset:
spec:
  template:
    spec:
      containers:
      - name: node-problem-detector
        command:
        - node-problem-detector
        - --alsologtostderr=true (1)
        - --log_dir="/tmp" (2)
        - --system-log-monitors=/etc/npd/kernel-monitor.json,/etc/npd/docker-monitor.json (3)
(1) Sends the log output to standard error (stderr) in addition to the log files.
(2) Path to the directory where log files are written.
(3) Comma-separated list of paths to the plug-in configuration files.
To verify that the Node Problem Detector is active:
Run the following command to get the name of the Node Problem Detector pod:
# oc get pods -n openshift-node-problem-detector

NAME                          READY   STATUS    RESTARTS   AGE
node-problem-detector-8z8r8   1/1     Running   0          1h
node-problem-detector-nggjv   1/1     Running   0          1h
Run the following command to view log information on the Node Problem Detector pod:
# oc logs -n openshift-node-problem-detector <pod_name>
The output should be similar to the following:
# oc logs -n openshift-node-problem-detector node-problem-detector-c6kng

I0416 23:22:00.641354 1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]
Test the Node Problem Detector by simulating an event on the node:
# echo "kernel: divide error: 0000 [#0] SMP." >> /dev/kmsg
Test the Node Problem Detector by simulating a condition on the node:
# echo "kernel: task docker:7 blocked for more than 300 seconds." >> /dev/kmsg
To uninstall the Node Problem Detector:
Add the following option to the Ansible inventory file:
[OSEv3:vars]
openshift_node_problem_detector_state=absent
Change to the playbook directory and run the config.yml Ansible playbook:
$ cd /usr/share/ansible/openshift-ansible
$ ansible-playbook playbooks/openshift-node-problem-detector/config.yml