How can node-problem-detector include pod information in the event it sends to the API server when a container is OOM-killed, so that the specific pod that hit OOM can be identified? #219

lee-lib · 2022-02-15T01:36:44Z

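Reference: the Alibaba Cloud Container Service (ACK) documentation, https://help.aliyun.com/knowledge_detail/178479.html
According to that document, the July 2020 image registry.aliyuncs.com/acs/node-problem-detector:v0.6.3-28-160499f can already add pod information to OOMKilling events.
I deployed node-problem-detector from this image, but when I simulate an OOM kill I still get no pod information, only Node-level information. The YAML manifest is as follows: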
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
  labels:
    app: node-problem-detector
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
      - name: node-problem-detector
        command:
        - /node-problem-detector
        - --logtostderr
        - --system-log-monitors=/config/kernel-monitor.json,/config/docker-monitor.json
        - --apiserver-override=http://192.168.1.228:8080?inClusterConfig=false
        image: registry.aliyuncs.com/acs/node-problem-detector:v0.6.3-28-160499f
        resources:
          limits:
            cpu: 10m
            memory: 80Mi
          requests:
            cpu: 10m
            memory: 80Mi
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /var/log
          readOnly: true
        - name: kmsg
          mountPath: /dev/kmsg
          readOnly: true
        # Make sure node problem detector is in the same timezone
        # with the host.
        - name: localtime
          mountPath: /etc/localtime
          readOnly: true
        - name: config
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        # Config log to your system log directory
        hostPath:
          path: /var/log/
      - name: kmsg
        hostPath:
          path: /dev/kmsg
      - name: localtime
        hostPath:
          path: /etc/localtime
      - name: config
        configMap:
          name: node-problem-detector-config
          items:
          - key: kernel-monitor.json
            path: kernel-monitor.json
          - key: docker-monitor.json
            path: docker-monitor.json

The log produced by the simulated OOM is shown below. involvedObject.kind is still Node, so no Pod information is available:
I0214 18:22:00.008987 1 mysql.go:73] {
  "metadata": {
    "name": "k8s-master.16d39fea6f67823e",
    "namespace": "default",
    "selfLink": "/api/v1/namespaces/default/events/k8s-master.16d39fea6f67823e",
    "uid": "9c9cdb3b-240a-40e8-9db2-b4490dfc4f42",
    "resourceVersion": "8981033",
    "creationTimestamp": "2022-02-14T10:21:58Z",
    "managedFields": [
      {
        "manager": "node-problem-detector",
        "operation": "Update",
        "apiVersion": "v1",
        "time": "2022-02-14T10:21:58Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:count": {},
          "f:firstTimestamp": {},
          "f:involvedObject": {
            "f:kind": {},
            "f:name": {},
            "f:uid": {}
          },
          "f:lastTimestamp": {},
          "f:message": {},
          "f:reason": {},
          "f:source": {
            "f:component": {},
            "f:host": {}
          },
          "f:type": {}
        }
      }
    ]
  },
  "involvedObject": {
    "kind": "Node",
    "name": "k8s-master",
    "uid": "k8s-master"
  },
  "reason": "OOMKilling",
  "message": "Memory cgroup out of memory: Kill process 2235 (stress) score 0 or sacrifice child\nKilled process 2235 (stress) total-vm:515612kB, anon-rss:168728kB, file-rss:32kB, shmem-rss:0kB",
  "source": {
    "component": "kernel-monitor",
    "host": "k8s-master"
  },
  "firstTimestamp": "2022-02-14T10:21:58Z",
  "lastTimestamp": "2022-02-14T10:21:58Z",
  "count": 1,
  "type": "Warning",
  "eventTime": null,
  "reportingComponent": "",
  "reportingInstance": ""
}
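
For anyone consuming these events downstream, a quick way to tell the two cases apart is to look at involvedObject.kind. The following is a minimal Go sketch, not kube-eventer code: the `event` and `involvedObject` structs and the `isPodScoped` helper are hypothetical names introduced here, and `sampleEvent` is a trimmed copy of the event shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// involvedObject mirrors the involvedObject field of a Kubernetes Event.
type involvedObject struct {
	Kind      string `json:"kind"`
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
	UID       string `json:"uid"`
}

// event holds only the fields this sketch inspects (hypothetical type).
type event struct {
	InvolvedObject involvedObject `json:"involvedObject"`
	Reason         string         `json:"reason"`
	Message        string         `json:"message"`
}

// isPodScoped reports whether the raw event JSON refers to a Pod.
func isPodScoped(raw []byte) (bool, error) {
	var ev event
	if err := json.Unmarshal(raw, &ev); err != nil {
		return false, err
	}
	return ev.InvolvedObject.Kind == "Pod", nil
}

// sampleEvent is a trimmed copy of the Node-level event from this issue.
const sampleEvent = `{
  "involvedObject": {"kind": "Node", "name": "k8s-master", "uid": "k8s-master"},
  "reason": "OOMKilling",
  "message": "Killed process 2235 (stress) total-vm:515612kB, anon-rss:168728kB, file-rss:32kB, shmem-rss:0kB"
}`

func main() {
	ok, err := isPodScoped([]byte(sampleEvent))
	if err != nil {
		panic(err)
	}
	fmt.Println("pod-scoped:", ok)
}
```

With the event above, the check reports that the event is not Pod-scoped, which is the behaviour described in this issue.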

How should node-problem-detector.yaml and kube-eventer.yaml be configured so that Pod information is available when a container is OOM-killed?

ringtail · 2022-05-07T09:24:48Z

Alibaba Cloud implements this feature in its own NPD build. The community version currently reports OOM as a Warning event at the Node level.

crushCoin · 2022-05-11T01:36:50Z

The cause is the pod UID regular expression at https://github.com/AliyunContainerService/node-problem-detector/blob/master/pkg/systemlogmonitor/log_monitor.go#L67
It works when the Cgroup Driver is systemd, but not when the Cgroup Driver is cgroupfs.
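
To illustrate why the driver matters, here is a minimal Go sketch. It assumes the usual kubelet cgroup naming, where the systemd driver writes the pod UID with underscores and the cgroupfs driver keeps the dashes. The sample paths, the UID, and the names `underscoreUID`, `eitherUID`, and `podUID` are all hypothetical; the underscore-only pattern is my assumption about what an expression that only works under systemd would look like, not a copy of the linked line.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// underscoreUID matches a pod UID in the underscore form (assumed systemd style).
var underscoreUID = regexp.MustCompile(`[0-9a-f]{8}_[0-9a-f]{4}_[0-9a-f]{4}_[0-9a-f]{4}_[0-9a-f]{12}`)

// eitherUID is a compact pattern that accepts either separator.
var eitherUID = regexp.MustCompile(`[0-9a-f]{8}[_-][0-9a-f]{4}[_-][0-9a-f]{4}[_-][0-9a-f]{4}[_-][0-9a-f]{12}`)

// podUID extracts the pod UID from a cgroup path and normalizes it to the
// dashed form that the Kubernetes API uses.
func podUID(cgroupPath string) (string, bool) {
	m := eitherUID.FindString(cgroupPath)
	if m == "" {
		return "", false
	}
	return strings.ReplaceAll(m, "_", "-"), true
}

func main() {
	// Illustrative paths only; the UID is made up.
	paths := []string{
		"/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod01234567_89ab_cdef_0123_456789abcdef.slice",
		"/kubepods/burstable/pod01234567-89ab-cdef-0123-456789abcdef/abc",
	}
	for _, p := range paths {
		uid, ok := podUID(p)
		fmt.Println(underscoreUID.MatchString(p), ok, uid)
	}
}
```

The underscore-only pattern finds nothing in the cgroupfs-style path, while the pattern that accepts both separators yields the same dashed UID for both paths.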
To fix it, modify the source and recompile (git clone -b alibabacloud-v0.8.10 https://githubfast.com/AliyunContainerService/node-problem-detector.git):
1. Change the regex to uuidRegx = regexp.MustCompile("[0-9a-f]{8}_[0-9a-f]{4}_[0-9a-f]{4}_[0-9a-f]{4}_[0-9a-f]{12}|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
2. At https://github.com/AliyunContainerService/node-problem-detector/blob/master/pkg/systemlogmonitor/log_monitor.go#L216
change the message to the following (this lets the PodOOMKilling and OOMKilling logs be reported together):
message = fmt.Sprintf("pod was OOM killed. node:%s pod:%s namespace:%s uuid:%s\n",
    pod.Spec.NodeName, pod.Name, pod.Namespace, uuid) + message
3. Modify kernel-monitor.json ("bufferSize": 30 is the length of the ring buffer used for log matching; it can be increased as needed).
Add the following to rules:
{
  "type": "temporary",
  "reason": "PodOOMKilling",
  "pattern": "Task in /kubepods(.+) killed as a result of limit of .*(\\n.*)+ Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*"
}

Some escape characters in the snippets above may have been lost in rendering; see the image for the exact text.
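
The PodOOMKilling rule spans several kernel log lines, which is why bufferSize matters: the pattern can only match if the "Task in /kubepods..." line is still in the buffer when the "Killed process" line arrives. Below is a minimal Go sketch of that idea, not NPD's actual implementation. The `lineBuffer` type and `matchesWithBuffer` helper are hypothetical, the pattern is my reading of the rule with escapes written in Go regexp syntax (an assumption), and the first two sample log lines are illustrative; the last two are taken from the event message in this issue.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// lineBuffer keeps the most recent `size` log lines (hypothetical helper).
type lineBuffer struct {
	lines []string
	size  int
}

func newLineBuffer(size int) *lineBuffer { return &lineBuffer{size: size} }

// push appends a line and drops the oldest lines once the buffer is full.
func (b *lineBuffer) push(line string) {
	b.lines = append(b.lines, line)
	if len(b.lines) > b.size {
		b.lines = b.lines[len(b.lines)-b.size:]
	}
}

// match reports whether re matches the buffered lines joined by newlines.
func (b *lineBuffer) match(re *regexp.Regexp) bool {
	return re.MatchString(strings.Join(b.lines, "\n"))
}

// podOOM is an assumed reconstruction of the PodOOMKilling pattern.
var podOOM = regexp.MustCompile(`Task in /kubepods(.+) killed as a result of limit of .*(\n.*)+ Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*`)

// sampleLog: the first two lines are made up for illustration.
var sampleLog = []string{
	"Task in /kubepods/burstable/pod01234567-89ab-cdef-0123-456789abcdef/abc killed as a result of limit of /kubepods/burstable/pod01234567-89ab-cdef-0123-456789abcdef",
	"memory: usage 204800kB, limit 204800kB, failcnt 42",
	"Memory cgroup out of memory: Kill process 2235 (stress) score 0 or sacrifice child",
	"Killed process 2235 (stress) total-vm:515612kB, anon-rss:168728kB, file-rss:32kB, shmem-rss:0kB",
}

// matchesWithBuffer replays sampleLog through a buffer of the given size.
func matchesWithBuffer(size int) bool {
	buf := newLineBuffer(size)
	for _, l := range sampleLog {
		buf.push(l)
	}
	return buf.match(podOOM)
}

func main() {
	fmt.Println("bufferSize 4:", matchesWithBuffer(4))
	fmt.Println("bufferSize 2:", matchesWithBuffer(2))
}
```

In this sketch the pattern matches only when the buffer is large enough to hold all four lines. Real kernel OOM reports put many more lines between the "Task in" line and the "Killed process" line, so a larger bufferSize leaves more headroom.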

ringtail closed this as completed May 31, 2023
