When a container is OOM-killed, how can node-problem-detector add pod information to the event it sends to the API server, so that the specific pod that hit OOM can be identified? #219

Closed
lee-lib opened this issue Feb 15, 2022 · 2 comments

Comments


lee-lib commented Feb 15, 2022

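Reference: the Alibaba Cloud Container Service (ACK) documentation at https://help.aliyun.com/knowledge_detail/178479.html
The documentation says that, as of the July 2020 image version registry.aliyuncs.com/acs/node-problem-detector:v0.6.3-28-160499f, pod information can be added to OOMKilling events.
I deployed node-problem-detector from this image version, but when I simulate an OOM kill the event still carries no pod information, only Node-level information. The YAML file is as follows: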
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
  labels:
    app: node-problem-detector
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
      - name: node-problem-detector
        command:
        - /node-problem-detector
        - --logtostderr
        - --system-log-monitors=/config/kernel-monitor.json,/config/docker-monitor.json
        - --apiserver-override=http://192.168.1.228:8080?inClusterConfig=false
        image: registry.aliyuncs.com/acs/node-problem-detector:v0.6.3-28-160499f
        resources:
          limits:
            cpu: 10m
            memory: 80Mi
          requests:
            cpu: 10m
            memory: 80Mi
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /var/log
          readOnly: true
        - name: kmsg
          mountPath: /dev/kmsg
          readOnly: true
        # Make sure node problem detector is in the same timezone
        # with the host.
        - name: localtime
          mountPath: /etc/localtime
          readOnly: true
        - name: config
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        # Config log to your system log directory
        hostPath:
          path: /var/log/
      - name: kmsg
        hostPath:
          path: /dev/kmsg
      - name: localtime
        hostPath:
          path: /etc/localtime
      - name: config
        configMap:
          name: node-problem-detector-config
          items:
          - key: kernel-monitor.json
            path: kernel-monitor.json
          - key: docker-monitor.json
            path: docker-monitor.json

The log produced by the simulated OOM is shown below. involvedObject.kind is still Node, so no Pod information is available.
I0214 18:22:00.008987 1 mysql.go:73] {
  "metadata": {
    "name": "k8s-master.16d39fea6f67823e",
    "namespace": "default",
    "selfLink": "/api/v1/namespaces/default/events/k8s-master.16d39fea6f67823e",
    "uid": "9c9cdb3b-240a-40e8-9db2-b4490dfc4f42",
    "resourceVersion": "8981033",
    "creationTimestamp": "2022-02-14T10:21:58Z",
    "managedFields": [
      {
        "manager": "node-problem-detector",
        "operation": "Update",
        "apiVersion": "v1",
        "time": "2022-02-14T10:21:58Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:count": {},
          "f:firstTimestamp": {},
          "f:involvedObject": {
            "f:kind": {},
            "f:name": {},
            "f:uid": {}
          },
          "f:lastTimestamp": {},
          "f:message": {},
          "f:reason": {},
          "f:source": {
            "f:component": {},
            "f:host": {}
          },
          "f:type": {}
        }
      }
    ]
  },
  "involvedObject": {
    "kind": "Node",
    "name": "k8s-master",
    "uid": "k8s-master"
  },
  "reason": "OOMKilling",
  "message": "Memory cgroup out of memory: Kill process 2235 (stress) score 0 or sacrifice child\nKilled process 2235 (stress) total-vm:515612kB, anon-rss:168728kB, file-rss:32kB, shmem-rss:0kB",
  "source": {
    "component": "kernel-monitor",
    "host": "k8s-master"
  },
  "firstTimestamp": "2022-02-14T10:21:58Z",
  "lastTimestamp": "2022-02-14T10:21:58Z",
  "count": 1,
  "type": "Warning",
  "eventTime": null,
  "reportingComponent": "",
  "reportingInstance": ""
}

A question for the maintainers: when a container is OOM-killed, how should node-problem-detector.yaml and kube-eventer.yaml be configured so that the Pod information is captured?
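
To check programmatically whether an event is scoped to a Pod or only to a Node, one option is to decode it and inspect involvedObject.kind. The following is a minimal sketch, not part of node-problem-detector or kube-eventer: the `event` struct and the `involvedKind` helper are hypothetical names introduced here, and the input is a trimmed copy of the event logged above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event mirrors only the Event fields that matter for this check.
// It is a hypothetical helper type, not the Kubernetes API type.
type event struct {
	InvolvedObject struct {
		Kind string `json:"kind"`
		Name string `json:"name"`
	} `json:"involvedObject"`
	Reason string `json:"reason"`
}

// involvedKind decodes a serialized event and returns involvedObject.kind.
func involvedKind(raw []byte) (string, error) {
	var ev event
	if err := json.Unmarshal(raw, &ev); err != nil {
		return "", err
	}
	return ev.InvolvedObject.Kind, nil
}

func main() {
	// Trimmed copy of the event from the log above.
	raw := []byte(`{"involvedObject":{"kind":"Node","name":"k8s-master","uid":"k8s-master"},"reason":"OOMKilling"}`)

	kind, err := involvedKind(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	// podScoped is true only when the event points at a Pod.
	fmt.Printf("involvedObject.kind=%s podScoped=%t\n", kind, kind == "Pod")
}
```

For the event in this report the kind is Node, which is why the consumer has nothing to map back to a specific pod.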

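lee-lib changed the title from "How to add pod information to the event that node-problem-detector sends to the API server when a container is OOM-killed, so that the specific pod that hit OOM can be identified" to "When a container is OOM-killed, how can node-problem-detector add pod information to the event it sends to the API server, so that the specific pod that hit OOM can be identified?" Feb 15, 2022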
Member

ringtail commented May 7, 2022

Alibaba Cloud implements this feature in its own NPD build. The community version currently reports OOM as a Warning event at the Node level.


crushCoin commented May 11, 2022

The cause is the pod UID regular expression here: https://github.com/AliyunContainerService/node-problem-detector/blob/master/pkg/systemlogmonitor/log_monitor.go#L67
It works when the Cgroup Driver is systemd, but not when the Cgroup Driver is cgroupfs.
Modify the source and rebuild (git clone -b alibabacloud-v0.8.10 https://githubfast.com/AliyunContainerService/node-problem-detector.git):
1. Change the regular expression to
uuidRegx = regexp.MustCompile("[0-9a-f]{8}[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{12}|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
2. At https://github.com/AliyunContainerService/node-problem-detector/blob/master/pkg/systemlogmonitor/log_monitor.go#L216, change the code to the following (this puts the PodOOMKilling and OOMKilling log text together in one message):
message = fmt.Sprintf("pod was OOM killed. node:%s pod:%s namespace:%s uuid:%s\n",
    pod.Spec.NodeName, pod.Name, pod.Namespace, uuid) + message
3. Modify kernel-monitor.json ("bufferSize": 30 is the length of the ring buffer used for log matching, and can be increased as needed). Add the following to rules:
{
  "type": "temporary",
  "reason": "PodOOMKilling",
  "pattern": "Task in /kubepods(.+) killed as a result of limit of .(\n.)+ Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*"
}

The pattern above was altered by escaping when it was posted; see the screenshot.
[Image: 企业微信截图_16522332592620]
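
Steps 1 and 2 can be tried in isolation with a small Go sketch. It reuses the regular expression and the message format from the comment above. Everything else is an assumption made for illustration: `podInfo` is a hypothetical stand-in for the Kubernetes Pod object, `extractPodUID` and `annotate` are hypothetical helper names, and the sample kernel log line is invented to resemble a cgroupfs-style path with a dashed pod UID. This is not the actual node-problem-detector code.

```go
package main

import (
	"fmt"
	"regexp"
)

// uuidRegx is the expression from step 1: either 32 hex digits with no
// separators, or the dashed 8-4-4-4-12 form.
var uuidRegx = regexp.MustCompile("[0-9a-f]{8}[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{12}|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

// podInfo is a hypothetical stand-in for the fields read from the Pod object.
type podInfo struct {
	NodeName  string
	Name      string
	Namespace string
}

// extractPodUID returns the leftmost UID-shaped match in a log line,
// or the empty string when nothing matches.
func extractPodUID(line string) string {
	return uuidRegx.FindString(line)
}

// annotate prepends the pod details to the original kernel message,
// using the format string from step 2.
func annotate(p podInfo, uuid, message string) string {
	return fmt.Sprintf("pod was OOM killed. node:%s pod:%s namespace:%s uuid:%s\n",
		p.NodeName, p.Name, p.Namespace, uuid) + message
}

func main() {
	// Invented sample line; the pod UID here is made up.
	line := "Task in /kubepods/burstable/pod0f3a1b2c-4d5e-6f70-8192-a3b4c5d6e7f8/c1 killed as a result of limit"

	uid := extractPodUID(line)
	pod := podInfo{NodeName: "k8s-master", Name: "stress", Namespace: "default"}
	fmt.Println(annotate(pod, uid, "Killed process 2235 (stress)"))
}
```

Note that `FindString` returns the leftmost match, so on a line that also contains a long hexadecimal container ID the result depends on which identifier appears first. It is worth testing the expression against real log lines from both cgroup drivers before rebuilding.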
