Skip to content

CPU metrics are missing after running for several hours #2660

Closed
@tony612

Description

@tony612

I found the CPU metrics of a container will be 0 after running for several hours. The processes and cgroup of the container changes.

I can reproduce this using kops 1.5.3:

$ kops version
Version 1.5.3 (git-46364f6)
$ kops create cluster --cloud=aws --zones ap-northeast-1a ${NAME}
# just changed the nodes number to 1 from 2
$ kops update cluster ${NAME} --yes
(AMI: k8s-1.5-debian-jessie-amd64-hvm-ebs-2017-05-02 (ami-56695f31))

Then I save some useful(maybe) info:

$ sudo docker info
Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 3
Server Version: 1.12.3
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.65-k8s
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.862 GiB
Name: ip-172-20-36-154
ID: J2K6:5YGH:LAHZ:DMRU:W2XB:OILL:EH53:WANY:UK4J:7UIY:ULVP:44PQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
WARNING: No kernel memory limit support
Insecure Registries:
 127.0.0.0/8

$ sudo docker ps
CONTAINER ID        IMAGE                                        COMMAND                  CREATED             STATUS              PORTS               NAMES
04ce9bd6939b        gcr.io/google_containers/kube-proxy:v1.5.7   "/bin/sh -c 'echo -99"   16 minutes ago      Up 16 minutes                           k8s_kube-proxy.2c2d203f_kube-proxy-ip-172-20-36-154.ap-northeast-1.compute.internal_kube-system_7b35e1830aa0d2f2e3e3bf9fbae75ac5_3da5329f
$ sudo docker inspect -f '{{.State.Pid}}' 04ce9bd6939b
1510

$ cat /proc/1510/cgroup
10:pids:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
9:perf_event:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
8:net_cls,net_prio:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
7:freezer:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
6:devices:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
5:memory:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
4:blkio:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
3:cpu,cpuacct:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
2:cpuset:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
1:name=systemd:/system.slice/docker.service/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16

$ sudo docker top 04ce9bd6939b
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                1510                1495                0                   03:10               ?                   00:00:00            /bin/sh -c echo -998 > /proc/$$/oom_score_adj && kube-proxy --kubeconfig=/var/lib/kube-proxy/kubeconfig --resource-container="" --master=https://api.internal.test0.llstestk8s.com --v=2 1>>/var/log/kube-proxy.log 2>&1
root                1520                1510                0                   03:10               ?                   00:00:02            kube-proxy --kubeconfig=/var/lib/kube-proxy/kubeconfig --resource-container= --master=https://api.internal.test0.llstestk8s.com --v=2

The cAdvisor result is right:
cadvisor0

Then I run this script to monitor the cgroup changes:

#!/bin/bash

while :
do
  l=$(grep docker /proc/1510/cgroup | wc -l)
  if [ $l != 10 ]; then
    # send notification to Slack
  else
    printf "."
  fi
  sleep 60
done

After about 4 hours(May 31 06:50 UTC), I got the slack notification, and I checked the corresponding info and some logs:

$ cat /proc/1510/cgroup
10:pids:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
9:perf_event:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
8:net_cls,net_prio:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
7:freezer:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
6:devices:/
5:memory:/
4:blkio:/
3:cpu,cpuacct:/
2:cpuset:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
1:name=systemd:/system.slice/docker.service/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16

$ sudo docker top 04ce9bd6939b
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD

$ less /var/log/syslog
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Stopping RPC Port Mapper.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Stopped target RPC Port Mapper.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Stopping LSB: RPC portmapper replacement...
May 31 06:50:18 ip-172-20-36-154 rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
May 31 06:50:18 ip-172-20-36-154 rpcbind[23556]: Stopping rpcbind daemon....
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Stopped LSB: RPC portmapper replacement.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:19 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:19 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:19 ip-172-20-36-154 systemd[1]: Stopped Provide limited super user privileges to specific users.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Starting LSB: RPC portmapper replacement...
May 31 06:50:20 ip-172-20-36-154 rpcbind[24955]: Starting rpcbind daemon....
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Started LSB: RPC portmapper replacement.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Starting RPC Port Mapper.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Reached target RPC Port Mapper.

The cAdvisor result is wrong now:
cadvisor1

Any idea?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions