/usr/bin/containerd-shim-kata-v2 processes persist and consume heavy CPU after corresponding containers exit #2719
Comments
Thanks, I noticed that when a pod using shimv2 is stopped, it does take several seconds for the shimv2 process to disappear. I'll take a look at this one in the next few days.
After a while, the load drops, but the processes do not go away.
This is many minutes.
The containerd-shim processes only start consuming a lot of CPU when the underlying containers exit (with this test, all of the containers exit more or less simultaneously).
Running … continuously.
Turns out this seems to be an issue in CRI-O. I'm sending them a PR and I'd appreciate it if you could give it a try to ensure it solves your issue.
When shutting the container down, we're dealing with the following piece of code on the Kata side: https://github.com/kata-containers/runtime/blob/master/containerd-shim-v2/service.go#L785

```
func (s *service) Shutdown(ctx context.Context, r *taskAPI.ShutdownRequest) (_ *ptypes.Empty, err error) {
	defer func() {
		err = toGRPC(err)
	}()

	s.mu.Lock()
	if len(s.containers) != 0 {
		s.mu.Unlock()
		return empty, nil
	}
	s.mu.Unlock()

	s.cancel()

	os.Exit(0)

	// This will never be called, but this is only there to make sure the
	// program can compile.
	return empty, nil
}
```

The code shown above simply stops the service, closing the ttrpc channel and then raising the "ErrClosed" error, which is returned by Shutdown. Unlike the containerd code, which simply ignores the error, CRI-O propagates it, leaving behind a bunch of processes that will never be cleaned up. Here's what containerd does: https://github.com/containerd/containerd/blob/master/runtime/v2/shim.go#L194

```
_, err := s.task.Shutdown(ctx, &task.ShutdownRequest{
	ID: s.ID(),
})
if err != nil && !errors.Is(err, ttrpc.ErrClosed) {
	return errdefs.FromGRPC(err)
}
```

Knowing that, let's mimic what containerd does and ignore the error in this specific case.

Related: kata-containers/runtime#2719

Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
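For context, the caller-side pattern being copied from containerd looks roughly like the sketch below. This is not CRI-O's actual code: the package, function name, and wrapper are illustrative, while `TaskService.Shutdown` and `ttrpc.ErrClosed` are the real containerd/ttrpc APIs quoted above.

```
package shimclient

import (
	"context"
	"errors"

	taskAPI "github.com/containerd/containerd/runtime/v2/task"
	"github.com/containerd/ttrpc"
)

// shutdownShim asks the shim to exit and treats a closed ttrpc connection as
// success: the shim calls os.Exit(0) before the Shutdown response is written,
// so the client sees ttrpc.ErrClosed even though the shutdown worked.
func shutdownShim(ctx context.Context, client taskAPI.TaskService, id string) error {
	_, err := client.Shutdown(ctx, &taskAPI.ShutdownRequest{ID: id})
	if err != nil && !errors.Is(err, ttrpc.ErrClosed) {
		return err
	}
	return nil
}
```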
When the container finishes its execution, a containerd-shim-kata-v2 process is left behind (only when using CRI-O). The reason seems to be that CRI-O does not clean up the process when the container changes its state from running to stopped. The most reasonable way found to perform such cleanup is to take advantage of the goroutine used to update the container status and do the cleanup there, whenever it's needed.

Related: kata-containers/runtime#2719

Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
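As a rough illustration of that approach — hypothetical names throughout; CRI-O's real status-update goroutine and runtime interfaces differ:

```
package statuswatch

import "context"

// shimRuntime is a hypothetical stand-in for whatever component can ask the
// shimv2 process backing a container to exit.
type shimRuntime interface {
	ShutdownShim(ctx context.Context, containerID string) error
}

// watchStatus consumes container state updates and reaps the leftover
// containerd-shim-kata-v2 process as soon as the container stops, instead of
// waiting for the pod to be deleted.
func watchStatus(ctx context.Context, containerID string, r shimRuntime, updates <-chan string) {
	for state := range updates {
		if state == "stopped" {
			_ = r.ShutdownShim(ctx, containerID)
		}
	}
}
```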
I'm closing this one as all the patches ended up being merged into CRI-O.
The two commit messages above also appear in backports to release branches (cherry picked from commits 814c1bb, 45b778d, and adb657c).
fidencio/cri-o@345c016 is leading to problems where we remove the container on a successful container exit, and the kubelet tries to remove the container (but it doesn't exist in the ctrs map).
Do we know why the shim uses so much CPU when the containers die? I wonder if it's busy-looping.
When a pod using the VM runtime type stops, the actual runtime process should also be stopped. Previously, the runtime process was killed when the pod was deleted. This works well for many workloads, but causes process leaks when large numbers of one-shot pods are created (e.g. pods that enter the "Completed" state in Kubernetes). Those pods will eventually be cleaned up, but until then a large number of runtime processes will hang around. The situation is made worse when the runtime process enters a bad state after its container dies (see kata-containers/runtime#2719).

Initially, this problem was addressed in cri-o#3998. However, that PR worked by actually deleting VM runtime containers on stop, which led to issues with pods being stuck in a NotReady state indefinitely. This PR re-addresses the issue solved by #3998 by sending a shutdown task to the runtime on pod stop.

Signed-off-by: Evan Foster <efoster@adobe.com>
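A minimal sketch of the behavior change described there, assuming hypothetical sandbox and runtime types (CRI-O's real APIs differ): the shim is shut down when the sandbox stops, not only when it is deleted.

```
package podstop

import "context"

// sandbox and runtime are hypothetical stand-ins for CRI-O's real types.
type sandbox struct {
	id          string
	isVMRuntime bool
}

type runtime interface {
	// ShutdownSandboxShim tells the VM runtime (shim) process for this
	// sandbox to exit.
	ShutdownSandboxShim(ctx context.Context, sandboxID string) error
}

// stopPodSandbox shuts the shim down at sandbox stop time, so one-shot pods
// that sit in "Completed" no longer pin a runtime process until deletion.
func stopPodSandbox(ctx context.Context, r runtime, sb *sandbox) error {
	// ... stop containers and tear down networking (omitted) ...
	if sb.isVMRuntime {
		return r.ShutdownSandboxShim(ctx, sb.id)
	}
	return nil
}
```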
When a one-shot pod dies in CRI-O, the shimv2 process isn't killed until the pod is actually deleted, even though the VM is shut down. In this case, the shim appears to busyloop when attempting to talk to the (now dead) agent via VSOCK. This is especially catastrophic for one-shot pods that may persist for hours or days, but it also applies to any shimv2 pod where Kata is configured to use VSOCK for communication. To address this, we disconnect from the agent after the VM is shut down.

Fixes github.com/kata-containers/runtime#2719

Signed-off-by: Evan Foster <efoster@adobe.com>
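A sketch of the Kata-side fix, under stated assumptions: the hypervisor and agent interfaces below are hypothetical, not Kata's exact types, but the ordering (stop the VM, then drop the agent connection) matches the fix described above.

```
package vmstop

import "context"

// hypervisor and agent are hypothetical stand-ins for Kata's real types.
type hypervisor interface {
	StopVM(ctx context.Context) error
}

type agent interface {
	// Disconnect closes the VSOCK connection to the guest agent.
	Disconnect() error
}

// stopSandboxVM stops the VM and then drops the agent connection, so the shim
// doesn't keep retrying VSOCK dials to a dead agent in a tight loop.
func stopSandboxVM(ctx context.Context, h hypervisor, a agent) error {
	if err := h.StopVM(ctx); err != nil {
		return err
	}
	return a.Disconnect()
}
```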
It was basically busy-looping when trying to talk to the agent in the VM that had been deleted. Assuming it passes review, I'll be backporting kata-containers/kata-containers#556 to Kata 1.X. For fun, here's the call graph from profiling: [call graph image]

EDIT: I should also note that I think this is the same problem reported in #1917, but focused on shutdown instead of startup.
The same fix was subsequently backported to the kata-containers/runtime master and 1.11 branches (see kata-containers/kata-containers#556; cherry picked from commit 227cba6).
Description of problem
Ran a test that created 40 CPU soaker containers on a 32-core/64-thread system with 192 GB of RAM.
Using the `clusterbuster` tool from https://github.com/RobertKrawitz/OpenShift4-tools, which I ran a number of times:

```
clusterbuster -b 5 -p 5 -P soaker -T pod -Y -e --container-resource-request=cpu=10m -v -N 1 -d 40 -r 1 -t 30 --report --cleanup -Q --kata
```

I observed that the node in question retained hundreds of containerd-shim-kata-v2 processes that were consuming a large amount of CPU. Accompanying this were wide variations in the CPU utilization reported by the containers, in the work (simple loop iterations) accomplished by the containers, and in the amount of time required for the containers to start from first to last.
Without use of Kata, it took about 1.1 seconds between the first and last container to start, achieved 3986% CPU utilization (within 0.5% of max), and achieved about 613M loop iterations/sec. With Kata, the first run was very close to that (3.9 seconds for the containers to start, 3984% CPU utilization, and 603M loop iterations/second). Running this in a loop 5 times, all of these numbers got worse. By the final iteration, I observed 506% CPU utilization, 45 seconds between the first and last pod (in some cases I saw much more), and about 69M loop iterations/second.
Expected result
Performance remaining consistent across multiple runs, with no accumulation of processes.
Actual result
See attached:
kata-log.txt