This repository has been archived by the owner on May 12, 2021. It is now read-only.

/usr/bin/containerd-shim-kata-v2 processes persist and consume heavy CPU after corresponding containers exit #2719

Closed
RobertKrawitz opened this issue Jun 1, 2020 · 11 comments · Fixed by kata-containers/kata-containers#556 or #2916
Labels
bug Incorrect behaviour

Comments

@RobertKrawitz

Description of problem

Ran a test that created 40 CPU-soaker containers on a 32-core/64-thread system with 192 GB RAM.

Using the clusterbuster tool from https://github.com/RobertKrawitz/OpenShift4-tools, I ran the following command a number of times:

clusterbuster -b 5 -p 5 -P soaker -T pod -Y -e --container-resource-request=cpu=10m -v -N 1 -d 40 -r 1 -t 30 --report --cleanup -Q --kata

I observed that the node in question retained hundreds of containerd-shim-kata-v2 processes, all consuming a large amount of CPU. Accompanying this were wide variations in the CPU utilization reported by the containers, in the work (simple loop iterations) accomplished by the containers, and in the time the containers took to start, from first to last.

Without Kata, about 1.1 seconds elapsed between the first and last container starting, the test achieved 3986% CPU utilization (within 0.5% of the maximum), and it sustained about 613M loop iterations/sec. With Kata, the first run was very close to that (3.9 seconds for the containers to start, 3984% CPU utilization, and 603M loop iterations/second). Running this in a loop 5 times, all of these numbers got worse: by the final iteration, I observed 506% CPU utilization, 45 seconds between the first and last pod (in some cases much more), and about 69M loop iterations/second.

Expected result

Performance remains consistent across multiple runs, with no accumulation of processes.

Actual result

See attached:
kata-log.txt

@RobertKrawitz RobertKrawitz added bug Incorrect behaviour needs-review Needs to be assessed by the team. labels Jun 1, 2020
@RobertKrawitz
Author

@fidencio @haircommander

@fidencio
Member

fidencio commented Jun 1, 2020

Thanks. I noticed that when a pod using shimv2 is stopped, it does take several seconds for the shimv2 process to disappear.

I'll take a look at this one in the next few days.

@RobertKrawitz
Author

After a while, the load drops, but the processes do not go away.

@RobertKrawitz
Author

This persists for many minutes.

@RobertKrawitz
Author

The containerd-shim processes only start consuming a lot of CPU when the underlying containers exit (with this test, all of the containers exit more or less simultaneously).

@RobertKrawitz
Author

Running strace on one of them shows this:

```
[pid 834077] connect(9, {sa_family=AF_VSOCK, sa_data="\0\0\0\4\0\0c0\20\252\0\0\0\0"}, 16) = -1 ENODEV (No such device)
[pid 834077] close(9)                   = 0
[pid 834077] socket(AF_VSOCK, SOCK_STREAM, 0) = 9
[pid 834077] connect(9, {sa_family=AF_VSOCK, sa_data="\0\0\0\4\0\0c0\20\252\0\0\0\0"}, 16) = -1 ENODEV (No such device)
[pid 834077] close(9)                   = 0
```
continuously.
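
For illustration only, here is a minimal, self-contained Go sketch (not the actual shim code; the CID and port values are placeholders) of the kind of unthrottled VSOCK reconnect loop that would produce exactly this trace: once the VM is gone, every connect(AF_VSOCK) fails immediately with ENODEV and the loop just spins.

```
// Minimal illustration only (NOT the actual shim code; CID and port are
// placeholder values) of an unthrottled VSOCK reconnect loop: once the VM
// has been shut down, every connect(AF_VSOCK) fails immediately with ENODEV
// and the loop spins, burning CPU.
package main

import (
	"errors"
	"log"
	"time"

	"golang.org/x/sys/unix"
)

func dialAgent(cid, port uint32) (int, error) {
	fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
	if err != nil {
		return -1, err
	}
	if err := unix.Connect(fd, &unix.SockaddrVM{CID: cid, Port: port}); err != nil {
		unix.Close(fd)
		return -1, err
	}
	return fd, nil
}

func main() {
	const guestCID, agentPort = 3, 1024 // placeholders for the guest CID and agent port

	for {
		fd, err := dialAgent(guestCID, agentPort)
		if err != nil {
			if errors.Is(err, unix.ENODEV) {
				// No backoff and no "the VM is gone, give up" check:
				// this branch is hit continuously after VM shutdown.
				continue
			}
			log.Printf("connect: %v", err)
			time.Sleep(time.Second)
			continue
		}
		// Connected; in the real shim this connection would carry ttrpc traffic.
		unix.Close(fd)
		return
	}
}
```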

@fidencio
Member

fidencio commented Jul 2, 2020

It turns out this seems to be an issue in CRI-O. I'm sending them a PR, and I'd appreciate it if you could give it a try to confirm that it solves your issue.

fidencio added a commit to fidencio/cri-o that referenced this issue Jul 2, 2020
When shutting the container down, we're dealing with the following piece
of code on Kata side:
https://github.com/kata-containers/runtime/blob/master/containerd-shim-v2/service.go#L785
```
func (s *service) Shutdown(ctx context.Context, r *taskAPI.ShutdownRequest) (_ *ptypes.Empty, err error) {
	defer func() {
		err = toGRPC(err)
	}()

	s.mu.Lock()
	if len(s.containers) != 0 {
		s.mu.Unlock()
		return empty, nil
	}
	s.mu.Unlock()

	s.cancel()

	os.Exit(0)

	// This will never be called, but this is only there to make sure the
	// program can compile.
	return empty, nil
}
```

The code shown above simply stops the service, closing the ttrpc
channel and then raising the "ErrClosed" error, which is returned by
Shutdown.

Unlike the containerd code, which simply ignores the error, CRI-O
propagates it, leaving behind a bunch of processes that will never
be cleaned up.

Here's what containerd does:
https://github.com/containerd/containerd/blob/master/runtime/v2/shim.go#L194
```
        _, err := s.task.Shutdown(ctx, &task.ShutdownRequest{
                ID: s.ID(),
        })
        if err != nil && !errors.Is(err, ttrpc.ErrClosed) {
                return errdefs.FromGRPC(err)
        }
```

Knowing that, let's mimic what's been done by containerd and ignore the
error in this specific case.

Related: kata-containers/runtime#2719

Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
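
A hedged sketch of the error handling the commit message above describes, at a hypothetical call site (shutdownTask is an invented name, not the actual CRI-O function; the containerd and ttrpc imports are the ones used in the quoted containerd snippet): treat ttrpc.ErrClosed from the Shutdown RPC as success, since the Kata shim calls os.Exit(0) before the response can make it back over the now-closed ttrpc channel.

```
// shutdownTask is an invented helper, not CRI-O's actual code; it only shows
// the error handling described above: ErrClosed from the ttrpc channel means
// the shim already exited, which is exactly what a shutdown request wants.
package shimshutdown

import (
	"context"
	"errors"

	"github.com/containerd/containerd/errdefs"
	taskapi "github.com/containerd/containerd/runtime/v2/task"
	"github.com/containerd/ttrpc"
)

func shutdownTask(ctx context.Context, task taskapi.TaskService, id string) error {
	_, err := task.Shutdown(ctx, &taskapi.ShutdownRequest{ID: id})
	if err != nil && !errors.Is(err, ttrpc.ErrClosed) {
		return errdefs.FromGRPC(err)
	}
	// Do not propagate ErrClosed; the shim process is gone, so there is
	// nothing left to clean up on this path.
	return nil
}
```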
fidencio added a commit to fidencio/cri-o that referenced this issue Jul 7, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Jul 11, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Jul 14, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Jul 15, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Jul 15, 2020
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cri-o that referenced this issue Jul 17, 2020
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cri-o that referenced this issue Jul 17, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Jul 22, 2020
When the container finishes its execution a containerd-shim-kata-v2
process is left behind (only when using CRI-O).

The reason for that seems to be CRI-O not cleaning up the process
whenever the container has changed its state from running to
stopped.

The most reasonable way found to perform such cleanup seems to be taking
advantage of the goroutine used to update the container status and
performing the cleanup there, whenever it's needed.

Related: kata-containers/runtime#2719

Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
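
A heavily hedged sketch of the idea in the commit message above, using invented names throughout (vmRuntime, updateStatus, and friends are not CRI-O identifiers): the periodic status-refresh path notices the running-to-stopped transition and reaps the shim process at that point instead of leaving it behind.

```
// Invented names only; this sketches the shape of the fix described above,
// not CRI-O's actual code: the goroutine that refreshes container status
// also cleans up the runtime (shim) process when it observes the container
// move from running to stopped.
package statuscleanup

import (
	"context"
	"log"
)

type state int

const (
	running state = iota
	stopped
)

// vmRuntime stands in for the part of CRI-O that owns the
// containerd-shim-kata-v2 process for a given container.
type vmRuntime interface {
	Status(ctx context.Context, id string) (state, error)
	Cleanup(ctx context.Context, id string) error // shut down / reap the shim
}

// updateStatus is called from the status-update goroutine with the previously
// observed state and returns the new state for the next iteration.
func updateStatus(ctx context.Context, rt vmRuntime, id string, last state) state {
	cur, err := rt.Status(ctx, id)
	if err != nil {
		log.Printf("status %s: %v", id, err)
		return last
	}
	if last == running && cur == stopped {
		// Without this cleanup, the shim would linger until the container
		// is eventually deleted.
		if err := rt.Cleanup(ctx, id); err != nil {
			log.Printf("cleanup %s: %v", id, err)
		}
	}
	return cur
}
```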
fidencio added a commit to fidencio/cri-o that referenced this issue Jul 22, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Jul 22, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Jul 24, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Jul 28, 2020
@fidencio
Member

I'm closing this one as all the patches ended up being merged into CRI-O.

fidencio added a commit to fidencio/cri-o that referenced this issue Aug 2, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Aug 6, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Aug 6, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Aug 10, 2020
fidencio added a commit to fidencio/cri-o that referenced this issue Aug 10, 2020
@haircommander
Contributor

fidencio/cri-o@345c016 is leading to problems where we remove the container on a successful container exit, and then the kubelet tries to remove the container (but it no longer exists in the ctrs map).

@evanfoster

Do we know why the shim uses so much CPU when the containers die? I wonder if it's busy-looping.

evanfoster pushed a commit to evanfoster/cri-o that referenced this issue Aug 20, 2020
When a pod using the VM runtime type stops, the actual runtime process
should also be stopped.

Previously, the runtime process was killed when the pod was deleted. This
works well for many workloads, but causes process leaks when large
numbers of one-shot pods are created (e.g. pods that enter the
"Completed" state in Kubernetes). Those pods will eventually be cleaned
up, but until then a large number of runtime processes will hang around.
The situation is made worse when the runtime process enters a bad state
after its container dies (see
kata-containers/runtime#2719).

Initially, this problem was addressed in
cri-o#3998.
However, that PR worked by actually deleting VM runtime containers on
stop, which led to issues with pods being stuck in a NotReady state
indefinitely.

This PR re-addresses the issue solved by 3998 by sending a shutdown task
to the runtime on pod stop.

Signed-off-by: Evan Foster <efoster@adobe.com>
evanfoster pushed a commit to evanfoster/cri-o that referenced this issue Aug 20, 2020
evanfoster pushed a commit to evanfoster/cri-o that referenced this issue Aug 20, 2020
evanfoster pushed a commit to evanfoster/cri-o that referenced this issue Aug 21, 2020
evanfoster pushed a commit to evanfoster/cri-o that referenced this issue Aug 21, 2020
evanfoster pushed a commit to evanfoster/cri-o that referenced this issue Aug 21, 2020
evanfoster pushed a commit to evanfoster/cri-o that referenced this issue Aug 21, 2020
evanfoster pushed a commit to evanfoster/cri-o that referenced this issue Aug 21, 2020
evanfoster pushed a commit to evanfoster/kata-containers that referenced this issue Aug 22, 2020
When a one-shot pod dies in CRI-O, the shimv2 process isn't killed until
the pod is actually deleted, even though the VM is shut down. In this
case, the shim appears to busyloop when attempting to talk to the (now
dead) agent via VSOCK. To address this, we disconnect from the agent
after the VM is shut down.

This is especially catastrophic for one-shot pods that may persist for
hours or days, but it also applies to any shimv2 pod where Kata is
configured to use VSOCK for communication.

Fixes github.com/kata-containers/runtime#2719

Signed-off-by: Evan Foster <efoster@adobe.com>
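
A simplified, self-contained illustration of the idea in the commit message above (the interfaces and names here are hypothetical, not the kata-runtime types touched by kata-containers#556): once the VM is stopped, the agent connection is torn down so nothing keeps re-dialing the dead VSOCK endpoint.

```
// Hypothetical, heavily simplified types; this only illustrates the ordering
// described above (stop the VM, then drop the agent connection), not the real
// virtcontainers code changed by kata-containers#556.
package agentdisconnect

import "context"

type hypervisor interface {
	StopVM(ctx context.Context) error
}

type agent interface {
	// Disconnect closes the VSOCK connection to the guest agent.
	Disconnect(ctx context.Context) error
}

type sandbox struct {
	hypervisor hypervisor
	agent      agent
}

// stopVM shuts the VM down and then disconnects from the agent, so later RPC
// attempts fail fast instead of busy-looping on connect(AF_VSOCK) -> ENODEV.
func (s *sandbox) stopVM(ctx context.Context) error {
	if err := s.hypervisor.StopVM(ctx); err != nil {
		return err
	}
	return s.agent.Disconnect(ctx)
}
```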
evanfoster pushed a commit to evanfoster/kata-containers that referenced this issue Aug 22, 2020
@evanfoster

evanfoster commented Aug 22, 2020

It was basically busy-looping when trying to talk to the agent in the VM that had been deleted. Assuming it passes review, I'll be backporting kata-containers/kata-containers#556 to Kata 1.X.

For fun, here's the call graph from profiling:
[pprof call graph image: pprof004]

EDIT: I should also note that I think this is the same problem reported in #1917, but focused on shutdown instead of startup.

@egernst egernst added bug Incorrect behaviour and removed bug Incorrect behaviour needs-review Needs to be assessed by the team. labels Aug 24, 2020
evanfoster pushed a commit to evanfoster/kata-containers that referenced this issue Aug 24, 2020
evanfoster pushed a commit to evanfoster/kata-containers that referenced this issue Aug 24, 2020
evanfoster pushed a commit to evanfoster/runtime that referenced this issue Aug 31, 2020
evanfoster pushed a commit to evanfoster/runtime that referenced this issue Aug 31, 2020
fidencio pushed a commit to fidencio/kata-runtime that referenced this issue Sep 11, 2020
jcvenegas pushed a commit to jcvenegas/runtime that referenced this issue Oct 19, 2020
jcvenegas pushed a commit to jcvenegas/runtime that referenced this issue Oct 19, 2020
jcvenegas pushed a commit to jcvenegas/runtime that referenced this issue Oct 20, 2020
wainersm pushed a commit to wainersm/kc-runtime that referenced this issue Nov 18, 2020