ro-bind /proc/self/exe before copying causes extremely high systemd CPU usage #3925
Comments
This is a duplicate of #2532 afaics.
Using memfds, or merging #3599 (which moves the bindfd stuff into a separate mount namespace). I'm personally not a fan of how complicated #3599 makes this code, which is why it hasn't been merged yet. I'm also working on some kernel patches which will eliminate the need for this entirely and the protections against this will be moved in-kernel.
Yes, this caused some test failures in Kubernetes's e2e tests. However, the issue is only temporary (while runc is first spawning the container), because the runc binary itself exits once container setup is completed. Personally, I think the whole bindfd thing was a bad idea in retrospect; we should've just told Kubernetes that we don't support containers with 5MB memory limits.
I reviewed and merged #3599, which should fix this issue.
Very good, thank you!
I have run tests, and this PR did not solve my problem; from a black-box perspective, it has made it worse. My scenario involves deploying 100 pods simultaneously on a single node, each pod running one container. Before the change, the high CPU usage of systemd caused many pods to fail deployment, though there was a chance the pods could successfully restart. After this PR was merged, the CPU usage of systemd did indeed decrease, but the hang issue still persists: many pods still fail to deploy, and retries continue to fail. Using memfd instead of try_bindfd has proven very effective: all pods succeed in deployment on the first attempt, even when deploying 170 pods at once. I believe try_bindfd should be discarded and replaced solely with the memfd code. I think it's related to #3599 (comment).
Description
Upon deploying 150 running pods on a single node, the CPU usage of systemd consistently remained at 50%. When an additional 100 pods were deployed on the same node, systemd's CPU usage climbed to 99% and the newly deployed pods could not be launched. We suspected the issue was caused by an excessive number of mount points. To investigate, the BCC tools mountsnoop and execsnoop were used to trace mount(2) system calls and process execution, respectively.
mountsnoop was logging extensively.
execsnoop output:
We can see that a large number of mount points were created by runc init.
Analyzing the runc source code, libcontainer/nsenter/cloned_binary.c:
In try_bindfd(), a comment notes: 'We need somewhere to mount it, mounting anything over /proc/self is a BAD idea on the host -- even if we do it temporarily.'
Checking the commit history:
nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying
The usage of memfd_create(2) and other copying techniques is quite
wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
memfd_create(2) added ~10M of memory usage to the cgroup associated with
the container, which can result in some setups getting OOM'd (or just
hogging the hosts' memory when you have lots of created-but-not-started
containers sticking around).
Question 1: Is there a better way to avoid the need for a large number of frequent mount and unmount operations?
Question 2: Is the mentioned 10MB memory usage deducted from the memory quota inside the container?
Steps to reproduce the issue
Deploy more than 100 pods at the same time on a single node with Kubernetes.
Describe the results you received and expected
Expected: the CPU usage of systemd decreases significantly.
What version of runc are you using?
1.1.2
Host OS information
x86
Host kernel information
Linux master1 5.10.0-60.18.0.50.h665.eulerosv2r11.x86_64 #1 SMP Fri Dec 23 16:12:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux @kolyshkin @cyphar