entrypoint: fix chicken-and-egg runtime problem #2709
Conversation
Welcome @kolyshkin!

Hi @kolyshkin. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I briefly tested the code locally:

kir@ubu2110:~$ sudo docker info
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Docker Buildx (Docker Inc., v0.8.1-docker)
scan: Docker Scan (Docker Inc., v0.17.0)
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 1
Server Version: 20.10.14
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3df54a852345ae127d1fa3092b95168e4a88e2f8
runc version: v1.0.3-0-gf46b6ba
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.13.0-39-generic
Operating System: Ubuntu 21.10
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 3.832GiB
Name: ubu2110
ID: 4DH7:JJYA:QRUK:3M6K:764C:N5XI:RXZL:3PK3:RZFV:LLSJ:KHL6:D4PQ
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
kir@ubu2110:~$ cat xxx
function fix_cgroup() {
local current_cgroup
current_cgroup=$(grep -E '^[^:]*:([^:]*,)?cpu(,[^,:]*)?:.*' /proc/self/cgroup | cut -d: -f3)
local cgroup_subsystems
cgroup_subsystems=$(findmnt -lun -o source,target -t cgroup | grep "${current_cgroup}" | awk '{print $2}')
# Unmount the cgroup subsystems that are not known to runtime used to
# run the container we are in. Those subsystems are not properly scoped
# (i.e. the root cgroup is exposed, rather than something like docker/xxxx).
# In case a runtime (which is aware of more subsystems -- such as rdma,
# misc, or unified) is used inside the container, it may create cgroups for
# these subsystems, and as they are not scoped, they will leak to the host
# and thus will become non-removable.
#
# See https://github.com/kubernetes/kubernetes/issues/109182
local unsupported_cgroups
unsupported_cgroups=$(findmnt -lun -o source,target -t cgroup | grep -v "${current_cgroup}" | awk '{print $2}')
local mnt
echo "$unsupported_cgroups" |
while IFS= read -r mnt; do
echo "INFO: unmounting and removing $mnt"
umount "$mnt" || true
rmdir "$mnt" || true
done
}
fix_cgroup
kir@ubu2110:~$ cat xxx | sudo docker run -i --rm --privileged ubuntu bash
INFO: unmounting and removing /sys/fs/cgroup/rdma
INFO: unmounting and removing /sys/fs/cgroup/misc
/assign @BenTheElder @AkihiroSuda
A few important things worth noting/reiterating:
(force-pushed from f9171d9 to 35c38bd)
#
# See: https://man7.org/linux/man-pages/man7/cgroups.7.html
local current_cgroup
current_cgroup=$(grep -E '^[^:]*:([^:]*,)?cpu(,[^,:]*)?:.*' /proc/self/cgroup | cut -d: -f3)
if [ "$current_cgroup" = "/" ]; then
  echo "INFO: cgroupns detected, no need to fix cgroups"
For simplicity I'd suggest always enabling --cgroupns=private. This is not present in the Docker 19.03 CLI, but 19.03 has already reached EOL.
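For reference, the detection this suggestion builds on (a minimal sketch reusing the same check as the diff above: on cgroup v1, a private cgroup namespace makes the controller paths in /proc/self/cgroup show up as "/"):

# Sample the cpu controller's path; "/" means we are at the root of our
# own (private) cgroup namespace, so no cgroup fixup is needed.
current_cgroup=$(grep -E '^[^:]*:([^:]*,)?cpu(,[^,:]*)?:.*' /proc/self/cgroup | cut -d: -f3)
if [ "$current_cgroup" = "/" ]; then
  echo "INFO: private cgroupns detected"
fi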
I concur (see #2709 (comment) which I wrote yesterday but forgot to click "Comment").
So, we can document that KIND now requires --cgroupns=private, and replace the whole fix_cgroup with a check that we got a private cgroupns.
I'm not sure though if it will work for other environments. It does work for podman (checked with the version bundled with Ubuntu 21.10). The script also mentions "Cloud Shell" (not sure what that is, but it seems to be k8s-based) and GitHub Actions.

In fact, I think I know what to do. Please see the last commit (just updated).
Google Cloud Shell supports Docker 20.10, so --cgroupns=private should work.
Confirmed
$ cat /proc/self/cgroup
...
2:cpu,cpuacct:/kubepods/besteffort/pod...
...
$ docker run -it --rm --cgroupns=private alpine cat /proc/self/cgroup
...
2:cpu,cpuacct:/
...
Added another commit which detects cgroupns on cgroup v1 and skips the cgroup setup. Indeed, using --cgroupns=private works here.
We could go one step further and require cgroupns instead.
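A hypothetical sketch of that stricter alternative (not what this PR ends up doing), assuming the entrypoint would simply refuse to start without a private cgroupns:

# Hypothetical hard requirement: fail fast unless every controller path in
# /proc/self/cgroup is "/", i.e. we were given a private cgroup namespace.
if grep -qv ':/$' /proc/self/cgroup; then
  echo "ERROR: kind requires a private cgroupns (e.g. docker run --cgroupns=private)" >&2
  exit 1
fi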
  return
fi

echo 'WARNING: cgroupns is not enabled! Please run with --cgroupns=private,'
This might be confusing to end users because the kind CLI does not have a --cgroupns flag.
Maybe better to just suggest cgroup v2?
As kind is used to test kubernetes, we need to support both v1 and v2, otherwise we won't test it.
What about something like this:
echo 'WARNING: cgroupns is not enabled! Please use cgroup v2 or cgroup v1 with cgroupns enabled'
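In context, the branching might look roughly like this (a sketch, not taken from the diff; the cgroup.controllers test is the conventional way to detect a unified cgroup v2 mount, and current_cgroup is the variable computed earlier in the entrypoint):

# cgroup v2: a unified hierarchy exposes cgroup.controllers at its root.
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  echo 'INFO: cgroup v2 detected'
elif [ "$current_cgroup" = "/" ]; then
  echo 'INFO: cgroup v1 with private cgroupns, no fixup needed'
else
  echo 'WARNING: cgroupns is not enabled! Please use cgroup v2 or cgroup v1 with cgroupns enabled'
fi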
Modified accordingly, PTAL
/ok-to-test
fi

echo 'WARN: cgroupns not enabled! Please use cgroup v2, or cgroup v1 with cgroupns enabled.'
echo 'WARN: Support for host cgroup namespace will be removed!'
I'm not sure we're actually going to do that, hacky as it is.
The line above is good, I think this is my only nit with this PR.
My thinking was that such a message should encourage more users to enable cgroupns, without forcing us to actually do anything (as the message does not specify when such support will be removed).
Anyway, the message is removed, and I rewrote the TODO item into a NOTE.
PR updated, PTAL @BenTheElder
+1
BTW, for the record: Kubernetes's CI is unfortunately still one of these places.
Thank you very much for looking into this issue and for the PR 🙏
[missed clicking "comment" on this earlier]
I would prefer not to break existing users where a workaround is possible; AFAICT 72433e5 seems like a pretty elegant workaround, thanks!
Implement shellcheck suggestion:

> SC2126 (style): Consider using grep -c instead of grep|wc -l.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case the runtime used to run the KIND container is not aware of some cgroup subsystems, those will be exposed to the container without proper scoping (note the rdma and misc):

kir@ubu2110:~$ sudo docker run -i --rm --privileged ubuntu sh -xc 'cat /proc/self/cgroup; grep cgroup /proc/self/mountinfo'
+ cat /proc/self/cgroup
13:pids:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
12:net_cls,net_prio:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
11:hugetlb:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
10:misc:/
9:freezer:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
8:devices:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
7:cpu,cpuacct:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
6:perf_event:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
5:memory:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
4:blkio:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
3:rdma:/
2:cpuset:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
1:name=systemd:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
0::/system.slice/containerd.service
+ grep cgroup /proc/self/mountinfo
666 665 0:65 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755,inode64
667 666 0:32 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:11 - cgroup cgroup rw,xattr,name=systemd
668 666 0:35 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,cpuset
669 666 0:36 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,rdma
670 666 0:37 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,blkio
671 666 0:38 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,memory
672 666 0:39 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,perf_event
673 666 0:40 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:20 - cgroup cgroup rw,cpu,cpuacct
674 666 0:41 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:21 - cgroup cgroup rw,devices
675 666 0:42 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:22 - cgroup cgroup rw,freezer
676 666 0:43 / /sys/fs/cgroup/misc rw,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,misc
677 666 0:44 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime master:24 - cgroup cgroup rw,hugetlb
678 666 0:45 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:25 - cgroup cgroup rw,net_cls,net_prio
679 666 0:46 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime master:26 - cgroup cgroup rw,pids

Now, if a newer runtime (one that is aware of e.g. the rdma subsystem) is used inside this container, it may create cgroups under those subsystems. Since those are not properly scoped, they will leak to the host and thus will become non-removable (EBUSY on rmdir).

The workaround, as implemented here, is to hide (unmount and remove) such unscoped subsystems.

Fixes kubernetes/kubernetes#109182

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
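To make the leak concrete, here is a hypothetical reproduction sketch (the "leaked" cgroup name is illustrative): because the rdma hierarchy above is mounted unscoped (its mount root is "/"), a mkdir inside the container creates the cgroup at the root of the host's hierarchy.

# Inside a privileged container sharing the host cgroupns:
docker run -i --rm --privileged ubuntu sh -c 'mkdir /sys/fs/cgroup/rdma/leaked'
# Back on the host, the directory is visible at the hierarchy root;
# rmdir can fail with EBUSY while the cgroup is still in use.
ls -d /sys/fs/cgroup/rdma/leaked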
In case the KIND container is run on a cgroup v1 host with --cgroupns=private, there is no need to do any dances with cgroups. Detect this and skip the cgroup setup.

Otherwise, print a warning that cgroupns is not enabled, encouraging users to enable it.

Add a comment to help future generations of kind maintainers figure out why this code is needed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It is probably non-default for the sake of such workarounds as we have here. In fact, this is yet another chicken-and-egg problem. I mean, nothing should break when cgroupns is enabled for cgroup v2, except some kludges like this. In an ideal world, cgroupns would have been enabled by default, and users would adapt (by either disabling it or removing the workarounds/kludges). I hope this will eventually happen, but remembering the time it took to fix the bad default for the container's IPC mode, it can take a few years.
Rebased; addressed the review comments.
neat, thanks for the great work
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, kolyshkin

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
#
# See https://github.com/kubernetes/kubernetes/issues/109182
local unsupported_cgroups
unsupported_cgroups=$(findmnt -lun -o source,target -t cgroup | grep -v "${current_cgroup}" | awk '{print $2}') |
For future reference: #2743

We missed that grep exits non-zero (1) on no match, and no match is possible here; plus we have pipefail on to catch errors.

Easy fix!
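A minimal sketch of the failure mode and one easy fix (illustrative; the current_cgroup value is a stand-in):

set -euo pipefail
current_cgroup=/docker/example   # illustrative value
# Broken: if every cgroup mount matches ${current_cgroup}, grep -v matches
# nothing, exits 1, pipefail propagates that, and set -e aborts the script.
# Fixed: append `|| true` so an empty result is not treated as an error.
unsupported_cgroups=$(findmnt -lun -o source,target -t cgroup | grep -v "${current_cgroup}" | awk '{print $2}' || true)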
podman-in-github-actions has all the cgroups