Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

entrypoint: fix chicken-and-egg runtime problem #2709

Merged
merged 4 commits into from
Apr 13, 2022

Conversation

kolyshkin
Copy link
Contributor

In case the runtime used to run the KIND container is not aware of some
cgroup subsystems, those will be exposed to the container without proper
scoping (note the rdma and misc):

kir@ubu2110:~$ sudo docker run -i --rm --privileged ubuntu sh -xc 'cat /proc/self/cgroup; grep cgroup /proc/self/mountinfo'
+ cat /proc/self/cgroup
13:pids:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
12:net_cls,net_prio:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
11:hugetlb:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
10:misc:/
9:freezer:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
8:devices:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
7:cpu,cpuacct:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
6:perf_event:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
5:memory:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
4:blkio:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
3:rdma:/
2:cpuset:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
1:name=systemd:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
0::/system.slice/containerd.service
+ grep cgroup /proc/self/mountinfo
666 665 0:65 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755,inode64
667 666 0:32 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:11 - cgroup cgroup rw,xattr,name=systemd
668 666 0:35 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,cpuset
669 666 0:36 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,rdma
670 666 0:37 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,blkio
671 666 0:38 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,memory
672 666 0:39 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,perf_event
673 666 0:40 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:20 - cgroup cgroup rw,cpu,cpuacct
674 666 0:41 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:21 - cgroup cgroup rw,devices
675 666 0:42 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:22 - cgroup cgroup rw,freezer
676 666 0:43 / /sys/fs/cgroup/misc rw,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,misc
677 666 0:44 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime master:24 - cgroup cgroup rw,hugetlb
678 666 0:45 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:25 - cgroup cgroup rw,net_cls,net_prio
679 666 0:46 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime master:26 - cgroup cgroup rw,pids

Now, if a newer runtime (the one that is aware of e.g. rdma subsystem)
will be used inside this container, it may create cgroups under those
subsystems. Since those are not properly scoped, they will leak to the
host and thus will become non-removable (EBUSY on rmdir).

The workaround, as implemented here, is to hide (unmount and remove)
such unscoped subsystemd.

Fixes kubernetes/kubernetes#109182

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 6, 2022
@k8s-ci-robot
Copy link
Contributor

Welcome @kolyshkin!

It looks like this is your first PR to kubernetes-sigs/kind 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kind has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 6, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @kolyshkin. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested review from aojea and munnerz April 6, 2022 20:08
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 6, 2022
@kolyshkin
Copy link
Contributor Author

I briefly tested the code locally:

kir@ubu2110:~$ sudo docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.1-docker)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 20.10.14
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3df54a852345ae127d1fa3092b95168e4a88e2f8
 runc version: v1.0.3-0-gf46b6ba
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.13.0-39-generic
 Operating System: Ubuntu 21.10
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 3.832GiB
 Name: ubu2110
 ID: 4DH7:JJYA:QRUK:3M6K:764C:N5XI:RXZL:3PK3:RZFV:LLSJ:KHL6:D4PQ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

kir@ubu2110:~$ cat xxx
function fix_cgroup() {
  local current_cgroup
  current_cgroup=$(grep -E '^[^:]*:([^:]*,)?cpu(,[^,:]*)?:.*' /proc/self/cgroup | cut -d: -f3)
  local cgroup_subsystems
  cgroup_subsystems=$(findmnt -lun -o source,target -t cgroup | grep "${current_cgroup}" | awk '{print $2}')
  # Unmount the cgroup subsystems that are not known to runtime used to
  # run the container we are in. Those subsystems are not properly scoped
  # (i.e. the root cgroup is exposed, rather than something like docker/xxxx).
  # In case a runtime (which is aware of more subsystems -- such as rdma,
  # misc, or unified) is used inside the container, it may create cgroups for
  # these subsystems, and as they are not scoped, they will leak to the host
  # and thus will become non-removable.
  #
  # See https://github.com/kubernetes/kubernetes/issues/109182
  local unsupported_cgroups
  unsupported_cgroups=$(findmnt -lun -o source,target -t cgroup | grep -v "${current_cgroup}" | awk '{print $2}')
  local mnt
  echo "$unsupported_cgroups" |
  while IFS= read -r mnt; do
    echo "INFO: unmounting and removing $mnt"
    umount "$mnt" || true
    rmdir "$mnt" || true
  done
}

fix_cgroup
kir@ubu2110:~$ cat xxx | sudo docker run -i --rm --privileged ubuntu bash
INFO: unmounting and removing /sys/fs/cgroup/rdma
INFO: unmounting and removing /sys/fs/cgroup/misc

@aojea
Copy link
Contributor

aojea commented Apr 6, 2022

/assign @BenTheElder @AkihiroSuda
/assign

@kolyshkin
Copy link
Contributor Author

A few important things worth noting/reiterating:

  1. This workaround only makes sense for cgroup v1. For v2, we use cgroupns.

  2. Cgroup v1 setups can probably also use cgroupns to eliminate the issue (and in this case the whole fix_cgroup won't be needed!), but I haven't looked into it.

  3. This workaround only makes sense in a situation when the runtime used to create a KIND container is sufficiently newer than the runtime used inside the container (as in, runc 1.0.x on the host and runc 1.1.x inside a KIND container). If the inner/outer runtimes are the same, subsystems unknown to them are not used. If the outer runtime is newer, inner runtime still doesn't know about new subsystems. OTOH having it does no harm.

  4. The existing issue with rdma and unified subsystems may repeat in the future with the misc subsystemd if/once runc starts to support it. Same would happen with any other new cgroup v1 subsystem that kernel might want to add (and runc -- to support). So this workaround still makes sense to have, basically eliminating any non-scoped subsystems from the container.

  5. For some reason, I do not see /sys/fs/cgroup/unified (aka hybrid hierarchy) being mounted in my setup, but it seems it is the case for kubernetes CI (as unified appears to be mounted). Thus, I haven't checked if my code fixes the issue with unified, but I believe it is.

  6. I also checked it with podman+crun (as installed from the default ubuntu repo) on the same Ubuntu 21.10 machine. In there, all the cgroups (even misc) are properly scoped. So, yet another possible solution to this problem is to replace docker with podman (on the host). Might bring other issues, of course.

@kolyshkin kolyshkin force-pushed the fix-cgroups branch 2 times, most recently from f9171d9 to 35c38bd Compare April 7, 2022 00:37
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 7, 2022
#
# See: https://man7.org/linux/man-pages/man7/cgroups.7.html
local current_cgroup
current_cgroup=$(grep -E '^[^:]*:([^:]*,)?cpu(,[^,:]*)?:.*' /proc/self/cgroup | cut -d: -f3)
if [ "$current_cgroup" = "/" ]; then
echo "INFO: cgroupns detected, no need to fix cgroups"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For simplicity I'd suggest always enabling --cgroupns=private.
This is not present in Docker 19.03 CLI, but 19.03 has already reached EOL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur (see #2709 (comment) which I wrote yesterday but forgot to click "Comment").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we can document that KIND now requires --cgroupns=private, and replace the whole fix_cgroup with a check that we got a private cgroupns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure though it if will work for other environments. It does work for podman (checked with a version bundled with Ubuntu 21.10). The script also mentions "Cloud Shell" (not sure what's that but seems to be k8s based) and GitHub Actions.

In fact, I think I know what to do. Please see the last commit (just updated).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google Cloud Shell supports Docker 20.10, so --cgroupns=private should work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed

$ cat /proc/self/cgroup
...
2:cpu,cpuacct:/kubepods/besteffort/pod...
...

$ docker run -it --rm --cgroupns=private alpine cat /proc/self/cgroup
...
2:cpu,cpuacct:/
...

@kolyshkin
Copy link
Contributor Author

Added another commit with detects cgroupns on cgroup v1 and skips the cgroup setup.

Indeed, using cgroupns=private should

  • eliminate the issue we're fixing here
  • eliminate the need for complex cgroup setup.

We could go one step further and require cgroupns instead.

return
fi

echo 'WARNING: cgroupns is not enabled! Please run with --cgroupns=private,'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be confusing to end users because the kind CLI does not have --cgroupns flag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to just suggest cgroup v2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As kind is used to test kubernetes, we need to support both v1 and v2, otherwise we won't test it.

What about something like this:

  echo 'WARNING: cgroupns is not enabled! Please use cgroup v2 or cgroup v1 with cgroupns enabled'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified accordingly, PTAL

@aojea
Copy link
Contributor

aojea commented Apr 8, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 8, 2022
fi

echo 'WARN: cgroupns not enabled! Please use cgroup v2, or cgroup v1 with cgroupns enabled.'
echo 'WARN: Support for host cgroup namespace will be removed!'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we're actually going to do that, hacky as it is.
The line above is good, I think this is my only nit with this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that such a message should encourage more users to enable cgroupns, and not force us to actually do anything (as the message does not specify when such support will be removed).

Anyway, message removed, and I rewrote the TODO item into the NOTE.

PR updated, PTAL @BenTheElder

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

BTW, For the record: Kubernetes's CI is unfortunately one of these places still.

@BenTheElder
Copy link
Member

Thank you very much for looking into this issue and for the PR 🙏
I only have one complaint #2709 (comment), otherwise we should push a new image, test and merge this @aojea

@BenTheElder
Copy link
Member

[missed clicking "comment" on this earlier]

We could go one step further and require cgroupns instead.

I would prefer not to break existing users where it is possible to workaround, AFAICT --cgroupns=private is a non-default dockerd setting, but at least with time cgroupsv2 will take over anyhow ... 🤞

72433e5 seems like a pretty elegant workaround thanks!

Implement shellcheck suggestion:

> SC2126 (style): Consider using grep -c instead of grep|wc -l.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case the runtime used to run the KIND container is not aware of some
cgroup subsystems, those will be exposed to the container without proper
scoping (note the rdma and misc):

	kir@ubu2110:~$ sudo docker run -i --rm --privileged ubuntu sh -xc 'cat /proc/self/cgroup; grep cgroup /proc/self/mountinfo'
	+ cat /proc/self/cgroup
	13:pids:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	12:net_cls,net_prio:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	11:hugetlb:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	10:misc:/
	9:freezer:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	8:devices:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	7:cpu,cpuacct:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	6:perf_event:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	5:memory:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	4:blkio:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	3:rdma:/
	2:cpuset:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	1:name=systemd:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	0::/system.slice/containerd.service
	+ grep cgroup /proc/self/mountinfo
	666 665 0:65 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755,inode64
	667 666 0:32 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:11 - cgroup cgroup rw,xattr,name=systemd
	668 666 0:35 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,cpuset
	669 666 0:36 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,rdma
	670 666 0:37 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,blkio
	671 666 0:38 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,memory
	672 666 0:39 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,perf_event
	673 666 0:40 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:20 - cgroup cgroup rw,cpu,cpuacct
	674 666 0:41 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:21 - cgroup cgroup rw,devices
	675 666 0:42 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:22 - cgroup cgroup rw,freezer
	676 666 0:43 / /sys/fs/cgroup/misc rw,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,misc
	677 666 0:44 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime master:24 - cgroup cgroup rw,hugetlb
	678 666 0:45 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:25 - cgroup cgroup rw,net_cls,net_prio
	679 666 0:46 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime master:26 - cgroup cgroup rw,pids

Now, if a newer runtime (the one that is aware of e.g. rdma subsystem)
will be used inside this container, it may create cgroups under those
subsystems. Since those are not properly scoped, they will leak to the
host and thus will become non-removable (EBUSY on rmdir).

The workaround, as implemented here, is to hide (unmount and remove)
such unscoped subsystemd.

Fixes kubernetes/kubernetes#109182

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case the KIND container is run on a cgroup v1 host with --cgroupns=private,
there is no need to do any dances with cgroups.

Detect this and skip cgroup setup.

Otherwise, print a warning that the host cgroupns support is not
enabled, encouraging users to do so. Add a comment to the text to
help future generation of kind maintainers to figure out why this
code is needed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor Author

I would prefer not to break existing users where it is possible to workaround, AFAICT --cgroupns=private is a non-default dockerd setting, but at least with time cgroupsv2 will take over anyhow ... 🤞

It is probably a non-default for the sake of such workarounds that we have here. In fact, this is yet another chicken and egg problem. I mean, nothing should break when cgroupns is enabled for cgroup v2, except some kludges like this.

In the ideal word, cgroupns should have been enabled by default, and the users would adopt (by either disabling it or removing the workarounds/kludges). I hope this will eventually happen, but remembering the time it took to fix a bad default for container's IPC mode, it can take a few years.

@kolyshkin
Copy link
Contributor Author

Rebased; addressed the review comments.

@aojea
Copy link
Contributor

aojea commented Apr 13, 2022

neat, thanks for the great work
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 13, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, kolyshkin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 13, 2022
@aojea
Copy link
Contributor

aojea commented Apr 13, 2022

#30 [linux/s390x stage-0 4/6] RUN go mod download
#0 0.540 Segmentation fault (core dumped)
#30 ERROR: process "/bin/sh -c go mod download" did not complete successfully: exit code: 139

/test pull-kind-build

@k8s-ci-robot k8s-ci-robot merged commit 39a7f09 into kubernetes-sigs:main Apr 13, 2022
#
# See https://github.com/kubernetes/kubernetes/issues/109182
local unsupported_cgroups
unsupported_cgroups=$(findmnt -lun -o source,target -t cgroup | grep -v "${current_cgroup}" | awk '{print $2}')
Copy link
Member

@BenTheElder BenTheElder May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference: #2743

We missed that grep will exit non-zero (1) on no-match and no-match is possible here + we have pipefail on to catch errors.

Easy fix!

podman-in-github-actions has all the cgroups

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
5 participants