Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libctr/cgroups: don't take init's cgroup into account #3784

Merged
merged 1 commit into from
Apr 4, 2023

Conversation

haircommander
Copy link
Contributor

Sometimes, the init process is not in the root cgroup. This can be noted by GetInitPath, which already scrubs the path of init.scope.

This was encountered when trying to patch the Kubelet to handle systemd being in a separate cpuset from root (to allow load balance disabling for containers). At present, there's no way to have libcontainer or runc manage cgroups in a hierarchy outside of the one init is in (unless the path contains init.scope, which is limiting)

@haircommander haircommander force-pushed the root-cgroup-no-init branch 2 times, most recently from bfbc355 to c7d399a Compare March 27, 2023 17:15
Sometimes, the init process is not in the root cgroup.
This can be noted by GetInitPath, which already scrubs the path of `init.scope`.

This was encountered when trying to patch the Kubelet to handle systemd being in a separate cpuset
from root (to allow load balance disabling for containers). At present, there's no way to have libcontainer or runc
manage cgroups in a hierarchy outside of the one init is in (unless the path contains `init.scope`, which is limiting)

Signed-off-by: Peter Hunt <pehunt@redhat.com>
@kolyshkin
Copy link
Contributor

I traced this feature back to commit 664fc54e65ebc14ca (see moby/moby@664fc54). @mrunalp says it was added for some cases like docker-in-docker.

The biggest issue though is I can't envision what exactly will be broken (if anything) if this code gets removed. Perhaps nothing will. For one thing, I can't find anything like that in crun.

@kolyshkin
Copy link
Contributor

@opencontainers/runc-maintainers what do you think about this?

In short, this allows runc to be used in scenarios where init (systemd, pid 1) process is not in the top-level cgroup. The requirement for this comes from the CPU load balancing API change in recent kernels, which requires, in some scenarios, for systemd to be moved out from the top-level cgroup. From my perspective, there's absolutely no way to work around this.

Now, this change will not break any common setups. It might break some old docker in docker setups on some very old kernels. The problem which the code being removed is trying to solve is now solved by having cgroup namespace in the kernel. Cgroup namespace is available since kernel 4.6, released in May 2016.

@thaJeztah is there a way to test runc binary from this PR (available from Checks -> validate -> Artefacts (scroll to the very bottom of the page, i.e. https://github.com/opencontainers/runc/actions/runs/4534733741) in some kind of DinD scenario?

On a related note, I remember fixing an issue with KinD (kubernetes-sigs/kind#2709) which would not exist if we'd always enable cgroupns. So maybe, together with this, we should be bold and brave and go ahead to enable cgroupns by default in cgroup v1 case, if available (currently we only do this with cgroup v2).

@mrunalp
Copy link
Contributor

mrunalp commented Mar 31, 2023

I traced this feature back to commit 664fc54e65ebc14ca (see moby/moby@664fc54). @mrunalp says it was added for some cases like docker-in-docker.

That looks older than when I thought it was first added. So, I am not sure that it was related to docker in docker but perhaps non-standard v1 cgroups layouts back then across distros which shouldn't be a case now.

I am 👍 to merging this.

@haircommander
Copy link
Contributor Author

another question, in addition to maintainer's comfort with it, is is this PR worthy of backporting to release-1.1?

Copy link
Contributor

@mrunalp mrunalp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@haircommander
Copy link
Contributor Author

@AkihiroSuda @cyphar @thaJeztah PTAL

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kolyshkin kolyshkin closed this Mar 31, 2023
@kolyshkin kolyshkin reopened this Mar 31, 2023
@haircommander
Copy link
Contributor Author

@AkihiroSuda @cyphar @thaJeztah PTAL

@thaJeztah
Copy link
Member

@thaJeztah is there a way to test runc binary from this PR (available from Checks -> validate -> Artefacts (scroll to the very bottom of the page, i.e. https://github.com/opencontainers/runc/actions/runs/4534733741) in some kind of DinD scenario?

Wah, sorry for the delay. Been kinda busy.

I guess one way would be to pull a dind image from https://hub.docker.com/_/docker, and either copy the runc binary into the image, or to bind-mount it to swap the one in the image. 🤔

@kolyshkin
Copy link
Contributor

I guess one way would be to pull a dind image from https://hub.docker.com/_/docker, and either copy the runc binary into the image, or to bind-mount it to swap the one in the image.

I've tested this on CentOS 7 (kernel 3.10, no cgroupns) using docker 1.31.1 (RH fork) and the runc binary from this PR inside the docker:dind image. Works fine, the containers created inside dind are added to the proper cgroups, the limits are set (tested with docker run --memory). So, it works inside dind on an older without cgroupns.

I haven't tried the new runc on the host because docker 1.13.1 uses ancient runc with --console option removed by #1018. So I have installed one from docker.com, and tried to use it with runc from this PR on the host AND inside the docker:dind container. Works fine.

Changed the docker cgroup driver to systemd, works fine.

To sum it up -- I was not able to find a scenario in which the kludge which is removed by this PR is needed.

So, unless someone from @opencontainers/runc-maintainers says "no", let's merge this tomorrow.

@kolyshkin kolyshkin merged commit cc60a39 into opencontainers:main Apr 4, 2023
@kolyshkin
Copy link
Contributor

One more post-merge consideration.

This is a part of systemd v1 driver, and so to test it out we would want to use dind with systemd. Which is not possible, because the official docker:dind image is running inside a bare-bone alpine linux container, which does not have systemd running. The error message is:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: systemd not running on this host, cannot use systemd cgroups manager: unknown.

I am not sure how to test dind with systemd.

@kolyshkin
Copy link
Contributor

Some more archaeology and post-merge reflection...

Commit 2b28b3c (PR #194) removed getting init cgroup because in case of host pidns this points to a different hierarchy. This change was done for the fs driver only, not for the systemd driver (which already existed at that time).

Taking into account that

  • DIND was/is never used with systemd driver (I guess)
  • the code removed by this PR initially came as a copy-paste from the fs driver
  • the copy-paste happened before the above mentioned commit 2b28b3c

Thus, the code is not needed.

The other question is, do we need to use something like GetOwnCgroupPath for dind+systemd case? Maybe, but the thing is, there's no such thing as dind+systemd.

Fingers crossed that cgroup v1 will be obsoleted soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.1-done A PR in main branch which has been backported to release-1.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants