[1.1] nsenter: cloned_binary: remove bindfd logic entirely #4392

cyphar · 2024-09-03T03:24:49Z

Backport of #3931.

While the ro-bind-mount trick did eliminate the memory overhead of copying the runc binary for each "runc init" invocation, on machines with very significant container churn, creating a temporary mount namespace on every container invocation can trigger severe lock contention on namespace_sem that makes containers fail to spawn.

The only reason we added bindfd in commit 16612d7 ("nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying") was due to a Kubernetes e2e test failure where they had a ridiculously small memory limit. It seems incredibly unlikely that real workloads are running without 10MB to spare for the very short time that runc is interacting with the container.

In addition, since the original cloned_binary implementation, cgroupv2 has become almost universal on modern systems. Unlike cgroupv1, the cgroupv2 memcg implementation does not migrate memory usage when processes change cgroups (and even cgroupv1 only did this if memory.move_charge_at_immigrate was enabled). Furthermore, because we do the /proc/self/exe clone before synchronising the bootstrap data read, the clone is guaranteed to happen before "runc init" is moved into the container cgroup. This means the memory used by the /proc/self/exe clone is charged against the root cgroup, so container workloads should not be affected at all by memfd cloning.
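To make the charging argument above easier to check by hand, here is a minimal sketch (not part of runc) that reports which cgroup v2 group a process currently belongs to by parsing /proc/<pid>/cgroup. It relies on the documented format of that file, where the unified hierarchy appears as an entry starting with `0::`. The helper names `cgroup_v2_path` and `current_cgroup_v2_path` are hypothetical and exist only for illustration.

```python
def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path found in /proc/<pid>/cgroup contents, or None.

    Each line has the form "hierarchy-ID:controller-list:cgroup-path".
    The cgroup v2 (unified) entry has hierarchy ID 0 and an empty controller list.
    """
    for line in proc_cgroup_text.splitlines():
        if line.count(":") < 2:
            continue  # skip blank or malformed lines
        hierarchy_id, controllers, path = line.split(":", 2)
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def current_cgroup_v2_path(pid="self"):
    """Read /proc/<pid>/cgroup and return that process's cgroup v2 path, or None."""
    with open(f"/proc/{pid}/cgroup") as f:
        return cgroup_v2_path(f.read())
```

Calling `current_cgroup_v2_path()` at the point where a copy is made would show the group the process is in at that moment, which per the reasoning above is where that memory would be charged.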

The long-term fix for this problem is to block the /proc/self/exe re-opening attack entirely in-kernel, which is something I'm working on. It should also be noted that because the memfd is completely separate from the host binary, the memfd approach can defend against even attacks like Dirty COW against the runc binary. Of course, once we have in-kernel protection against the /proc/self/exe re-opening attack, we won't have that protection anymore...

(This is a cherry-pick of b999376.)

Reference for the in-kernel work mentioned above: https://lwn.net/Articles/934460/

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

kolyshkin

lgtm

cyphar added the area/nsenter and backport/1.1-pr labels Sep 3, 2024

cyphar added this to the 1.1.15 milestone Sep 3, 2024

lifubang approved these changes Sep 3, 2024


kolyshkin approved these changes Sep 3, 2024


kolyshkin merged commit bd671b6 into opencontainers:release-1.1 Sep 3, 2024
28 checks passed

cyphar deleted the 1.1-remove-bindfd branch September 4, 2024 04:55

rata mentioned this pull request Oct 2, 2024

Release v1.1.15 #4422

Merged

This was referenced Oct 7, 2024

update runc binary to 1.1.15 containerd/containerd#10787

Merged

[release/1.6] update runc binary to 1.1.15 containerd/containerd#10795

Closed

cyphar mentioned this pull request Oct 8, 2024

runc 1.1.15 OOMs in Kubernetes e2e tests with containerd, cgroup v2, and cgroupfs driver #4427

Open

austinvazquez mentioned this pull request Oct 9, 2024

Dockerfile: update runc binary to 1.1.15 moby/buildkit#5417

Closed

github-actions bot mentioned this pull request Oct 13, 2024

Bump runc from v1.1.13 to v1.1.15 kokyhm/kubespray#52

Open

lifubang mentioned this pull request Oct 16, 2024

[1.1] join the cgroup after the initial setup finished #4439

Open

borrelm mentioned this pull request Nov 4, 2024

Add by default runc as known_memfd_execution_binaries falcosecurity/rules#266

Open
