Conversation

@cyphar
Member

@cyphar cyphar commented Aug 19, 2025

This is a much simpler alternative to #4815 and includes more self-explanatory tests.
EDIT: #4041 is a similar approach but AFAICS it actually wouldn't work.


In certain deployments, it's possible for runc to be spawned by a
process with a restrictive cpumask (such as from a systemd unit with
CPUAffinity=... configured) which will be inherited by runc and thus the
container process by default.

The cpuset cgroup used to reconfigure the cpumask automatically for
joining processes, but torvalds/linux@da019032819a ("sched: Enforce user
requested affinity") changed this behaviour in Linux 6.2.

The solution is to try to emulate the expected behaviour by resetting
our cpumask to correspond with the configured cpuset (in the case of
runc exec, if the user did not configure an alternative one). Normally
we would have to parse /proc/stat and /sys/fs/cgroup, but luckily
sched_setaffinity(2) will transparently convert an all-set cpumask (even
if it has more entries than the number of CPUs on the system) to the
correct value for our use case.
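A minimal sketch of that trick (assuming Linux; it uses raw syscalls for self-containment, whereas the real code goes through unix.CPUSet from golang.org/x/sys/unix, and resetAffinityAllCPUs is just an illustrative name):

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// resetAffinityAllCPUs (hypothetical name) passes an all-ones 1024-bit mask
// to sched_setaffinity(2); the kernel clamps it to the CPUs the task is
// actually allowed to run on, so we never have to compute the real set
// from /proc/stat or /sys/fs/cgroup ourselves.
func resetAffinityAllCPUs() error {
	var mask [1024 / 64]uint64
	for i := range mask {
		mask[i] = ^uint64(0)
	}
	// pid 0 means "the calling thread".
	_, _, errno := syscall.RawSyscall(syscall.SYS_SCHED_SETAFFINITY,
		0, uintptr(len(mask)*8), uintptr(unsafe.Pointer(&mask[0])))
	if errno != 0 {
		return errno
	}
	return nil
}

func main() {
	if err := resetAffinityAllCPUs(); err != nil {
		fmt.Println("sched_setaffinity:", err)
		return
	}
	fmt.Println("affinity reset to the full allowed set")
}
```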

Closes #4815
Closes #4041
Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

@cyphar cyphar mentioned this pull request Aug 19, 2025
@cyphar cyphar force-pushed the reset-cpu-affinity branch from 1e8b14f to 61891c9 Compare August 19, 2025 07:07
@cyphar
Member Author

cyphar commented Aug 19, 2025

FYI, passing nil as done in MarSik@e6ce3af does not actually work (as I suspected). You need to provide a plausible-looking cpumask.

@cyphar cyphar force-pushed the reset-cpu-affinity branch 10 times, most recently from b0110a8 to a6dcf85 Compare August 19, 2025 17:26
@cyphar
Member Author

cyphar commented Aug 19, 2025

Strangely, the rootless --systemd-cgroup tests appear to indicate that the cpuset is not applied to the cgroup (and so the affinity is reset to all CPUs). Maybe I'm misconfiguring the rootless cgroup setup in the test, or it's some kind of race with systemd (though that seems unlikely), but it seems unrelated to this patch, so I've just worked around it.

# Wrapper for runc.
function runc() {
run __runc "$@"
# Wrapper around "run" that logs output to make tests easier to debug.
Contributor


In fact, bats has --verbose-run now, which kind of removes the need for such a helper. I haven't tried it though; it might not be that useful.

Member Author


I can take a look in a separate PR -- I added this wrapper back in 2016, given all the work since then I'm sure they've finally resolved this (fairly common) pain point by now.

@kolyshkin
Contributor

There's a kernel patch that would help tremendously: https://lore.kernel.org/lkml/20220922180041.1768141-1-longman@redhat.com. With it, an empty cpuset is treated as "set to all available CPUs". Alas, it's not merged (but I think it has been added to RHEL kernels).

It avoids the somewhat slow calculations on every exec, so the fix is way simpler (see #4041).

@cyphar
Member Author

cyphar commented Aug 21, 2025

@kolyshkin AFAICS, that patchset was merged in Linux 6.2 (torvalds/linux@bf57ae2) -- I linked the merged version of patch 3 in the description (torvalds/linux@da019032819a).

As for using the empty set instead, it looks to me that the idea of #4041 was to use task_user_cpus from patch 1 (torvalds/linux@8f9ea86fdf99b). However, this is only used by force_compatible_cpus_allowed_ptr and relax_compatible_cpus_allowed_ptr which are only used by arm64 to force a particular CPU affinity.

AFAICS, using an empty CPU mask will just give you -EINVAL from __set_cpus_allowed_ptr and won't change the affinity:

static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
					 struct affinity_context *ctx,
					 struct rq *rq,
					 struct rq_flags *rf)
{
	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
	const struct cpumask *cpu_valid_mask = cpu_active_mask;
	bool kthread = p->flags & PF_KTHREAD;
	unsigned int dest_cpu;
	int ret = 0;

	/* ... */

	if (!kthread && !cpumask_subset(ctx->new_mask, cpu_allowed_mask)) {
		ret = -EINVAL;
		goto out;
	}

	/* ... */
}

(I also tested this locally and it doesn't work either, even if you ignore -EINVAL.)


Since unix.CPUSet is actually just an array, we could forcefully set all of the values to 0xFFFF_FFFF_FFFF_FFFF (unfortunately the size is architecture-specific and might be 32-bit or 64-bit -- and because it uses an internal type we can't use ^0 or generics to make the compiler happy).

That being said, I really don't think this is going to be a performance problem. The overhead we're talking about is about 1µs (we could cache it with sync.OnceValue, but it only gets called once):

goos: linux
goarch: amd64
pkg: sched
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkCPUSet_Set1024
BenchmarkCPUSet_Set1024-16                        718982              1483 ns/op
BenchmarkCPUSet_Set1024-16                        711818              1433 ns/op
BenchmarkCPUSet_Set1024-16                        809596              1473 ns/op
BenchmarkCPUSet_Set1024-16                        798969              1507 ns/op
BenchmarkCPUSet_Set1024-16                        779511              1505 ns/op
BenchmarkCPUSet_Set1024_OnceValue
BenchmarkCPUSet_Set1024_OnceValue-16            174327886                6.811 ns/op
BenchmarkCPUSet_Set1024_OnceValue-16            168295525                7.161 ns/op
BenchmarkCPUSet_Set1024_OnceValue-16            166972314                7.128 ns/op
BenchmarkCPUSet_Set1024_OnceValue-16            170419455                7.075 ns/op
BenchmarkCPUSet_Set1024_OnceValue-16            174876843                6.901 ns/op
BenchmarkCPUSet_Manual0xFFFF
BenchmarkCPUSet_Manual0xFFFF-16                 277939233                4.388 ns/op
BenchmarkCPUSet_Manual0xFFFF-16                 265586181                4.411 ns/op
BenchmarkCPUSet_Manual0xFFFF-16                 275293765                4.323 ns/op
BenchmarkCPUSet_Manual0xFFFF-16                 272910668                4.408 ns/op
BenchmarkCPUSet_Manual0xFFFF-16                 272473363                4.272 ns/op
BenchmarkSetaffnity_1024
BenchmarkSetaffnity_1024-16                       463239              2256 ns/op
BenchmarkSetaffnity_1024-16                       457530              2353 ns/op
BenchmarkSetaffnity_1024-16                       459678              2337 ns/op
BenchmarkSetaffnity_1024-16                       469249              2338 ns/op
BenchmarkSetaffnity_1024-16                       460647              2272 ns/op
BenchmarkSetaffnity_Set1024_OnceValue
BenchmarkSetaffnity_Set1024_OnceValue-16         1522971               786.6 ns/op
BenchmarkSetaffnity_Set1024_OnceValue-16         1552382               773.9 ns/op
BenchmarkSetaffnity_Set1024_OnceValue-16         1553419               773.5 ns/op
BenchmarkSetaffnity_Set1024_OnceValue-16         1551712               781.7 ns/op
BenchmarkSetaffnity_Set1024_OnceValue-16         1552219               774.2 ns/op
BenchmarkSetaffnity_Manual0xFFFF
BenchmarkSetaffnity_Manual0xFFFF-16              1538379               776.8 ns/op
BenchmarkSetaffnity_Manual0xFFFF-16              1546282               778.9 ns/op
BenchmarkSetaffnity_Manual0xFFFF-16              1534801               779.8 ns/op
BenchmarkSetaffnity_Manual0xFFFF-16              1537387               776.9 ns/op
BenchmarkSetaffnity_Manual0xFFFF-16              1503338               790.1 ns/op
BenchmarkSetaffinity_Empty
BenchmarkSetaffinity_Empty-16                    1654030               724.1 ns/op
BenchmarkSetaffinity_Empty-16                    1583617               754.2 ns/op
BenchmarkSetaffinity_Empty-16                    1579324               756.2 ns/op
BenchmarkSetaffinity_Empty-16                    1597748               752.7 ns/op
BenchmarkSetaffinity_Empty-16                    1587476               750.4 ns/op
PASS
ok      sched   40.927s
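The sync.OnceValue caching mentioned above would look something like this (a sketch; fullCPUSet is an illustrative name and [16]uint64 stands in for unix.CPUSet on a 64-bit platform):

```go
package main

import (
	"fmt"
	"sync"
)

// fullCPUSet (hypothetical name) caches the all-ones mask behind
// sync.OnceValue, so only the first caller pays for the fill loop.
var fullCPUSet = sync.OnceValue(func() [16]uint64 {
	var set [16]uint64
	for i := range set {
		set[i] = ^uint64(0)
	}
	return set
})

func main() {
	fmt.Printf("%#x\n", fullCPUSet()[0]) // 0xffffffffffffffff
}
```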

@cyphar
Member Author

cyphar commented Aug 21, 2025

Another alternative which doesn't require per-architecture code and is 10x faster is:

cpuset := unix.CPUSet{}
for i := range bits.UintSize {
	cpuset.Set(i)
}
for i := 1; i < len(cpuset); i *= 2 {
	copy(cpuset[i:], cpuset[:i])
}

Or even better (if you don't mind hacky solutions):

cpuset := unix.CPUSet{}
for i := range cpuset {
	cpuset[i]-- // underflow to 0xFF..FF
}

(Naive is just copying cpuset[0] to each value.)

goos: linux
goarch: amd64
pkg: sched
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkCPUSet_Set1024
BenchmarkCPUSet_Set1024-16                        688843              1497 ns/op
BenchmarkCPUSet_Set1024-16                        710368              1517 ns/op
BenchmarkCPUSet_Set1024-16                        709746              1513 ns/op
BenchmarkCPUSet_Set1024-16                        706272              1538 ns/op
BenchmarkCPUSet_Set1024-16                        774168              1478 ns/op
BenchmarkCPUSet_Underflow
BenchmarkCPUSet_Underflow-16                    235444515                5.031 ns/op
BenchmarkCPUSet_Underflow-16                    228036662                5.161 ns/op
BenchmarkCPUSet_Underflow-16                    239377489                5.005 ns/op
BenchmarkCPUSet_Underflow-16                    233470130                5.057 ns/op
BenchmarkCPUSet_Underflow-16                    230048280                5.208 ns/op
BenchmarkCPUSet_SetUintsize_CopyNaive
BenchmarkCPUSet_SetUintsize_CopyNaive-16        11296006               103.5 ns/op
BenchmarkCPUSet_SetUintsize_CopyNaive-16        10889908               108.9 ns/op
BenchmarkCPUSet_SetUintsize_CopyNaive-16        11091403               104.1 ns/op
BenchmarkCPUSet_SetUintsize_CopyNaive-16        11666511               105.4 ns/op
BenchmarkCPUSet_SetUintsize_CopyNaive-16        11597973               103.6 ns/op
BenchmarkCPUSet_SetUintsize_CopyBinary
BenchmarkCPUSet_SetUintsize_CopyBinary-16       12549202                94.91 ns/op
BenchmarkCPUSet_SetUintsize_CopyBinary-16       12601789                94.91 ns/op
BenchmarkCPUSet_SetUintsize_CopyBinary-16       12675290                95.59 ns/op
BenchmarkCPUSet_SetUintsize_CopyBinary-16       11692279               100.9 ns/op
BenchmarkCPUSet_SetUintsize_CopyBinary-16       11431554               102.7 ns/op
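A quick sanity check that the three fill strategies (underflow, copy-doubling, naive copy) all produce the same all-ones mask (a standalone sketch; cpuSet stands in for unix.CPUSet and the function names are illustrative):

```go
package main

import (
	"fmt"
	"math/bits"
)

// cpuSet stands in for unix.CPUSet: 1024 bits held as machine words.
type cpuSet [1024 / bits.UintSize]uint

func fillUnderflow() cpuSet {
	var s cpuSet
	for i := range s {
		s[i]-- // unsigned 0 - 1 wraps around to all ones
	}
	return s
}

func fillCopyDoubling() cpuSet {
	var s cpuSet
	s[0] = ^uint(0) // equivalent of setting the first bits.UintSize CPUs
	for i := 1; i < len(s); i *= 2 {
		copy(s[i:], s[:i]) // double the filled prefix each pass
	}
	return s
}

func fillNaive() cpuSet {
	var s cpuSet
	s[0] = ^uint(0)
	for i := range s {
		s[i] = s[0] // copy the first word into every slot
	}
	return s
}

func main() {
	fmt.Println(fillUnderflow() == fillCopyDoubling(),
		fillCopyDoubling() == fillNaive()) // true true
}
```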

@cyphar
Member Author

cyphar commented Aug 21, 2025

I pushed the hacky underflow solution.

@cyphar cyphar force-pushed the reset-cpu-affinity branch 2 times, most recently from eda9b22 to c8e4019 Compare August 21, 2025 01:45
@cyphar cyphar requested a review from lifubang August 27, 2025 13:09
@cyphar cyphar added this to the 1.4.0-rc.1 milestone Aug 27, 2025
@cyphar cyphar force-pushed the reset-cpu-affinity branch from 857e938 to ae6d229 Compare August 27, 2025 16:58
cyphar added 3 commits August 28, 2025 08:23
"runc" was a special wrapper around bats's "run" which output some very
useful diagnostic information to the bats log, but this was not usable
for other commands. So let's make it a more generic helper that we can
use for other commands.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Sometimes we need to run runc through some wrapper (like nohup), but
because "__runc" and "runc" are bash functions in our test suite this
doesn't work trivially -- and you cannot just pass "$RUNC" because
you need to set --root for rootless tests.

So create a setup_runc_cmdline helper which sets $RUNC_CMDLINE to the
beginning cmdline used by __runc (and switch __runc to use that).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
In certain deployments, it's possible for runc to be spawned by a
process with a restrictive cpumask (such as from a systemd unit with
CPUAffinity=... configured) which will be inherited by runc and thus the
container process by default.

The cpuset cgroup used to reconfigure the cpumask automatically for
joining processes, but kcommit da019032819a ("sched: Enforce user
requested affinity") changed this behaviour in Linux 6.2.

The solution is to try to emulate the expected behaviour by resetting
our cpumask to correspond with the configured cpuset (in the case of
"runc exec", if the user did not configure an alternative one). Normally
we would have to parse /proc/stat and /sys/fs/cgroup, but luckily
sched_setaffinity(2) will transparently convert an all-set cpumask (even
if it has more entries than the number of CPUs on the system) to the
correct value for our use case.

For some reason, in our CI it seems that rootless --systemd-cgroup
results in the cpuset (presumably temporarily?) being configured such
that sched_setaffinity(2) will allow the full set of CPUs. For this
particular case, all we care about is that it is different to the
original set, so include some special-casing (but we should probably
investigate this further...).

Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn>
Reported-by: Martin Sivak <msivak@redhat.com>
Reported-by: Peter Hunt <pehunt@redhat.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
@cyphar cyphar force-pushed the reset-cpu-affinity branch from ae6d229 to 121192a Compare August 27, 2025 22:25
@cyphar cyphar added the backport/1.2-todo A PR in main branch which needs to be backported to release-1.2 label Aug 27, 2025
Contributor

@kolyshkin kolyshkin left a comment


LGTM

@cyphar cyphar merged commit cc8ab60 into opencontainers:main Aug 28, 2025
36 checks passed
@cyphar cyphar deleted the reset-cpu-affinity branch August 28, 2025 00:54
@kolyshkin kolyshkin added backport/1.2-done A PR in main branch which has been backported to release-1.2 backport/1.3-done A PR in main branch which has been backported to release-1.3 and removed backport/1.2-todo A PR in main branch which needs to be backported to release-1.2 backport/1.3-todo A PR in main branch which needs to be backported to release-1.3 labels Aug 28, 2025
@brandond
Contributor

@cyphar any idea how far out the next 1.3 release is?

@cyphar
Member Author

cyphar commented Aug 28, 2025

We can probably cut a release now, but I also want to get 1.4.0-rc1 ready soon. There's also at least one other PR for 1.3.1.

@cwayne18

cwayne18 commented Sep 2, 2025

Is there also a plan to backport this for a 1.2.7 release?

@cyphar
Member Author

cyphar commented Sep 3, 2025

@cwayne18 It was already backported to the 1.2.x branch in #4869.

@brandond
Contributor

brandond commented Sep 3, 2025

Awesome! Can we expect to see releases from these branches within the next day or so?

@cyphar
Member Author

cyphar commented Sep 3, 2025

I will be sending release PRs this week.

@e-minguez

@cwayne18 @brandond #4878
