Conversation

@cyphar
Member

@cyphar cyphar commented Aug 19, 2025

This is a much simpler alternative to #4815 and includes more self-explanatory tests.
EDIT: #4041 is a similar approach but AFAICS it actually wouldn't work.


In certain deployments, it's possible for runc to be spawned by a
process with a restrictive cpumask (such as from a systemd unit with
CPUAffinity=... configured) which will be inherited by runc and thus the
container process by default.

The cpuset cgroup used to reconfigure the cpumask automatically for
joining processes, but torvalds/linux@da019032819a ("sched: Enforce user
requested affinity") changed this behaviour in Linux 6.2.

The solution is to try to emulate the expected behaviour by resetting
our cpumask to correspond with the configured cpuset (in the case of
runc exec, if the user did not configure an alternative one). Normally
we would have to parse /proc/stat and /sys/fs/cgroup, but luckily
sched_setaffinity(2) will transparently convert an all-set cpumask (even
if it has more entries than the number of CPUs on the system) to the
correct value for our use case.
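A minimal sketch of that trick (assuming Linux; it uses raw syscalls for self-containment, whereas the real code goes through unix.CPUSet from golang.org/x/sys/unix, and resetAffinityAllCPUs is just an illustrative name):

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// resetAffinityAllCPUs (hypothetical name) passes an all-ones 1024-bit mask
// to sched_setaffinity(2); the kernel clamps it to the CPUs the task is
// actually allowed to run on, so we never have to compute the real set
// from /proc/stat or /sys/fs/cgroup ourselves.
func resetAffinityAllCPUs() error {
	var mask [1024 / 64]uint64
	for i := range mask {
		mask[i] = ^uint64(0)
	}
	// pid 0 means "the calling thread".
	_, _, errno := syscall.RawSyscall(syscall.SYS_SCHED_SETAFFINITY,
		0, uintptr(len(mask)*8), uintptr(unsafe.Pointer(&mask[0])))
	if errno != 0 {
		return errno
	}
	return nil
}

func main() {
	if err := resetAffinityAllCPUs(); err != nil {
		fmt.Println("sched_setaffinity:", err)
		return
	}
	fmt.Println("affinity reset to the full allowed set")
}
```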

Closes #4815
Closes #4041
Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

@cyphar cyphar mentioned this pull request Aug 19, 2025
@cyphar cyphar force-pushed the reset-cpu-affinity branch from 1e8b14f to 61891c9 Compare August 19, 2025 07:07
@cyphar
Member Author

cyphar commented Aug 19, 2025

FYI, passing nil as done in MarSik@e6ce3af does not actually work (as I suspected). You need to provide a plausible-looking cpumask.

@cyphar cyphar force-pushed the reset-cpu-affinity branch 10 times, most recently from b0110a8 to a6dcf85 Compare August 19, 2025 17:26
@cyphar
Member Author

cyphar commented Aug 19, 2025

Strangely, the rootless --systemd-cgroup tests appear to indicate that the cpuset is not applied to the cgroup (and so the affinity is reset to all CPUs). Maybe I'm misconfiguring the rootless cgroup setup in the test, or it's some kind of race with systemd (though that seems unlikely), but it seems unrelated to this patch, so I've just worked around it.

# Wrapper for runc.
function runc() {
run __runc "$@"
# Wrapper around "run" that logs output to make tests easier to debug.
Contributor


In fact, bats has --verbose-run now, which kind of removes the need for such a helper. I haven't tried it though; it might not be that useful.

Member Author


I can take a look in a separate PR -- I added this wrapper back in 2016, given all the work since then I'm sure they've finally resolved this (fairly common) pain point by now.

@kolyshkin
Contributor

There's a kernel patch that would help tremendously: https://lore.kernel.org/lkml/20220922180041.1768141-1-longman@redhat.com. With it, an empty cpuset is treated as "set to all available CPUs". Alas, it's not merged (but I think it has been added to RHEL kernels).

It avoids the somewhat slow calculations on every exec, so the fix is way simpler (see #4041).

@cyphar
Member Author

cyphar commented Aug 21, 2025

@kolyshkin AFAICS, that patchset was merged in Linux 6.2 (torvalds/linux@bf57ae2) -- I linked the merged version of patch 3 in the description (torvalds/linux@da019032819a).

As for using the empty set instead, it looks to me that the idea of #4041 was to use task_user_cpus from patch 1 (torvalds/linux@8f9ea86fdf99b). However, this is only used by force_compatible_cpus_allowed_ptr and relax_compatible_cpus_allowed_ptr which are only used by arm64 to force a particular CPU affinity.

AFAICS, using an empty CPU mask will just give you -EINVAL from __set_cpus_allowed_ptr and won't change the affinity:

static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
					 struct affinity_context *ctx,
					 struct rq *rq,
					 struct rq_flags *rf)
{
	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
	const struct cpumask *cpu_valid_mask = cpu_active_mask;
	bool kthread = p->flags & PF_KTHREAD;
	unsigned int dest_cpu;
	int ret = 0;

	/* ... */

	if (!kthread && !cpumask_subset(ctx->new_mask, cpu_allowed_mask)) {
		ret = -EINVAL;
		goto out;
	}

	/* ... */
}

(I also tested this locally and it doesn't work either, even if you ignore -EINVAL.)


Since unix.CPUSet is actually just an array, we could forcefully set all of the values to 0xFFFF_FFFF_FFFF_FFFF (unfortunately the size is architecture-specific and might be 32-bit or 64-bit -- and because it uses an internal type we can't use ^0 or generics to make the compiler happy).

That being said, I really don't think this is going to be a performance problem. The overhead we're talking about is about 1µs (we could cache it with sync.OnceValue, but it only gets called once):

goos: linux
goarch: amd64
pkg: sched
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkCPUSet_Set1024
BenchmarkCPUSet_Set1024-16                        718982              1483 ns/op
BenchmarkCPUSet_Set1024-16                        711818              1433 ns/op
BenchmarkCPUSet_Set1024-16                        809596              1473 ns/op
BenchmarkCPUSet_Set1024-16                        798969              1507 ns/op
BenchmarkCPUSet_Set1024-16                        779511              1505 ns/op
BenchmarkCPUSet_Set1024_OnceValue
BenchmarkCPUSet_Set1024_OnceValue-16            174327886                6.811 ns/op
BenchmarkCPUSet_Set1024_OnceValue-16            168295525                7.161 ns/op
BenchmarkCPUSet_Set1024_OnceValue-16            166972314                7.128 ns/op
BenchmarkCPUSet_Set1024_OnceValue-16            170419455                7.075 ns/op
BenchmarkCPUSet_Set1024_OnceValue-16            174876843                6.901 ns/op
BenchmarkCPUSet_Manual0xFFFF
BenchmarkCPUSet_Manual0xFFFF-16                 277939233                4.388 ns/op
BenchmarkCPUSet_Manual0xFFFF-16                 265586181                4.411 ns/op
BenchmarkCPUSet_Manual0xFFFF-16                 275293765                4.323 ns/op
BenchmarkCPUSet_Manual0xFFFF-16                 272910668                4.408 ns/op
BenchmarkCPUSet_Manual0xFFFF-16                 272473363                4.272 ns/op
BenchmarkSetaffnity_1024
BenchmarkSetaffnity_1024-16                       463239              2256 ns/op
BenchmarkSetaffnity_1024-16                       457530              2353 ns/op
BenchmarkSetaffnity_1024-16                       459678              2337 ns/op
BenchmarkSetaffnity_1024-16                       469249              2338 ns/op
BenchmarkSetaffnity_1024-16                       460647              2272 ns/op
BenchmarkSetaffnity_Set1024_OnceValue
BenchmarkSetaffnity_Set1024_OnceValue-16         1522971               786.6 ns/op
BenchmarkSetaffnity_Set1024_OnceValue-16         1552382               773.9 ns/op
BenchmarkSetaffnity_Set1024_OnceValue-16         1553419               773.5 ns/op
BenchmarkSetaffnity_Set1024_OnceValue-16         1551712               781.7 ns/op
BenchmarkSetaffnity_Set1024_OnceValue-16         1552219               774.2 ns/op
BenchmarkSetaffnity_Manual0xFFFF
BenchmarkSetaffnity_Manual0xFFFF-16              1538379               776.8 ns/op
BenchmarkSetaffnity_Manual0xFFFF-16              1546282               778.9 ns/op
BenchmarkSetaffnity_Manual0xFFFF-16              1534801               779.8 ns/op
BenchmarkSetaffnity_Manual0xFFFF-16              1537387               776.9 ns/op
BenchmarkSetaffnity_Manual0xFFFF-16              1503338               790.1 ns/op
BenchmarkSetaffinity_Empty
BenchmarkSetaffinity_Empty-16                    1654030               724.1 ns/op
BenchmarkSetaffinity_Empty-16                    1583617               754.2 ns/op
BenchmarkSetaffinity_Empty-16                    1579324               756.2 ns/op
BenchmarkSetaffinity_Empty-16                    1597748               752.7 ns/op
BenchmarkSetaffinity_Empty-16                    1587476               750.4 ns/op
PASS
ok      sched   40.927s
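The sync.OnceValue caching mentioned above would look something like this (a sketch; fullCPUSet is an illustrative name and [16]uint64 stands in for unix.CPUSet on a 64-bit platform):

```go
package main

import (
	"fmt"
	"sync"
)

// fullCPUSet (hypothetical name) caches the all-ones mask behind
// sync.OnceValue, so only the first caller pays for the fill loop.
var fullCPUSet = sync.OnceValue(func() [16]uint64 {
	var set [16]uint64
	for i := range set {
		set[i] = ^uint64(0)
	}
	return set
})

func main() {
	fmt.Printf("%#x\n", fullCPUSet()[0]) // 0xffffffffffffffff
}
```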

@cyphar
Member Author

cyphar commented Aug 21, 2025

Another alternative which doesn't require per-architecture code and is 10x faster is:

cpuset := unix.CPUSet{}
for i := range bits.UintSize {
	cpuset.Set(i)
}
for i := 1; i < len(cpuset); i *= 2 {
	copy(cpuset[i:], cpuset[:i])
}

Or even better (if you don't mind hacky solutions):

cpuset := unix.CPUSet{}
for i := range cpuset {
	cpuset[i]-- // underflow to 0xFF..FF
}

(Naive is just copying cpuset[0] to each value.)

goos: linux
goarch: amd64
pkg: sched
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkCPUSet_Set1024
BenchmarkCPUSet_Set1024-16                        688843              1497 ns/op
BenchmarkCPUSet_Set1024-16                        710368              1517 ns/op
BenchmarkCPUSet_Set1024-16                        709746              1513 ns/op
BenchmarkCPUSet_Set1024-16                        706272              1538 ns/op
BenchmarkCPUSet_Set1024-16                        774168              1478 ns/op
BenchmarkCPUSet_Underflow
BenchmarkCPUSet_Underflow-16                    235444515                5.031 ns/op
BenchmarkCPUSet_Underflow-16                    228036662                5.161 ns/op
BenchmarkCPUSet_Underflow-16                    239377489                5.005 ns/op
BenchmarkCPUSet_Underflow-16                    233470130                5.057 ns/op
BenchmarkCPUSet_Underflow-16                    230048280                5.208 ns/op
BenchmarkCPUSet_SetUintsize_CopyNaive
BenchmarkCPUSet_SetUintsize_CopyNaive-16        11296006               103.5 ns/op
BenchmarkCPUSet_SetUintsize_CopyNaive-16        10889908               108.9 ns/op
BenchmarkCPUSet_SetUintsize_CopyNaive-16        11091403               104.1 ns/op
BenchmarkCPUSet_SetUintsize_CopyNaive-16        11666511               105.4 ns/op
BenchmarkCPUSet_SetUintsize_CopyNaive-16        11597973               103.6 ns/op
BenchmarkCPUSet_SetUintsize_CopyBinary
BenchmarkCPUSet_SetUintsize_CopyBinary-16       12549202                94.91 ns/op
BenchmarkCPUSet_SetUintsize_CopyBinary-16       12601789                94.91 ns/op
BenchmarkCPUSet_SetUintsize_CopyBinary-16       12675290                95.59 ns/op
BenchmarkCPUSet_SetUintsize_CopyBinary-16       11692279               100.9 ns/op
BenchmarkCPUSet_SetUintsize_CopyBinary-16       11431554               102.7 ns/op
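A quick sanity check that the three fill strategies (underflow, copy-doubling, naive copy) all produce the same all-ones mask (a standalone sketch; cpuSet stands in for unix.CPUSet and the function names are illustrative):

```go
package main

import (
	"fmt"
	"math/bits"
)

// cpuSet stands in for unix.CPUSet: 1024 bits held as machine words.
type cpuSet [1024 / bits.UintSize]uint

func fillUnderflow() cpuSet {
	var s cpuSet
	for i := range s {
		s[i]-- // unsigned 0 - 1 wraps around to all ones
	}
	return s
}

func fillCopyDoubling() cpuSet {
	var s cpuSet
	s[0] = ^uint(0) // equivalent of setting the first bits.UintSize CPUs
	for i := 1; i < len(s); i *= 2 {
		copy(s[i:], s[:i]) // double the filled prefix each pass
	}
	return s
}

func fillNaive() cpuSet {
	var s cpuSet
	s[0] = ^uint(0)
	for i := range s {
		s[i] = s[0] // copy the first word into every slot
	}
	return s
}

func main() {
	fmt.Println(fillUnderflow() == fillCopyDoubling(),
		fillCopyDoubling() == fillNaive()) // true true
}
```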

@cyphar
Member Author

cyphar commented Aug 21, 2025

I pushed the hacky underflow solution.

@cyphar cyphar force-pushed the reset-cpu-affinity branch 2 times, most recently from eda9b22 to c8e4019 Compare August 21, 2025 01:45
@cyphar cyphar requested a review from lifubang August 27, 2025 13:09
@cyphar cyphar added this to the 1.4.0-rc.1 milestone Aug 27, 2025
@cyphar cyphar force-pushed the reset-cpu-affinity branch from 857e938 to ae6d229 Compare August 27, 2025 16:58
cyphar added 3 commits August 28, 2025 08:23
"runc" was a special wrapper around bats's "run" which output some very
useful diagnostic information to the bats log, but this was not usable
for other commands. So let's make it a more generic helper that we can
use for other commands.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Sometimes we need to run runc through some wrapper (like nohup), but
because "__runc" and "runc" are bash functions in our test suite this
doesn't work trivially -- and you cannot just pass "$RUNC" because
you need to set --root for rootless tests.

So create a setup_runc_cmdline helper which sets $RUNC_CMDLINE to the
beginning cmdline used by __runc (and switch __runc to use that).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
In certain deployments, it's possible for runc to be spawned by a
process with a restrictive cpumask (such as from a systemd unit with
CPUAffinity=... configured) which will be inherited by runc and thus the
container process by default.

The cpuset cgroup used to reconfigure the cpumask automatically for
joining processes, but kcommit da019032819a ("sched: Enforce user
requested affinity") changed this behaviour in Linux 6.2.

The solution is to try to emulate the expected behaviour by resetting
our cpumask to correspond with the configured cpuset (in the case of
"runc exec", if the user did not configure an alternative one). Normally
we would have to parse /proc/stat and /sys/fs/cgroup, but luckily
sched_setaffinity(2) will transparently convert an all-set cpumask (even
if it has more entries than the number of CPUs on the system) to the
correct value for our use case.

For some reason, in our CI it seems that rootless --systemd-cgroup
results in the cpuset (presumably temporarily?) being configured such
that sched_setaffinity(2) will allow the full set of CPUs. For this
particular case, all we care about is that it is different to the
original set, so include some special-casing (but we should probably
investigate this further...).

Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn>
Reported-by: Martin Sivak <msivak@redhat.com>
Reported-by: Peter Hunt <pehunt@redhat.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
@cyphar cyphar force-pushed the reset-cpu-affinity branch from ae6d229 to 121192a Compare August 27, 2025 22:25
@cyphar cyphar added the backport/1.2-todo A PR in main branch which needs to be backported to release-1.2 label Aug 27, 2025
Contributor

@kolyshkin kolyshkin left a comment


LGTM

@cyphar cyphar merged commit cc8ab60 into opencontainers:main Aug 28, 2025
36 checks passed
@cyphar cyphar deleted the reset-cpu-affinity branch August 28, 2025 00:54
@kolyshkin kolyshkin added backport/1.2-done A PR in main branch which has been backported to release-1.2 backport/1.3-done A PR in main branch which has been backported to release-1.3 and removed backport/1.2-todo A PR in main branch which needs to be backported to release-1.2 backport/1.3-todo A PR in main branch which needs to be backported to release-1.3 labels Aug 28, 2025
@brandond
Contributor

@cyphar any idea how far out the next 1.3 release is?

@cyphar
Member Author

cyphar commented Aug 28, 2025

We can probably cut a release now, but I also want to get 1.4.0-rc1 ready soon. There's also at least one other PR for 1.3.1.

@cwayne18

cwayne18 commented Sep 2, 2025

Is there also a plan to backport this for a 1.2.7 release?

@cyphar
Member Author

cyphar commented Sep 3, 2025

@cwayne18 It was already backported to the 1.2.x branch in #4869.

@brandond
Contributor

brandond commented Sep 3, 2025

Awesome! Can we expect to see releases from these branches within the next day or so?

@cyphar
Member Author

cyphar commented Sep 3, 2025

I will be sending release PRs this week.

@e-minguez

@cwayne18 @brandond #4878
