Conversation

@ningmingxiao
Contributor

@ningmingxiao ningmingxiao commented Jul 16, 2025

Old kernels reset a task's CPU affinity automatically on a cgroup move, but new kernels remember the affinity that was set before the cgroup move, due to

https://lore.kernel.org/lkml/20220922180041.1768141-1-longman@redhat.com

This is undesirable for containers, because they inherit the systemd affinity when they should really move to the container's cpuset CPUs.

ref: #4041
The patch https://lore.kernel.org/lkml/20231003205735.2921964-1-longman@redhat.com/ was not merged into the upstream kernel, but was merged by Red Hat; I want to find another way to fix this.

@AkihiroSuda
Member

What was the issue?
How can this PR be tested?

Member

@rata rata left a comment

@AkihiroSuda +1, it would be great if we can have some test for this.

@haircommander
Contributor

hmm I wonder if we can gather the cpuset from the container spec rather than inheriting what runc is being called with...

@ningmingxiao
Contributor Author

I tried to get it from the cpuset, but the default is empty.

@ningmingxiao
Contributor Author

ningmingxiao commented Jul 17, 2025

I added an if; it only happens on Linux.

if runtime.GOOS == "linux" {
}

@ningmingxiao ningmingxiao changed the title from "fix:cpu cpu affinity" to "fix: cpu affinity" on Jul 17, 2025
@ningmingxiao ningmingxiao force-pushed the fix_annify branch 2 times, most recently from cab02c9 to d72f7b9 on July 17, 2025 04:41
Member

@rata rata left a comment

Thanks for the PR! I left several comments, but I guess we also want tests in tests/integration :)

break
}
line := string(data)
if line[:3] != "cpu" {
Member

I think the comment that all cpu* lines are at the beginning belongs here.

Member

Why did you mark this as solved? Am I missing something?

if line[:3] != "cpu" {
break
}
if '0' <= line[3] && line[3] <= '9' {
Member

Here we need a comment explaining that there is a plain "cpu" line that we should ignore, and that we only count "cpu" lines followed by a number.

Member

idem

Comment on lines 257 to 269
if err := setAffinityAll(p.pid()); err != nil {
return err
}
// Set final CPU affinity right after the process is moved into container's cgroup.
if err := p.setFinalCPUAffinity(); err != nil {
Member

Sorry if I'm missing something, but why do we need to set the affinity to all CPUs and then to what is in the config? I don't see from the link in the description why this step is needed.

Member

Why was this marked as solved? Did I miss something?

@ningmingxiao ningmingxiao force-pushed the fix_annify branch 2 times, most recently from ec7d0c0 to 2904639 on July 17, 2025 13:04
@rata
Member

rata commented Jul 17, 2025

@ningmingxiao ping when this is ready for another round of reviews (and the tests are green, hopefully :))

@ningmingxiao ningmingxiao force-pushed the fix_annify branch 3 times, most recently from be17d50 to 81e87d8 on July 18, 2025 02:46
cpuset := unix.CPUSet{}
for i := 0; i < int(cpus); i++ {
cpuset.Set(i)
}
Member

Was this tested with nested containers?
cpus here can be different from the number of available CPUs.

Member

Indeed. I think the code in this PR will only work if you use lxcfs with nested containers (which nobody does with runc).

I suspect that we would instead need to parse /proc/self/cgroup and then look at the CPU set in /sys/fs/cgroup/cpuset.cpus.effective (but we would also need to check any parent cgroups if cpuset is not in cgroup.controllers).

Member

Alternatively, would just passing nil as described in MarSik@e6ce3af just work?

I believe this is trying to take advantage of the EINVAL error fallback of __sched_setaffinity (which does reset the affinity back to the cpuset if there is no overlap between the cpuset and the requested affinity), but I'm not sure it actually works. My reading of __set_cpus_allowed_ptr gives me the impression that this shouldn't work, but the linked commit claims it resolves this issue?

Member

@cyphar cyphar Aug 18, 2025

Oh actually, sched_setaffinity silently clamps the cpuset you give based on the cpuset for the task. So this is fine.

However, I would like to know if nil works just as well -- less code is better.

In fact, it might be even simpler to just generate a set of 8192 CPUs and get the kernel to clamp it for us? The kernel automatically clamps the size of the cpumask to nr_cpu_ids internally, so even if you pass a really large mask it will happily ignore the extra bits.

EDIT: Testing this, it seems golang.org/x/sys/unix will silently truncate the cpuset to 1024 CPUs. They have a hardcoded limit of _CPU_SETSIZE.


- name: integration test (systemd driver)
run: |
sudo taskset -pc 0-1 1
Member

Should be moved to the bats script

@rata
Member

rata commented Jul 21, 2025

@ningmingxiao if you can ping here when it's ready for review it would be great. Marking conversations as solved in github doesn't send any notification. And it's hard to know when it's ready for review again if you don't say anything or request another review in github.

@ningmingxiao
Contributor Author

ningmingxiao commented Jul 21, 2025

OK! It can be reviewed now, thanks @rata @AkihiroSuda

Member

@rata rata left a comment

Thanks, left a few more comments. But there are several open comments already, please have a look :)

return nil
}
if err := unix.SchedSetaffinity(p.pid(), aff.Final); err != nil {
if err := unix.SchedSetaffinity(p.pid(), p.config.CPUAffinity.Final); err != nil {
Member

Why is the nil check not needed anymore?

@ningmingxiao ningmingxiao force-pushed the fix_annify branch 3 times, most recently from 3542794 to 5b2dff7 on August 15, 2025 07:25
Signed-off-by: ningmingxiao <ning.mingxiao@zte.com.cn>
@ningmingxiao
Contributor Author

ping @rata

@cyphar
Member

cyphar commented Aug 19, 2025

After looking at this, I realised that there is a much simpler way of doing this -- I've carried a version in #4858.

@ningmingxiao Can you double-check that the patch I've posted resolves your issue as well?

@wwcd

wwcd commented Aug 27, 2025

> After looking at this, I realised that there is a much simpler way of doing this -- I've carried a version in #4858.
>
> @ningmingxiao Can you double-check that the patch I've posted resolves your issue as well?

It works well.
