@cyphar
Member

@cyphar cyphar commented Jul 31, 2025

openSUSE has an unfortunate default udev setup which forcefully sets all
loop devices to use the "none" scheduler, even if you manually set it.
As this is a property of the host configuration (and udev is monitoring
from the host) we cannot really change this behaviour from inside our
test container.

So we should just skip the test in this (hopefully unusual) case.
Ideally tools running the test suite should disable this behaviour when
running our test suite.

Fixes #4781
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
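To illustrate the kind of check the description implies, here is a minimal sketch (not the actual patch). It relies on the kernel marking the active scheduler with square brackets in a block device's `queue/scheduler` sysfs file. The function names `active_scheduler` and `uses_bfq` are hypothetical and do not come from the runc test suite.

```shell
# Hypothetical helper: print the active I/O scheduler from a sysfs
# "queue/scheduler" file, e.g. "mq-deadline kyber [bfq] none" -> "bfq".
active_scheduler() {
	local file="$1"
	# The kernel wraps the currently active scheduler in square brackets.
	sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$file"
}

# Hypothetical check: succeed only if the device is really using bfq.
uses_bfq() {
	[ "$(active_scheduler "$1")" = "bfq" ]
}
```

In a bats test this could be used along the lines of `uses_bfq /sys/block/loop0/queue/scheduler || skip "scheduler was reset by the host"`, where `skip` is the bats built-in and the device path is only an example.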

@cyphar cyphar force-pushed the test-bfq-policy branch 2 times, most recently from 46218c8 to 357318f Compare July 31, 2025 06:13
@ricardobranco777

ricardobranco777 commented Jul 31, 2025

This patch seems to work on x86_64 for Tumbleweed but on aarch64 I'm still seeing this:

https://openqa.opensuse.org/tests/5209568/file/runc-runc-root.tap

# runc run -d --console-socket /tmp/bats-run-jbQ2hk/runc.7sPUsq/tty/sock test_dev_weight (status=1):
# time="2025-07-31T03:51:17-04:00" level=error msg="runc run failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: setting device weight \"7:0 444\": write /sys/fs/cgroup/machine.slice/runc-cgroups-integration-test-19843.scope/io.bfq.weight: operation not supported"

On SLES 16.0 I still see it on both arches.

https://openqa.suse.de/tests/18613516/file/runc-runc-root.tap

# runc run -d --console-socket /tmp/bats-run-Hso6jF/runc.E25m9g/tty/sock test_dev_weight (status=1):
# time="2025-07-31T09:53:27+02:00" level=error msg="runc run failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: setting device weight \"7:0 444\": write /sys/fs/cgroup/machine.slice/runc-cgroups-integration-test-24605.scope/io.bfq.weight: operation not supported"
# --- teardown ---
# losetup -d '/dev/loop0'

@cyphar
Member Author

cyphar commented Jul 31, 2025

@ricardobranco777 Did you apply both patches? Unfortunately, there could be a race with udev where it sets the scheduler back after we checked it. Not sure if there is a better solution than modifying the host config, to be honest...

On my Tumbleweed machine, I haven't managed to hit that race yet though...

@ricardobranco777

> @ricardobranco777 Did you apply both patches?

Yes.

> Unfortunately, there could be a race with udev where it sets the scheduler back after we checked it. Not sure if there is a better solution than modifying the host config, to be honest...

Ok. I'll look into that instead. Thanks!

@cyphar
Member Author

cyphar commented Jul 31, 2025

Does it fail consistently even with the patches applied? The patch should just cause the problematic test to get skipped if udev is silently changing the scheduler...

If you have actual access to the OpenQA box, there is a bpftrace script from the issue that will tell us who is changing the scheduler and when.

I can check the qcows myself later if I have some time.

@ricardobranco777

> Does it fail consistently even with the patches applied?

No.

> If you have actual access to the OpenQA box, there is a bpftrace script from the issue that will tell us who is changing the scheduler and when.
>
> I can check the qcows myself later if I have some time.

I'm applying the patch to SLES 15-SP4+ & Tumbleweed here:

os-autoinst/os-autoinst-distri-opensuse#22825

@cyphar
Member Author

cyphar commented Aug 2, 2025

@ricardobranco777 Can you try it again with the sleep 2s version of the patch?

cyphar added 2 commits August 2, 2025 20:01
If an error occurs during a test which sets up loopback devices, the
loopback device is not freed. Since most systems have very conservative
limits on the number of loopback devices, re-running a failing test
locally to debug it often ends up erroring out due to loopback device
exhaustion.

So let's just move the "losetup -d" to teardown, where it belongs.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
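As a rough sketch of the cleanup pattern this commit message describes (not the actual change), the detach step can live in the bats `teardown` function, which runs after every test whether it passed or failed. The variable name `LOOPBACK_DEV` is a hypothetical placeholder for whatever the test uses to remember the attached device.

```shell
# Hypothetical variable name; assumed to be set by the test once the loop
# device has been attached (for example with "losetup -f --show <file>").
LOOPBACK_DEV=""

teardown() {
	# bats calls teardown after each test, including failed ones, so the
	# loop device is released even when an assertion aborts the test early.
	if [ -n "$LOOPBACK_DEV" ]; then
		losetup -d "$LOOPBACK_DEV"
		LOOPBACK_DEV=""
	fi
}
```

Guarding on a non-empty variable keeps `teardown` safe for tests that never attached a device.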

openSUSE has an unfortunate default udev setup which forcefully sets all
loop devices to use the "none" scheduler, even if you manually set it.
As this is a property of the host configuration (and udev is monitoring
from the host) we cannot really change this behaviour from inside our
test container.

So we should just skip the test in this (hopefully unusual) case.
Ideally tools running the test suite should disable this behaviour when
running our test suite.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
@ricardobranco777

ricardobranco777 commented Aug 2, 2025

> @ricardobranco777 Can you try it again with the sleep 2s version of the patch?

Sure. I just cloned openQA jobs and the links are available in this PR description.

os-autoinst/os-autoinst-distri-opensuse#22825

@ricardobranco777

> @ricardobranco777 Can you try it again with the sleep 2s version of the patch?

It works. Now I can unignore cgroups.bats in our tests. Thanks!

@cyphar
Member Author

cyphar commented Aug 4, 2025

/ping @kolyshkin

@ricardobranco777 says the sleep 2s approach you suggested fixes the issue, so this should be good to merge.
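For readers following along, the "sleep, then re-check" idea can be sketched roughly as below. This is an illustration under assumptions, not the merged patch: the helper name `scheduler_survives` and its argument order are hypothetical, and the default delay of 2 seconds simply mirrors the value mentioned in this thread.

```shell
# Hypothetical helper: give udev time to react to the "change" event, then
# confirm that the scheduler we asked for is still the active one.
# Arguments: <scheduler file> <wanted scheduler> [delay in seconds, default 2]
scheduler_survives() {
	local file="$1" want="$2" delay="${3:-2}"
	local active
	sleep "$delay"
	# The active scheduler is the bracketed entry in the sysfs file.
	active="$(sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$file")"
	[ "$active" = "$want" ]
}

# Possible use in a test, after writing "bfq" to the scheduler file:
#   scheduler_survives /sys/block/loop0/queue/scheduler bfq ||
#       skip "scheduler was changed behind our back"
```

A fixed delay cannot rule out the race entirely; it only makes it much less likely that udev resets the scheduler after the check.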

@ricardobranco777

Successfully tested on:

  • s390x for SLES only (runc 1.2.6)
  • ppc64le for SLES & Tumbleweed (runc 1.3.0)

Contributor

@kolyshkin kolyshkin left a comment

LGTM (nit: second commit description doesn't mention sleep)

@kolyshkin kolyshkin enabled auto-merge August 4, 2025 23:34
@kolyshkin
Contributor

@rata @lifubang PTAL

Member

@rata rata left a comment

Thanks for tackling this @kolyshkin @cyphar !

This LGTM, but I left a comment with a suggestion that I think would be slightly better. If you don't agree, feel free to ignore it and merge :)

# usually triggered by the "change" event from losetup, we can wait for a
# little bit before continuing the test. For more details, see
# <https://github.com/opencontainers/runc/issues/4781>.
sleep 2s
Member

I'm okay with this, though I wonder whether it would be better to sleep only when the distro is SUSE.

That way we don't affect any other platform (IIRC GitHub Actions is not running SUSE at all), and we would notice if any other platform has this behavior and could then decide to skip it on that platform too.
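One way such a distro check could look is sketched below. It is only an illustration of the suggestion above, not code from this PR or its follow-up: the helper name `is_suse` is hypothetical, and it assumes that SUSE-family systems mention "suse" in the `ID` or `ID_LIKE` field of `/etc/os-release`.

```shell
# Hypothetical helper: succeed if the os-release file describes a SUSE-family
# distribution. The path argument defaults to /etc/os-release.
is_suse() {
	local file="${1:-/etc/os-release}"
	(
		# Source in a subshell so ID and ID_LIKE do not leak into the caller.
		ID="" ID_LIKE=""
		. "$file" 2>/dev/null || exit 1
		case " $ID $ID_LIKE " in
		*suse*) exit 0 ;;
		*) exit 1 ;;
		esac
	)
}

# Possible use: only pay for the delay where the udev behaviour is expected.
#   if is_suse; then sleep 2; fi
```

The trade-off is the one raised above: a conditional sleep keeps other platforms unaffected, but the test would then fail rather than silently wait if another distro starts resetting the scheduler.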

@kolyshkin kolyshkin merged commit 67112aa into opencontainers:main Aug 5, 2025
31 checks passed
@rata
Member

rata commented Aug 5, 2025

Oh, auto-merge was enabled :-D

@rata rata mentioned this pull request Aug 5, 2025
@rata
Member

rata commented Aug 5, 2025

Created #4838

@cyphar cyphar deleted the test-bfq-policy branch August 5, 2025 16:29
@kolyshkin kolyshkin added the backport/1.3-done A PR in main branch which has been backported to release-1.3 label Oct 14, 2025
Development

Successfully merging this pull request may close these issues.

cgroups test fails with "io.bfq.weight: operation not supported"