aws: kola podman.base tests failing with kernel warning percpu_ref_switch_to_atomic_rcu+0x12f/0x140 #507
This happens on shutdown. Kernel is complaining about systemd-manager and cgroup-bpf:
The full warning is:
The kernel stacktrace seems to be going through some Xen-specific callsites, so it sounds legit to me that the CI only hits this on AWS but not on other platforms. /cc @davdunc
Forwarded to Fedora bugzilla at https://bugzilla.redhat.com/show_bug.cgi?id=1843546.
Note also the reason these Xen calls are here is because kola defaults to an older, Xen-based instance type (m4) on AWS. Hopefully most people using FCOS are on newer instance types.
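As a quick sanity check, here is one way to confirm which hypervisor a given instance is running under (a minimal sketch, assuming shell access to the instance; the exact dmesg wording can vary by kernel):

```bash
# Confirm which hypervisor the instance runs on. Older generations such
# as m4 are Xen-based; newer ones run on the Nitro (KVM-based) hypervisor,
# so they would not exercise the Xen-specific code paths in this trace.
systemd-detect-virt                          # prints "xen" on Xen-based instances
sudo dmesg | grep -i 'hypervisor detected'   # e.g. "Hypervisor detected: Xen HVM"
```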
m4 is old and shouldn't be the default; this came up as part of coreos/fedora-coreos-tracker#507
I went back through our CI tests to find the first build where this started failing. The diff in that update appears to be a kernel update (5.6.13 to 5.6.14 in fc32), so the kernel change looks like the likely trigger. Added an update to the BZ with this info: https://bugzilla.redhat.com/show_bug.cgi?id=1843546#c1
I was testing a switch to a newer instance type (away from the old m4 default).
Another possibly related instance of this is:
Though according to Dusty, that doesn't happen on
Doing a web search turns up a few possible causes; e.g. this commit for io_uring. But I don't think anything in default FCOS uses io_uring yet (though it'd be interesting to find out). Also this bug. And something interesting about that bug is it mentions NVMe, which we also use in AWS, but not in other cloud platforms AFAIK (e.g. GCP uses virtio).
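As a small check of the storage-stack angle (a sketch, assuming you can log into the instance; not something from the original thread):

```bash
# See what kind of block devices the guest is given: Nitro-based AWS
# instances expose EBS volumes as NVMe (/dev/nvme*), Xen-based instances
# use /dev/xvd*, and GCP typically presents virtio/SCSI devices.
lsblk -d -o NAME,TRAN,MODEL
ls /dev/nvme* /dev/xvd* 2>/dev/null
```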
It seems like this message is the one we're getting now from our CI runs (still using m4 instance type).
Actually this stack trace is related to the code in the kernel that manages BPF programs attached to cgroups. The stack trace references cgroup_bpf_release_fn() from kernel/bpf/cgroup.c. I'm still looking at what changed between 5.6.13 and 5.6.14 in fc32.
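One way to narrow that search (a sketch, assuming a local clone of the linux-stable tree; the fc32 kernel also carries Fedora patches, so this only covers the upstream stable delta):

```bash
# Clone the stable tree and list everything that touched the BPF/cgroup
# and percpu-refcount code between the two releases in question.
git clone --branch linux-5.6.y \
    https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git log --oneline v5.6.13..v5.6.14 -- \
    kernel/bpf/ kernel/cgroup/ include/linux/percpu-refcount.h lib/percpu-refcount.c
```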
Before I try reverting some patches in 5.6.14, I need a way to reliably reproduce this issue. On my server, this happens during system shutdown, but it doesn't occur reliably. Since June 20th, I have 5 of these errors in my system log files:

percpu ref (cgroup_bpf_release_fn) <= 0 (-1) after switching to atomic

My server has a fixed low-volume workload running 10 containers that are started with podman by systemd. Last evening, the error happened on reboot when the server had been up for only 30 minutes. I let it run overnight, rebooted it this morning, and there was no error on restart.

I'm not that familiar with this part of the kernel, but I believe the way to trigger this is with a BPF program that attaches using any of the BPF_CGROUP_* attachment types listed here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/bpf.h#n194. Perhaps this is triggered by podman or crun polling container networking stats through BPF? I am able to run arbitrary programs (like execsnoop) from the bcc-tools package on FCOS inside a container with these commands:
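(The exact commands did not survive in this copy of the thread; the following is only a sketch of one common way to do it, with the container image, mounts, and package names as assumptions.)

```bash
# Run bcc-tools from a privileged Fedora container on the FCOS host.
# bcc needs access to the host's kernel modules and debugfs; depending
# on the kernel, a matching kernel-devel inside the container may also
# be required for bcc to compile its probes.
podman run -it --rm --privileged --pid=host \
    -v /lib/modules:/lib/modules:ro \
    -v /sys/kernel/debug:/sys/kernel/debug \
    registry.fedoraproject.org/fedora:32 \
    bash -c 'dnf install -y bcc-tools && /usr/share/bcc/tools/execsnoop'
```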
@masneyb I want to emphasize - thank you so much for diving in and looking at this. As far as BPF + cgroups, I think it's much more likely to be systemd doing this by default for its services. See e.g. systemd's per-unit IP accounting/filtering options (IPAccounting=, IPAddressAllow=, IPAddressDeny=), which are implemented with BPF programs attached to the unit's cgroup.
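One way to see this on a running host (a sketch; it requires root and the bpftool package, and the unit name below is just an example):

```bash
# List BPF programs attached to cgroups; systemd's per-unit ingress and
# egress programs show up under the unit's cgroup path.
sudo bpftool cgroup tree

# Check whether a given unit has the BPF-backed IP options enabled.
systemctl show -p IPAccounting -p IPAddressAllow -p IPAddressDeny systemd-logind.service
```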
Two days ago I tried to reproduce this on my fcos box with the failed test case listed at https://github.com/coreos/coreos-assembler/blob/master/mantle/kola/tests/podman/podman.go#L249-L284, since that's what's failing CI. I still couldn't reliably reproduce the crash, even after rebooting after running that failed test. (Disclaimer: I copied and pasted those commands into a bash script and didn't run the Go test suite.)

@cgwalters: I verified that systemd uses BPF_CGROUP_INET_EGRESS and BPF_CGROUP_INET_INGRESS in https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c. If someone is able to come up with a reliable way to trigger this error, then I'll be happy to dig into this further.
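For what it's worth, here is one idea for hammering on the same code path (a sketch only, not a confirmed reproducer; the unit names and iteration count are arbitrary). Starting and stopping transient units with the BPF-backed IP options forces systemd to attach and later release cgroup BPF programs repeatedly:

```bash
# Each transient unit gets BPF_CGROUP_INET_INGRESS/EGRESS programs
# attached to its cgroup (via systemd's bpf-firewall code); when the
# unit exits, the cgroup and its BPF programs are released.
for i in $(seq 1 50); do
    sudo systemd-run --unit="bpf-repro-$i" --wait \
        -p IPAccounting=yes -p IPAddressDeny=any true
done
```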
It's possible that this issue is fixed by the upstream commit torvalds/linux@94886c8.
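To check whether that fix has landed in the kernel a machine is running (a sketch, run inside an upstream clone; use the full commit hash if the short one turns out to be ambiguous):

```bash
# Which upstream releases already contain the suspected fix?
git tag --contains 94886c8 | head

# Compare against the kernel the machine is actually running.
uname -r
```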
We're still seeing a kernel warning in the AWS CI runs, though it looks like a different one this time. Here is the full stack trace:
@masneyb should we open a new bug for this new warning?
@dustymabe : That's a separate issue in the suspend/resume path and will need a separate BZ. That's good to hear the cgroup_bpf_release_fn() error no longer occurs at AWS. My home server is on the stable channel running 5.7.10-201.fc32.x86_64 and I still see the cgroup_bpf_release_fn() error come through periodically. The last time was six days ago. I'm pretty busy at work the next two weeks but after that I'll try the new kernel there to see if the cgroup_bpf_release_fn() BZ can be closed. |
Opened #606 and BZ#1870209 for the new kernel issue. Assuming we don't see the percpu_ref_switch_to_atomic_rcu+0x12f/0x140 issue any more, I'll close this bug out.
Not seeing this any longer. |
This started failing in the past few days. For example, from the 32.20200528.20.1 build:

The same tests pass on QEMU and GCP. We'll need to dig in to see what the issue is.
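For reference, the failing tests can be run through kola against AWS along these lines (a sketch only; the exact flags, AMI, and credentials used by the FCOS CI pipeline aren't shown in this issue):

```bash
# Run just the podman tests against AWS. The AMI, region, and credential
# flags are omitted here and depend on the local setup.
kola run -p aws podman.base
```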