parca-agent triggers kernel bug because it calls bpf_probe_read_user() in the perf_event IRQ #1675
Comments
@luisgerhorst Thanks for reporting 👍 Let us discuss our options, and we will update here.
Thanks for the detailed bug report! I agree with you that this issue lies in the kernel. BPF execution should always be safe, so it should never lead to kernel panics / oops. We can't stop using Found a recent patch (https://lore.kernel.org/bpf/202301190848.D0543F7CE@keescook/T/#mf4a2a97bb0a4cdc13eff7a1f8f5d25ea594263c2)
Let us know if you can give it a try!
Will leave this issue to track backports of the fix / bugs opened in different distros:
Releases >=5.19 && <6.1 have a pretty bad kernel bug that can result in whole-system lockups that can only be fixed with a reboot (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398). The fix got backported to -stable (https://www.spinics.net/lists/stable/msg662452.html and https://www.spinics.net/lists/stable/msg662218.html, for 6.1 and 6.3 respectively).

Let's not run the Agent on these kernels, but provide a flag to bypass this check. Note that running a buggy kernel can result in your machine going down.

Related issue: #1675

Test Plan
=========
Tested locally + added unit tests
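A minimal sketch of such a version gate, for illustration only. The bounds are copied from the comment above as written; other reports in this thread mention 6.1 and 6.2 kernels, so the real affected range may differ and depends on which backports a given distribution kernel carries. The names `should_run`, `parse_release`, and the `bypass` parameter are hypothetical and are not the Agent's actual API.

```python
import re
from typing import Tuple

# Assumption: bounds taken verbatim from the comment above (>=5.19 && <6.1).
AFFECTED_MIN = (5, 19)  # inclusive
AFFECTED_MAX = (6, 1)   # exclusive


def parse_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.2.0-1009-aws'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def should_run(release: str, bypass: bool = False) -> bool:
    """Return False for releases inside the assumed affected range.

    `bypass` models the opt-out flag mentioned above (hypothetical name).
    """
    if bypass:
        return True
    return not (AFFECTED_MIN <= parse_release(release) < AFFECTED_MAX)
```

On a live system, the release string could come from `platform.release()`. Comparing only major and minor numbers cannot tell a patched -stable build from an unpatched one, which is one reason a bypass flag is useful.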
I am pretty sure we hit the same bug. I could not get console output from AWS, but we run c6g.8xlarge instances in a test and get a machine freeze every 10-30 minutes. Kernel: 6.2.0-1009-aws
Closing as now we have a check to not run on these kernels by default. Thanks a lot!
Describe the bug
Unfortunately, on some systems parca-agent seems to trigger a rare upstream kernel BUG because it calls bpf_probe_read_user() inside the perf_event IRQ. This is because bpf_probe_read_user() will call copy_from_user_nofault > access_ok > ... > find_vmap_area with some kernel configs (i.e., CONFIG_HARDENED_USERCOPY), which will attempt to acquire vmap_area_lock. If the interrupt occurs while the lock is held (e.g., during alloc_vmap_area() in the clone() syscall), find_vmap_area() will never return. This causes the lock held by clone() to never be released, and any other CPU attempting to acquire it is locked up in an infinite loop. Ultimately, this happens on all CPUs and the whole machine is locked up.
To Reproduce
Start a machine using the affected upstream kernel code (tested with v6.1, but I believe the bug is also present in most other kernels). To reproduce it, you can, for example, use an AWS EC2 c6a.large (64 vCPUs) instance with the AMI al2023-ami-2023.0.20230503.0-kernel-6.1-x86_64. Having more CPUs allows the bug to be triggered more quickly.
To trigger the bug quickly, execute some code that will also use vmap_area_lock. For example, the clone() syscall:
Within 10 minutes, the CPU soft lockup messages should appear on the serial console.
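As a rough sketch only (this is not the reporter's original snippet), a process-creation workload of the kind described could look like the following. It assumes Linux, where `os.fork()` is typically implemented via the clone() syscall; the function name `spawn_children` is made up for this example.

```python
import os


def spawn_children(count: int) -> int:
    """Fork `count` short-lived children in sequence.

    Returns the number of children that exited with status 0.
    """
    clean_exits = 0
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            # Child: exit immediately without running cleanup handlers.
            os._exit(0)
        # Parent: reap the child so no zombies accumulate.
        _, status = os.waitpid(pid, 0)
        if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
            clean_exits += 1
    return clean_exits
```

To keep many CPUs busy, one such loop per CPU could be started with a large `count`. Only do this on a disposable test machine, since the report above says the affected kernels can lock up completely.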
Expected behavior
The machine is not locked up. BPF should not be able to lock up the machine, but because of the kernel bug, this happens anyway.
Logs
Here's an annotated log from the serial console. Other traces are also printed (from the other CPUs attempting to acquire the lock), however, this is the root cause I believe:
Software (please complete the following information):
Workload (please complete the following information):
Environment (please complete the following information):
6.1.25-37.47.amzn2023.x86_64
x86_64
Additional context
I believe this is neither a bug in Amazon Linux nor in Parca, but an upstream kernel bug. I have not reported it upstream yet (you are free to do it yourself, it would be great if you CC gerhorst@amazon.de and linux-kernel@luisgerhorst.de if you do). I was not able to find an existing report on LKML. I am reporting this here because parca-agent is affected and you will likely want to change your BPF program even if the bug is fixed upstream (as it will take time for the fix to propagate).
The best fix for you is likely to stop using the BPF helper for now. Maybe you can also detect the specific conditions that trigger the bug and only avoid calling the helper when these are present.
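One condition named above is CONFIG_HARDENED_USERCOPY. A minimal sketch of detecting it from user space follows, assuming the kernel config is readable at one of the conventional locations (`/boot/config-<release>` or `/proc/config.gz`); not every system exposes either file. The helper names are hypothetical.

```python
import gzip
import os
import platform
from typing import Optional


def config_enabled(config_text: str, option: str) -> bool:
    """Return True if `option` is built in (set to 'y') in kernel config text."""
    wanted = f"{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


def read_kernel_config() -> Optional[str]:
    """Return the running kernel's config text, or None if it is unavailable."""
    candidates = [f"/boot/config-{platform.release()}", "/proc/config.gz"]
    for path in candidates:
        if not os.path.exists(path):
            continue
        opener = gzip.open if path.endswith(".gz") else open
        try:
            with opener(path, "rt") as handle:
                return handle.read()
        except OSError:
            continue  # unreadable; try the next candidate
    return None
```

If `read_kernel_config()` returns `None`, the condition cannot be ruled out, so a cautious caller would treat the kernel as potentially affected.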
To fix the kernel bug, it may be possible to disable IRQs during alloc_vmap_area() and similar, or to make access_ok() IRQ-safe.
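The deadlock and the IRQ-disabling idea can be illustrated with a user-space analogy. This is a sketch, not kernel code: a POSIX signal stands in for the perf_event interrupt, a non-reentrant `threading.Lock` stands in for vmap_area_lock, and `signal.pthread_sigmask` stands in for disabling IRQs. Unlike a kernel spinlock, the handler here uses a timeout so the demonstration terminates instead of hanging. It assumes a Unix system, Python 3.8+, and that it runs in the main thread.

```python
import signal
import threading
import time


def handler_gets_lock(mask_signal: bool) -> bool:
    """Report whether a signal handler can take a lock its own thread holds.

    With mask_signal=False the handler interrupts the lock holder and cannot
    acquire the lock. With mask_signal=True delivery is deferred until the
    lock has been released, so the handler succeeds.
    """
    lock = threading.Lock()  # non-reentrant, like a spinlock
    outcome = []

    def handler(signum, frame):
        # Runs on the same thread that may currently hold `lock`.
        got_it = lock.acquire(timeout=0.2)  # a real spinlock would spin forever
        outcome.append(got_it)
        if got_it:
            lock.release()

    previous = signal.signal(signal.SIGUSR1, handler)
    try:
        if mask_signal:
            # Analogous to disabling IRQs around the critical section.
            signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
        with lock:
            signal.raise_signal(signal.SIGUSR1)  # "interrupt" while lock is held
            time.sleep(0.05)
        if mask_signal:
            # Pending signal is delivered now, after the lock was released.
            signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
        time.sleep(0.05)
    finally:
        signal.signal(
            signal.SIGUSR1,
            previous if previous is not None else signal.SIG_DFL,
        )
    return outcome[0]
```

`handler_gets_lock(False)` models the reported bug, where the handler can never obtain the lock because it has interrupted the holder. `handler_gets_lock(True)` models the suggested fix of keeping the interrupt out of the critical section.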