
parca-agent triggers kernel bug because it calls bpf_probe_read_user() in the perf_event IRQ #1675

Closed
luisgerhorst opened this issue May 19, 2023 · 5 comments
Labels: area/eBPF Something involving eBPF

luisgerhorst commented May 19, 2023

Describe the bug

Unfortunately, on some systems parca-agent seems to trigger a rare upstream kernel BUG because it calls bpf_probe_read_user() inside the perf_event IRQ. With some kernel configs (i.e., CONFIG_HARDENED_USERCOPY), bpf_probe_read_user() calls copy_from_user_nofault() > access_ok() > ... > find_vmap_area(), which attempts to acquire vmap_area_lock. If the interrupt occurred while the lock was already held (e.g., during alloc_vmap_area() in the clone() syscall), find_vmap_area() never returns. As a result, the lock held by clone() is never released, and every other CPU that tries to acquire it spins forever. Eventually this happens on all CPUs and the whole machine locks up.
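For reference, the pattern that triggers this looks roughly like the following minimal libbpf-style sketch. This is not parca-agent's actual profiler; the single stack-word read is just illustrative (the program name profile_cpu is taken from the trace below, the x86_64 target is an assumption):

// Minimal sketch only (NOT parca-agent's real unwinder): a sampling
// perf_event program that calls bpf_probe_read_user() from the sampling
// interrupt. On CONFIG_HARDENED_USERCOPY kernels this is the call that
// reaches find_vmap_area() via copy_from_user_nofault().
// Build (assumption): clang -O2 -g -target bpf -D__TARGET_ARCH_x86 -c prog.c
#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("perf_event")
int profile_cpu(struct bpf_perf_event_data *ctx)
{
	__u64 stack_word = 0;
	void *user_sp = (void *)PT_REGS_SP(&ctx->regs);

	/* Runs in IRQ context: this helper call is what can end up spinning
	 * on vmap_area_lock if the interrupted task already holds it. */
	bpf_probe_read_user(&stack_word, sizeof(stack_word), user_sp);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";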

To Reproduce

Start a machine running the affected upstream kernel code (tested with v6.1, but I believe the bug is also present in most other kernels). To reproduce it, you can for example use an AWS EC2 c6a.16xlarge instance (64 vCPUs) with the AMI al2023-ami-2023.0.20230503.0-kernel-6.1-x86_64. Having more CPUs allows the bug to be triggered more quickly.

$ curl -sL https://github.com/parca-dev/parca-agent/releases/download/v0.19.0/parca-agent_0.19.0_`uname -s`_`uname -m`.tar.gz | tar xvfz -
$ sudo ./parca-agent --node=test --remote-store-address=localhost:7070 --remote-store-insecure

To trigger the bug quickly, execute some code that will also use vmap_area_lock. For example, the clone() syscall:

$ while true ; do
ls -al > /dev/null # do not use true, which is a shell builtin and does not fork
done

Within 10 minutes, the CPU soft lockup messages should appear on the serial console.

Expected behavior

The machine does not lock up. BPF should never be able to lock up the machine, but because of the kernel bug it happens anyway.

Logs

Here's an annotated log from the serial console. Other traces are also printed (from the other CPUs attempting to acquire the lock); however, I believe this one shows the root cause:

[253905.544838] Sending NMI from CPU 27 to CPUs 55:
[253905.545371] NMI backtrace for cpu 55
[253905.545375] CPU: 55 PID: 3316 Comm: spawn Tainted: G             L     6.1.25-37.47.amzn2023.x86_64 #1
[253905.545377] Hardware name: Amazon EC2 c6a.16xlarge/, BIOS 0 10/16/2017
[253905.545378] RIP: 0010:native_queued_spin_lock_slowpath+0x32/0x2c0
[253905.545384] Code: 54 55 48 89 fd 53 66 90 ba 01 00 00 00 8b 45 00 85 c0 75 14 f0 0f b1 55 00 85 c0 75 f0 5b 5d 41 5c 41 5d c3 cc cc cc cc f3 90 <eb> e1 81 fe 00 01 00 00 74 50 40 30 f6 85 f6 75 73 f0 0f ba 6d 00
[253905.545385] RSP: 0018:ffffc3edc6e68bc0 EFLAGS: 00000002
[253905.545387] RAX: 0000000000000001 RBX: ffffffffa1777ccc RCX: 0000000000000010
[253905.545388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffffa1777ccc
[253905.545388] RBP: ffffffffa1777ccc R08: 0000000000000001 R09: 000004c6af4181a9
[253905.545389] R10: 0000000000000000 R11: ffffc3edc6e68ff8 R12: 0000000000000008
[253905.545390] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000080000000
[253905.545393] FS:  00007fd4a28d8600(0000) GS:ffffa057e99c0000(0000) knlGS:0000000000000000
[253905.545394] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[253905.545395] CR2: 00000000004040b0 CR3: 00000002461a8001 CR4: 00000000003706e0
[253905.545398] Call Trace:
[253905.545399]  <IRQ>
#
#
# https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L1861
#
[253905.545401]  _raw_spin_lock+0x30/0x40
[253905.545403]  find_vmap_area+0x17/0x60
#
#
# Likely requires https://elixir.bootlin.com/linux/v6.1.28/K/ident/CONFIG_HARDENED_USERCOPY
#
[253905.545407]  check_heap_object+0xd4/0x150
[253905.545409]  __check_object_size.part.0+0x47/0xd0
#
#
# This does pagefault_disable() (like perf_callchain_user()), which should make the actual copy IRQ-safe.
#
# But it calls access_ok() before pagefault_disable(), which is apparently not IRQ-safe.
# https://elixir.bootlin.com/linux/v6.1.28/source/arch/x86/include/asm/uaccess.h#L41
#
[253905.545411]  copy_from_user_nofault+0x65/0x90
[253905.545413]  bpf_probe_read_user+0x18/0x50
[253905.545416]  bpf_prog_2448819a7219e528_profile_cpu+0x354/0x9fd
[253905.545421]  bpf_overflow_handler+0xad/0x170
[253905.545424]  __perf_event_overflow+0x102/0x1e0
[253905.545426]  ? __perf_event_overflow+0x1e0/0x1e0
[253905.545427]  perf_swevent_hrtimer+0x12b/0x140
[253905.545430]  ? update_load_avg+0x7e/0x740
[253905.545433]  ? enqueue_entity+0x1b2/0x520
[253905.545435]  __hrtimer_run_queues+0x112/0x2b0
[253905.545439]  hrtimer_interrupt+0x106/0x220
[253905.545442]  __sysvec_apic_timer_interrupt+0x7f/0x170
[253905.545445]  sysvec_apic_timer_interrupt+0x9d/0xd0
[253905.545448]  </IRQ>
[253905.545449]  <TASK>
[253905.545449]  asm_sysvec_apic_timer_interrupt+0x16/0x20
[253905.545452] RIP: 0010:insert_vmap_area.constprop.0+0x34/0x120
[253905.545453] Code: 4b 03 41 55 41 54 55 53 48 89 fb 48 85 c0 0f 84 d3 00 00 00 4c 8b 4f 08 eb 10 48 8b 48 10 48 8d 50 10 48 85 c9 74 29 48 8b 02 <48> 8b 48 f0 49 39 c9 76 e7 48 8b 33 40 f8 4c 39 c6 0f 82 88
[253905.545454] RSP: 0018:ffffc3ede3b23bf8 EFLAGS: 00000282
[253905.545455] RAX: ffffa039ec903d10 RBX: ffffa039ec9030c0 RCX: ffffa039ec903d10
[253905.545456] RDX: ffffa0492d825520 RSI: ffffc3edf0f08000 RDI: ffffa039ec9030c0
[253905.545456] RBP: ffffa048c77e8400 R08: ffffc3edf0efd000 R09: ffffc3edf0f0d000
[253905.545457] R10: ffffc3edf0f05000 R11: 0000000000036b00 R12: 0000000000005000
[253905.545458] R13: 0000000000003fff R14: ffffa039ec9030c0 R15: ffffc3edc0000000
#
#
# https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L1634
#
[253905.545460]  alloc_vmap_area+0x330/0x820
[253905.545463]  __get_vm_area_node+0xb8/0x170
[253905.545464]  __vmalloc_node_range+0xa6/0x220
[253905.545466]  ? dup_task_struct+0x57/0x1a0
[253905.545470]  alloc_thread_stack_node+0xcd/0x130
[253905.545472]  ? dup_task_struct+0x57/0x1a0
[253905.545474]  dup_task_struct+0x57/0x1a0
[253905.545476]  copy_process+0x1bd/0x15c0
[253905.545479]  kernel_clone+0x9b/0x3b0
[253905.545482]  __do_sys_clone+0x66/0x90
[253905.545485]  do_syscall_64+0x3b/0x90
[253905.545487]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[253905.545489] RIP: 0033:0x7fd4a2718a27
[253905.545490] Code: 00 00 00 f3 0f 1e fa 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 39 41 89 c0 85 c0 75 2a 64 48 8b 04 25 10 00
[253905.545491] RSP: 002b:00007ffc648c1158 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[253905.545492] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fd4a2718a27
[253905.545493] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[253905.545494] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[253905.545494] R10: 00007fd4a28d88d0 R11: 0000000000000246 R12: 0000000000000000
[253905.545495] R13: 00000000004010b0 R14: 0000000000403e00 R15: 00007fd4a2914000
[253905.545497]  </TASK>
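
To make the interleaving in the trace easier to follow, here is a heavily simplified paraphrase of the two v6.1 code paths involved. This is not verbatim kernel code (see the elixir links above for the real sources):

/* Task context on CPU 55: clone() allocating the new thread's vmalloc'ed stack. */
static void alloc_vmap_area_paraphrase(void)
{
	spin_lock(&vmap_area_lock);
	insert_vmap_area(va, &vmap_area_root, &vmap_area_list); /* <-- perf_event IRQ fires here */
	spin_unlock(&vmap_area_lock);                            /* never reached */
}

/* perf_event IRQ on the same CPU, reached via bpf_probe_read_user(). */
long copy_from_user_nofault_paraphrase(void *dst, const void __user *src, size_t size)
{
	long ret = -EFAULT;

	if (access_ok(src, size)) {
		pagefault_disable();
		/*
		 * With CONFIG_HARDENED_USERCOPY:
		 *   __copy_from_user_inatomic()
		 *     -> check_object_size() -> __check_object_size()
		 *     -> check_heap_object() -> find_vmap_area()
		 *     -> spin_lock(&vmap_area_lock)
		 * spins forever, because the lock owner is the very task
		 * this IRQ interrupted and it can never run again.
		 */
		ret = __copy_from_user_inatomic(dst, src, size);
		pagefault_enable();
	}
	return ret ? -EFAULT : 0;
}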

Software (please complete the following information):

  • Parca Agent Version: v0.19.0, also tested git tree from last week
  • Parca Server Version (if applicable): NA

Workload (please complete the following information):

  • Runtime (if applicable):
  • Compiler (if applicable):

Environment (please complete the following information):

  • Linux Distribution (tested on the following, others are likely also affected):
$ cat /etc/*-release
Amazon Linux release 2023 (Amazon Linux)
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
SUPPORT_END="2028-03-01"
Amazon Linux release 2023 (Amazon Linux)
  • Linux Version: 6.1.25-37.47.amzn2023.x86_64
  • Arch: x86_64
  • Kubernetes Version (if applicable): NA
  • Container Runtime (if applicable): NA

Additional context

I believe this is neither a bug in Amazon Linux nor in Parca, but an upstream kernel bug. I have not reported it upstream yet (you are free to do so yourself; it would be great if you CC gerhorst@amazon.de and linux-kernel@luisgerhorst.de if you do). I was not able to find an existing report on LKML. I am reporting this here because parca-agent is affected, and you will likely want to change your BPF program even if the bug is fixed upstream (as it will take time for the fix to propagate).

The best fix for you is likely to stop using the BPF helper for now. Alternatively, you could detect the specific conditions that trigger the bug and avoid calling the helper only when they are present, for example as sketched below.
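
A purely hypothetical sketch of that second option (none of these names exist in parca-agent today): the loader could detect an affected kernel/config combination at startup and flip a constant in the BPF program's read-only data, so that the unsafe helper call is skipped (and can even be dead-code-eliminated by the verifier):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical mitigation sketch, not existing parca-agent code. The
 * userspace loader sets this .rodata constant before load when it decides
 * the running kernel is safe (e.g., not in an affected version range, or
 * CONFIG_HARDENED_USERCOPY disabled). */
const volatile __u8 user_reads_allowed = 0;

static __always_inline long read_user_word(__u64 *dst, const void *unsafe_user_ptr)
{
	if (!user_reads_allowed)
		return -1; /* skip unwinding rather than risk the vmap_area_lock deadlock */
	return bpf_probe_read_user(dst, sizeof(*dst), unsafe_user_ptr);
}

Whether the trigger conditions can be detected reliably from userspace is an open question; the conservative variant is simply to refuse to use the helper on suspect kernels.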

To fix the kernel bug, it may be possible to disable IRQs during alloc_vmap_area() and similar functions, or to make access_ok() IRQ-safe.

@luisgerhorst luisgerhorst changed the title parca-agent triggers kernel BUG because it calls bpf_probe_read_user() in the perf_event IRQ parca-agent triggers kernel bug because it calls bpf_probe_read_user() in the perf_event IRQ May 19, 2023
@kakkoyun kakkoyun added the area/eBPF Something involving eBPF label May 22, 2023

kakkoyun commented May 22, 2023

@luisgerhorst Thanks for reporting 👍 Let us discuss our options, and we will update here.

@javierhonduco

Thanks for the detailed bug report! I agree with you that this issue lies in the kernel. BPF execution should always be safe, so it should never lead to kernel panics / oops.

We can't stop using bpf_probe_read_user as it's at the heart of what we need to do -- reading memory locations so we can unwind different runtimes.

Found a recent patch (https://lore.kernel.org/bpf/202301190848.D0543F7CE@keescook/T/#mf4a2a97bb0a4cdc13eff7a1f8f5d25ea594263c2) that mentions exactly the issue we are seeing:

  • __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
    check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
    which is not safe to do from BPF, [ke]probe and perf due to potential deadlock.

Let us know if you can give it a try!


javierhonduco commented May 31, 2023

Will leave this issue open to track backports of the fix / bugs opened in different distros:

javierhonduco added a commit that referenced this issue Sep 28, 2023
Releases >=5.19 && <6.1 have a pretty bad kernel bug that can result in
whole system lock-ups that can only be fixed with a reboot
(https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398).

The fix got backported to -stable
(https://www.spinics.net/lists/stable/msg662452.html, and
https://www.spinics.net/lists/stable/msg662218.html for 6.1 and 6.3
respectively).

Let's not run the Agent in these kernels, but provide a flag to bypass
this check. Note that running a buggy kernel can result in your machine
going down.

Related issue: #1675

Test Plan
=========

Tested locally + added unit tests

szuecs commented Oct 11, 2023

I am pretty sure we hit the same bug. I could not get console output from AWS, but we run c6g.8xlarge instances in a test and get a machine freeze every 10-30 minutes.

Kernel: 6.2.0-1009-aws

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

@javierhonduco

Closing, as we now have a check that prevents running on these kernels by default. Thanks a lot!
