perf: Support the deferred unwinding infrastructure #5545
base: bpf-next_base
Conversation
The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries. That's confusing and
the callers always pass zero anyway. Hard code the zero.

Acked-by: Namhyung Kim <Namhyung@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
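A minimal sketch of what this means at a call site; the surrounding
parameter names follow the function's prototype as described, but treat
the exact signature as approximate:

	/* Before: callers always passed init_nr == 0. */
	entry = get_perf_callchain(regs, 0 /* init_nr */, kernel, user,
				   max_stack, crosstask, add_mark);

	/* After: the argument is gone; the zero is hard coded internally. */
	entry = get_perf_callchain(regs, kernel, user,
				   max_stack, crosstask, add_mark);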
… set

get_perf_callchain() doesn't support cross-task unwinding for user space
stacks, so have it return NULL if both the crosstask and user arguments
are set.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
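In code form, the guard described above amounts to an early return; its
exact placement within the function is an assumption here:

	/* Cross-task unwinding of user stacks is unsupported. */
	if (crosstask && user)
		return NULL;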
…nt->mm == NULL

To determine if a task is a kernel thread or not, it is more reliable to
use (current->flags & (PF_KTHREAD | PF_USER_WORKER)) than to rely on
current->mm being NULL. That is because some kernel tasks (io_uring
helpers) may have a non-NULL mm.

Link: https://lore.kernel.org/linux-trace-kernel/20250424163607.GE18306@noisy.programming.kicks-ass.net/
Link: https://lore.kernel.org/all/20250624130744.602c5b5f@batman.local.home/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
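A small illustration of the check this argues for; the early-return
placement is an assumption:

	/* More reliable than current->mm == NULL: io_uring helpers
	 * (PF_USER_WORKER) are kernel tasks that may have a non-NULL mm. */
	if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
		return NULL;	/* kernel task: no user stack to unwind */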
Simplify the get_perf_callchain() user logic a bit. task_pt_regs()
should never be NULL.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
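A rough before/after sketch of the kind of simplification this enables,
assuming the commonly cited shape of this code; the actual diff may
differ:

	/* Before: defends against a NULL that task_pt_regs() never returns. */
	if (!user_mode(regs)) {
		if (current->mm)
			regs = task_pt_regs(current);
		else
			regs = NULL;
	}
	if (regs)
		perf_callchain_user(&ctx, regs);

	/* After: only the "is there a user context at all" check remains. */
	if (!user_mode(regs)) {
		if (!current->mm)
			goto exit_put;	/* label assumed from context */
		regs = task_pt_regs(current);
	}
	perf_callchain_user(&ctx, regs);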
If the task is not a user thread, there's no user stack to unwind.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Use the new unwind_deferred_trace() interface (if available) to defer
unwinds to task context. This will allow the use of .sframe (when it
becomes available) and also prevents duplicate userspace unwinds.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
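For orientation, a hedged sketch of how a consumer hooks into the
deferred unwind infrastructure; the function names come from the series
itself, but the exact callback signature here is an assumption:

	/* Runs in task context just before the task returns to user space. */
	static void perf_event_callchain_deferred(struct unwind_work *work,
						  struct unwind_stacktrace *trace,
						  u64 cookie)
	{
		/* emit trace->entries[0 .. trace->nr) as a deferred record */
	}

	/* One-time registration of the callback. */
	unwind_deferred_init(&unwind_work, perf_event_callchain_deferred);

	/* From interrupt/NMI context: ask for the unwind instead of doing
	 * it now; the cookie ties the request to the later callback. */
	unwind_deferred_request(&unwind_work, &cookie);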
The deferred unwinder works fine for task events (events that trace only
a specific task), as it can use a task_work from an interrupt or NMI,
and when the task goes back to user space it will call the event's
callback to do the deferred unwinding.

But for per CPU events things are not so simple. When a per CPU event
wants a deferred unwinding to occur, it cannot simply use a task_work,
as there's a many to many relationship: if the task migrates, the per
CPU event may want a deferred unwinding of the task scheduled in next,
while the event on the CPU that the task migrated to may want to unwind
the migrated task as well. Each CPU may need unwinding from more than
one task, and each task may have requests from many CPUs.

To solve this, when a per CPU event is created with the defer_callchain
attribute set, it does a lookup in a global list (unwind_deferred_list)
for a perf_unwind_deferred descriptor whose id matches the PID of the
current task's group_leader. If one is not found, it is created and
added to the global list. This descriptor contains an array of all
possible CPUs, where each element is a perf_unwind_cpu descriptor.

The perf_unwind_cpu descriptor has a list of all the per CPU events
that are tracing the CPU that corresponds to its index in the array,
where the events belong to a task that has the same group_leader. It
also has a processing bit and an rcuwait to handle removal.

For each occupied perf_unwind_cpu descriptor in the array, the
perf_unwind_deferred descriptor increments its nr_cpu_events; when a
perf_unwind_cpu descriptor becomes empty, nr_cpu_events is decremented.
This is used to know when to free the perf_unwind_deferred descriptor:
once the count drops to zero, the descriptor is no longer referenced.

Finally, the perf_unwind_deferred descriptor has an id that holds the
PID of the group_leader of the tasks that created the events. When a
second (or later) per CPU event is created and the perf_unwind_deferred
descriptor already exists, it just adds itself to the perf_unwind_cpu
array of that descriptor, updating the necessary counter. This is used
to map different per CPU events to each other based on their group
leader PID.

Each of these perf_unwind_deferred descriptors has an unwind_work that
registers with the deferred unwind infrastructure via
unwind_deferred_init(), which also registers its callback,
perf_event_deferred_cpu().

Now when a per CPU event requests a deferred unwinding, it calls
unwind_deferred_request() with the associated perf_unwind_deferred
descriptor. It is expected that the program that uses this has events
on all CPUs, as the deferred trace may not be called on the CPU event
that requested it. That is, the task may migrate, and its user stack
trace will be recorded on the CPU event of the CPU that it exits back
to user space on.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
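To make those relationships easier to follow, here is a rough
data-structure sketch assembled from the description above; only the
fields named in the text are grounded, and the remaining names and
types are illustrative assumptions:

	/* Sketch only: one entry per CPU index in the cpu_events array. */
	struct perf_unwind_cpu {
		struct list_head	list;	/* per CPU events tracing this CPU */
		struct rcuwait		wait;	/* illustrative name: drains removal */
		bool			processing; /* set while unwinds are delivered */
	};

	/* Sketch only: one per group_leader PID, on the global list. */
	struct perf_unwind_deferred {
		struct list_head	list;		/* on unwind_deferred_list */
		struct unwind_work	unwind_work;	/* unwind_deferred_init() +
							 * perf_event_deferred_cpu() */
		struct perf_unwind_cpu	*cpu_events;	/* array over possible CPUs */
		int			nr_cpu_events;	/* occupied entries; free at 0 */
		int			id;		/* group_leader PID */
	};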
Add a new event type for deferred callchains and a new callback for the
struct perf_tool. For now it doesn't actually handle the deferred
callchains but it just marks the sample if it has the
PERF_CONTEXT_USER_DEFERRED in the callchain array.

At least, perf report can dump the raw data with this change. Actually
this requires the next commit to enable attr.defer_callchain, but if
you already have a data file, it'll show the following result.

  $ perf report -D
  ...
  0x5fe0@perf.data [0x40]: event: 22
  .
  . ... raw event: size 64 bytes
  .  0000:  16 00 00 00 02 00 40 00  02 00 00 00 00 00 00 00  ......@.........
  .  0010:  00 fe ff ff ff ff ff ff  4b d3 3f 25 45 7f 00 00  ........K.?%E...
  .  0020:  21 03 00 00 21 03 00 00  43 02 12 ab 05 00 00 00  !...!...C.......
  .  0030:  00 00 00 00 00 00 00 00  09 00 00 00 00 00 00 00  ................

  0 24344920643 0x5fe0 [0x40]: PERF_RECORD_CALLCHAIN_DEFERRED(IP, 0x2): 801/801: 0
  ... FP chain: nr:2
  .....  0: fffffffffffffe00
  .....  1: 00007f45253fd34b
   : unhandled!

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
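Decoding the hex dump above suggests the following record layout; this
is inferred from the dump with the trailing sample_id fields omitted,
and the struct name is illustrative, not the ABI definition:

	struct perf_record_callchain_deferred {
		struct perf_event_header header; /* type 22, misc 0x2, size 0x40 */
		__u64 nr;			 /* 2 in the dump */
		__u64 ips[];			 /* ips[0] = 0xfffffffffffffe00,
						  * the PERF_CONTEXT_USER marker;
						  * ips[1] = the user-space IP */
	};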
And add the missing feature detection logic to clear the flag on old
kernels.

  $ perf record -g -vv true
  ...
  ------------------------------------------------------------
  perf_event_attr:
    type                             0 (PERF_TYPE_HARDWARE)
    size                             136
    config                           0 (PERF_COUNT_HW_CPU_CYCLES)
    { sample_period, sample_freq }   4000
    sample_type                      IP|TID|TIME|CALLCHAIN|PERIOD
    read_format                      ID|LOST
    disabled                         1
    inherit                          1
    mmap                             1
    comm                             1
    freq                             1
    enable_on_exec                   1
    task                             1
    sample_id_all                    1
    mmap2                            1
    comm_exec                        1
    ksymbol                          1
    bpf_event                        1
    defer_callchain                  1
  ------------------------------------------------------------
  sys_perf_event_open: pid 162755  cpu 0  group_fd -1  flags 0x8
  sys_perf_event_open failed, error -22
  switching off deferred callchain support

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
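The detection follows perf's usual probe-and-retry fallback; a hedged
sketch where only the defer_callchain bit is taken from the log above,
and the retry label and error path are placeholders:

	/* If the kernel rejects the new attr bit with EINVAL, clear it
	 * and retry the open without it. */
	if (err == -EINVAL && evsel->core.attr.defer_callchain) {
		evsel->core.attr.defer_callchain = 0;
		pr_debug("switching off deferred callchain support\n");
		goto retry_open;	/* placeholder for the retry path */
	}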
Handle the deferred callchains in the script output.

  $ perf script
  perf     801 [000]    18.031793:          1 cycles:P:
          ffffffff91a14c36 __intel_pmu_enable_all.isra.0+0x56 ([kernel.kallsyms])
          ffffffff91d373e9 perf_ctx_enable+0x39 ([kernel.kallsyms])
          ffffffff91d36af7 event_function+0xd7 ([kernel.kallsyms])
          ffffffff91d34222 remote_function+0x42 ([kernel.kallsyms])
          ffffffff91c1ebe1 generic_exec_single+0x61 ([kernel.kallsyms])
          ffffffff91c1edac smp_call_function_single+0xec ([kernel.kallsyms])
          ffffffff91d37a9d event_function_call+0x10d ([kernel.kallsyms])
          ffffffff91d33557 perf_event_for_each_child+0x37 ([kernel.kallsyms])
          ffffffff91d47324 _perf_ioctl+0x204 ([kernel.kallsyms])
          ffffffff91d47c43 perf_ioctl+0x33 ([kernel.kallsyms])
          ffffffff91e2f216 __x64_sys_ioctl+0x96 ([kernel.kallsyms])
          ffffffff9265f1ae do_syscall_64+0x9e ([kernel.kallsyms])
          ffffffff92800130 entry_SYSCALL_64+0xb0 ([kernel.kallsyms])

  perf     801 [000]    18.031814: DEFERRED CALLCHAIN
              7fb5fc22034b __GI___ioctl+0x3b (/usr/lib/x86_64-linux-gnu/libc.so.6)

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Save samples with deferred callchains in a separate list and deliver
them after merging the user callchains. If users don't want to merge
they can set tool->merge_deferred_callchains to false to prevent the
behavior.

With the previous result, now perf script will show the merged
callchains.

  $ perf script
  perf     801 [000]    18.031793:          1 cycles:P:
          ffffffff91a14c36 __intel_pmu_enable_all.isra.0+0x56 ([kernel.kallsyms])
          ffffffff91d373e9 perf_ctx_enable+0x39 ([kernel.kallsyms])
          ffffffff91d36af7 event_function+0xd7 ([kernel.kallsyms])
          ffffffff91d34222 remote_function+0x42 ([kernel.kallsyms])
          ffffffff91c1ebe1 generic_exec_single+0x61 ([kernel.kallsyms])
          ffffffff91c1edac smp_call_function_single+0xec ([kernel.kallsyms])
          ffffffff91d37a9d event_function_call+0x10d ([kernel.kallsyms])
          ffffffff91d33557 perf_event_for_each_child+0x37 ([kernel.kallsyms])
          ffffffff91d47324 _perf_ioctl+0x204 ([kernel.kallsyms])
          ffffffff91d47c43 perf_ioctl+0x33 ([kernel.kallsyms])
          ffffffff91e2f216 __x64_sys_ioctl+0x96 ([kernel.kallsyms])
          ffffffff9265f1ae do_syscall_64+0x9e ([kernel.kallsyms])
          ffffffff92800130 entry_SYSCALL_64+0xb0 ([kernel.kallsyms])
              7fb5fc22034b __GI___ioctl+0x3b (/usr/lib/x86_64-linux-gnu/libc.so.6)
  ...

The old output can be obtained using the --no-merge-callchain option.
Also perf report can get the user callchain entry at the end.

  $ perf report --no-children --percent-limit=0 --stdio -q -S __intel_pmu_enable_all.isra.0
  # symbol: __intel_pmu_enable_all.isra.0
       0.00%  perf  [kernel.kallsyms]
              |
              ---__intel_pmu_enable_all.isra.0
                 perf_ctx_enable
                 event_function
                 remote_function
                 generic_exec_single
                 smp_call_function_single
                 event_function_call
                 perf_event_for_each_child
                 _perf_ioctl
                 perf_ioctl
                 __x64_sys_ioctl
                 do_syscall_64
                 entry_SYSCALL_64
                 __GI___ioctl

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
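A hedged sketch of the buffering step this describes; apart from
tool->merge_deferred_callchains, every name here is illustrative:

	/* Hold a sample whose callchain ends in the deferred marker until
	 * the matching PERF_RECORD_CALLCHAIN_DEFERRED arrives, then splice
	 * the user entries onto the kernel chain and deliver it. */
	if (tool->merge_deferred_callchains &&
	    sample_has_deferred_marker(sample)) {	/* illustrative helper */
		list_add_tail(&entry->list, &deferred_samples);
		return 0;	/* delivered later, after merging */
	}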
Pull request for series with
subject: perf: Support the deferred unwinding infrastructure
version: 12
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=977859