tools: bpftool: support creating and dumping outer maps #7
(hash-of-maps or array-of-maps), bpftool does not allow doing so. It seems that the only reason for that is historical. Lookups for outer maps were added in commit 14dc6f0 ("bpf: Add syscall lookup support for fd array and htab"), and although the relevant code in bpftool had not been merged yet, I suspect it had already been written with the assumption that user space could not read outer maps.

Let's remove the restriction; dump for outer maps works with no further change.

Reported-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
---
tools/bpf/bpftool/map.c | 4 ----
1 file changed, 4 deletions(-)
hash-of-map in bpftool. This is because the kernel needs an inner_map_fd
to collect metadata on the inner maps to be supported by the new map,
but bpftool does not provide a way to pass this file descriptor.
Add a new optional "inner_map" keyword that can be used to pass a
reference to a map, retrieve a fd to that map, and pass it as the
inner_map_fd.
Add related documentation and bash completion. Note that we can
reference the inner map by its name, meaning the keyword "name" can
appear several times with different meanings (the mandatory outer map
name, and possibly a name used to find the inner_map_fd). The bash
completion will offer it just once, and will not suggest "name" on the
following command:
# bpftool map create /sys/fs/bpf/my_outer_map type hash_of_maps \
inner_map name my_inner_map [TAB]
Fixing that specific case seems too convoluted. Completion will work as
expected, however, if the outer map name comes first and the "inner_map
name ..." is passed second.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
---
.../bpf/bpftool/Documentation/bpftool-map.rst | 10 +++-
tools/bpf/bpftool/bash-completion/bpftool | 22 ++++++++-
tools/bpf/bpftool/map.c | 48 +++++++++++++------
3 files changed, 62 insertions(+), 18 deletions(-)
Master branch: 95cec14
patch https://patchwork.ozlabs.org/project/netdev/patch/20200904161313.29535-2-quentin@isovalent.com/ applied successfully

At least one diff in series https://patchwork.ozlabs.org/project/netdev/list/?series=199591 expired. Closing PR.
errors like:
  error: progs/test_sysctl_loop1.c:23:16: in function sysctl_tcp_mem i32 (%struct.bpf_sysctl*):
  Looks like the BPF stack limit of 512 bytes is exceeded.
  Please move large on stack variables into BPF per-cpu array map.

The error is triggered by the following LLVM patch:
  https://reviews.llvm.org/D87134

For example, the following code is from test_sysctl_loop1.c:
  static __always_inline int is_tcp_mem(struct bpf_sysctl *ctx)
  {
    volatile char tcp_mem_name[] = "net/ipv4/tcp_mem/very_very_very_very_long_pointless_string";
    ...
  }
Without the above LLVM patch, the compiler optimized the load of the string
(59 bytes long) into 7 64bit loads, 1 8bit load and 1 16bit load,
occupying 64 bytes of stack.

With the above LLVM patch, the compiler only uses 8bit loads, but the subregister is 32bit.
So the stack requirement becomes 4 * 59 = 236 bytes. Together with other stuff on
the stack, the total stack size exceeds 512 bytes, hence the compiler complains and quits.

To fix the issue, removing the "volatile" keyword or changing "volatile" to
"const"/"static const" does not work: the string is put in the .rodata.str1.1 section,
which libbpf did not process, and it errors out with
  libbpf: elf: skipping unrecognized data section(6) .rodata.str1.1
  libbpf: prog 'sysctl_tcp_mem': bad map relo against '.L__const.is_tcp_mem.tcp_mem_name' in section '.rodata.str1.1'

Defining the string constant as a global variable can fix the issue, as it puts the string
constant in the '.rodata' section, which is recognized by libbpf. In the future, when libbpf
can process '.rodata.str*.*' properly, the global definition can be changed back to a local
definition.

Defining tcp_mem_name as a global, however, triggered a verifier failure.
  ./test_progs -n 7/21
  libbpf: load bpf program failed: Permission denied
  libbpf: -- BEGIN DUMP LOG ---
  libbpf:
  invalid stack off=0 size=1
  verification time 6975 usec
  stack depth 160+64
  processed 889 insns (limit 1000000) max_states_per_insn 4 total_states 14 peak_states 14 mark_read 10

  libbpf: -- END LOG --
  libbpf: failed to load program 'sysctl_tcp_mem'
  libbpf: failed to load object 'test_sysctl_loop2.o'
  test_bpf_verif_scale:FAIL:114
  #7/21 test_sysctl_loop2.o:FAIL
This actually exposed a bpf program bug. In test_sysctl_loop{1,2}, we have code
like
  const char tcp_mem_name[] = "<...long string...>";
  ...
  char name[64];
  ...
  for (i = 0; i < sizeof(tcp_mem_name); ++i)
      if (name[i] != tcp_mem_name[i])
          return 0;
In the above code, if sizeof(tcp_mem_name) > 64, the name[i] access may be
out of bounds. The sizeof(tcp_mem_name) is 59 for test_sysctl_loop1.c and
79 for test_sysctl_loop2.c.

Without the promotion-to-global change, the old compiler generates code where
the overflowed stack access is actually filled with a valid value, hiding
the bpf program bug. With the promotion-to-global change, the code is different:
more specifically, the previous loading of constants to the stack is gone,
"name" occupies stack[-64:0], and the overflow access triggers a verifier error.
To fix the issue, adjust the "name" buffer size properly.

Reported-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/progs/test_sysctl_loop1.c | 2 +-
 tools/testing/selftests/bpf/progs/test_sysctl_loop2.c | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

Changelog:
  v1 -> v2:
    . The tcp_mem_name change actually triggers a verifier failure due to
      a bpf program bug. Fixing the bpf program bug can make the test pass
      with both old and latest llvm. (Alexei)
errors like:
  error: progs/test_sysctl_loop1.c:23:16: in function sysctl_tcp_mem i32 (%struct.bpf_sysctl*):
  Looks like the BPF stack limit of 512 bytes is exceeded.
  Please move large on stack variables into BPF per-cpu array map.

The error is triggered by the following LLVM patch:
  https://reviews.llvm.org/D87134

For example, the following code is from test_sysctl_loop1.c:
  static __always_inline int is_tcp_mem(struct bpf_sysctl *ctx)
  {
    volatile char tcp_mem_name[] = "net/ipv4/tcp_mem/very_very_very_very_long_pointless_string";
    ...
  }
Without the above LLVM patch, the compiler optimized the load of the string
(59 bytes long) into 7 64bit loads, 1 8bit load and 1 16bit load,
occupying 64 bytes of stack.

With the above LLVM patch, the compiler only uses 8bit loads, but the subregister is 32bit.
So the stack requirement becomes 4 * 59 = 236 bytes. Together with other stuff on
the stack, the total stack size exceeds 512 bytes, hence the compiler complains and quits.

To fix the issue, removing the "volatile" keyword or changing "volatile" to
"const"/"static const" does not work: the string is put in the .rodata.str1.1 section,
which libbpf did not process, and it errors out with
  libbpf: elf: skipping unrecognized data section(6) .rodata.str1.1
  libbpf: prog 'sysctl_tcp_mem': bad map relo against '.L__const.is_tcp_mem.tcp_mem_name' in section '.rodata.str1.1'

Defining the string constant as a global variable can fix the issue, as it puts the string
constant in the '.rodata' section, which is recognized by libbpf. In the future, when libbpf
can process '.rodata.str*.*' properly, the global definition can be changed back to a local
definition.

Defining tcp_mem_name as a global, however, triggered a verifier failure.
  ./test_progs -n 7/21
  libbpf: load bpf program failed: Permission denied
  libbpf: -- BEGIN DUMP LOG ---
  libbpf:
  invalid stack off=0 size=1
  verification time 6975 usec
  stack depth 160+64
  processed 889 insns (limit 1000000) max_states_per_insn 4 total_states 14 peak_states 14 mark_read 10

  libbpf: -- END LOG --
  libbpf: failed to load program 'sysctl_tcp_mem'
  libbpf: failed to load object 'test_sysctl_loop2.o'
  test_bpf_verif_scale:FAIL:114
  #7/21 test_sysctl_loop2.o:FAIL
This actually exposed a bpf program bug. In test_sysctl_loop{1,2}, we have code
like
  const char tcp_mem_name[] = "<...long string...>";
  ...
  char name[64];
  ...
  for (i = 0; i < sizeof(tcp_mem_name); ++i)
      if (name[i] != tcp_mem_name[i])
          return 0;
In the above code, if sizeof(tcp_mem_name) > 64, the name[i] access may be
out of bounds. The sizeof(tcp_mem_name) is 59 for test_sysctl_loop1.c and
79 for test_sysctl_loop2.c.

Without the promotion-to-global change, the old compiler generates code where
the overflowed stack access is actually filled with a valid value, hiding
the bpf program bug. With the promotion-to-global change, the code is different:
more specifically, the previous loading of constants to the stack is gone,
"name" occupies stack[-64:0], and the overflow access triggers a verifier error.
To fix the issue, adjust the "name" buffer size properly.

Reported-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/progs/test_sysctl_loop1.c | 4 ++--
 tools/testing/selftests/bpf/progs/test_sysctl_loop2.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

Changelog:
  v2 -> v3:
    . Use sizeof(tcp_mem_name) instead of a hardcoded value for the
      local buf "name". (Andrii)
  v1 -> v2:
    . The tcp_mem_name change actually triggers a verifier failure due to
      a bpf program bug. Fixing the bpf program bug can make the test pass
      with both old and latest llvm. (Alexei)
Andrii reported that with latest clang, when building selftests, we have
errors like:
  error: progs/test_sysctl_loop1.c:23:16: in function sysctl_tcp_mem i32 (%struct.bpf_sysctl*):
  Looks like the BPF stack limit of 512 bytes is exceeded.
  Please move large on stack variables into BPF per-cpu array map.

The error is triggered by the following LLVM patch:
  https://reviews.llvm.org/D87134

For example, the following code is from test_sysctl_loop1.c:
  static __always_inline int is_tcp_mem(struct bpf_sysctl *ctx)
  {
    volatile char tcp_mem_name[] = "net/ipv4/tcp_mem/very_very_very_very_long_pointless_string";
    ...
  }
Without the above LLVM patch, the compiler optimized the load of the string
(59 bytes long) into 7 64bit loads, 1 8bit load and 1 16bit load,
occupying 64 bytes of stack.

With the above LLVM patch, the compiler only uses 8bit loads, but the subregister is 32bit.
So the stack requirement becomes 4 * 59 = 236 bytes. Together with other stuff on
the stack, the total stack size exceeds 512 bytes, hence the compiler complains and quits.

To fix the issue, removing the "volatile" keyword or changing "volatile" to
"const"/"static const" does not work: the string is put in the .rodata.str1.1 section,
which libbpf did not process, and it errors out with
  libbpf: elf: skipping unrecognized data section(6) .rodata.str1.1
  libbpf: prog 'sysctl_tcp_mem': bad map relo against '.L__const.is_tcp_mem.tcp_mem_name' in section '.rodata.str1.1'

Defining the string constant as a global variable can fix the issue, as it puts the string
constant in the '.rodata' section, which is recognized by libbpf. In the future, when libbpf
can process '.rodata.str*.*' properly, the global definition can be changed back to a local
definition.

Defining tcp_mem_name as a global, however, triggered a verifier failure.
  ./test_progs -n 7/21
  libbpf: load bpf program failed: Permission denied
  libbpf: -- BEGIN DUMP LOG ---
  libbpf:
  invalid stack off=0 size=1
  verification time 6975 usec
  stack depth 160+64
  processed 889 insns (limit 1000000) max_states_per_insn 4 total_states 14 peak_states 14 mark_read 10

  libbpf: -- END LOG --
  libbpf: failed to load program 'sysctl_tcp_mem'
  libbpf: failed to load object 'test_sysctl_loop2.o'
  test_bpf_verif_scale:FAIL:114
  #7/21 test_sysctl_loop2.o:FAIL
This actually exposed a bpf program bug. In test_sysctl_loop{1,2}, we have code
like
  const char tcp_mem_name[] = "<...long string...>";
  ...
  char name[64];
  ...
  for (i = 0; i < sizeof(tcp_mem_name); ++i)
      if (name[i] != tcp_mem_name[i])
          return 0;
In the above code, if sizeof(tcp_mem_name) > 64, the name[i] access may be
out of bounds. The sizeof(tcp_mem_name) is 59 for test_sysctl_loop1.c and
79 for test_sysctl_loop2.c.

Without the promotion-to-global change, the old compiler generates code where
the overflowed stack access is actually filled with a valid value, hiding
the bpf program bug. With the promotion-to-global change, the code is different:
more specifically, the previous loading of constants to the stack is gone,
"name" occupies stack[-64:0], and the overflow access triggers a verifier error.
To fix the issue, adjust the "name" buffer size properly.

Reported-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200909171542.3673449-1-yhs@fb.com
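
The buffer-size fix described above can be sketched in plain user-space C. This is only an illustration, not the selftest code: the helper below takes a `src`/`src_len` pair instead of a `struct bpf_sysctl` context (an assumption made so the sketch compiles outside BPF), and it sizes `name` from `sizeof(tcp_mem_name)` so the comparison loop cannot index past the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same string as in test_sysctl_loop1.c; sizeof() is 59 including the NUL. */
static const char tcp_mem_name[] =
	"net/ipv4/tcp_mem/very_very_very_very_long_pointless_string";

/*
 * Hypothetical user-space stand-in for is_tcp_mem(): returns 1 if the
 * first sizeof(tcp_mem_name) bytes of the input match, 0 otherwise.
 */
static int is_tcp_mem(const char *src, size_t src_len)
{
	/* Sized from the constant, so name[i] below stays in bounds. */
	char name[sizeof(tcp_mem_name)];
	size_t i;

	assert(src != NULL);

	memset(name, 0, sizeof(name));
	/* Copy no more than the buffer can hold. */
	memcpy(name, src, src_len < sizeof(name) ? src_len : sizeof(name));

	for (i = 0; i < sizeof(tcp_mem_name); ++i)
		if (name[i] != tcp_mem_name[i])
			return 0;
	return 1;
}
```

With a fixed `char name[64]`, the same loop would read `name[64]` through `name[78]` for the 79-byte string in test_sysctl_loop2.c, which is the out-of-bounds access the verifier rejected.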
I got the following lockdep splat while testing:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc7-00172-g021118712e59 #932 Not tainted
  ------------------------------------------------------
  btrfs/229626 is trying to acquire lock:
  ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450

  but task is already holding lock:
  ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
         __mutex_lock+0x9f/0x930
         btrfs_scrub_dev+0x11c/0x630
         btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
         btrfs_ioctl+0x2799/0x30a0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
         __mutex_lock+0x9f/0x930
         btrfs_run_dev_stats+0x49/0x480
         commit_cowonly_roots+0xb5/0x2a0
         btrfs_commit_transaction+0x516/0xa60
         sync_filesystem+0x6b/0x90
         generic_shutdown_super+0x22/0x100
         kill_anon_super+0xe/0x30
         btrfs_kill_super+0x12/0x20
         deactivate_locked_super+0x29/0x60
         cleanup_mnt+0xb8/0x140
         task_work_run+0x6d/0xb0
         __prepare_exit_to_usermode+0x1cc/0x1e0
         do_syscall_64+0x5c/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
         __mutex_lock+0x9f/0x930
         btrfs_commit_transaction+0x4bb/0xa60
         sync_filesystem+0x6b/0x90
         generic_shutdown_super+0x22/0x100
         kill_anon_super+0xe/0x30
         btrfs_kill_super+0x12/0x20
         deactivate_locked_super+0x29/0x60
         cleanup_mnt+0xb8/0x140
         task_work_run+0x6d/0xb0
         __prepare_exit_to_usermode+0x1cc/0x1e0
         do_syscall_64+0x5c/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
         __mutex_lock+0x9f/0x930
         btrfs_record_root_in_trans+0x43/0x70
         start_transaction+0xd1/0x5d0
         btrfs_dirty_inode+0x42/0xd0
         touch_atime+0xa1/0xd0
         btrfs_file_mmap+0x3f/0x60
         mmap_region+0x3a4/0x640
         do_mmap+0x376/0x580
         vm_mmap_pgoff+0xd5/0x120
         ksys_mmap_pgoff+0x193/0x230
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
         __might_fault+0x68/0x90
         _copy_to_user+0x1e/0x80
         perf_read+0x141/0x2c0
         vfs_read+0xad/0x1b0
         ksys_read+0x5f/0xe0
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (&cpuctx_mutex){+.+.}-{3:3}:
         __mutex_lock+0x9f/0x930
         perf_event_init_cpu+0x88/0x150
         perf_event_init+0x1db/0x20b
         start_kernel+0x3ae/0x53c
         secondary_startup_64+0xa4/0xb0

  -> #1 (pmus_lock){+.+.}-{3:3}:
         __mutex_lock+0x9f/0x930
         perf_event_init_cpu+0x4f/0x150
         cpuhp_invoke_callback+0xb1/0x900
         _cpu_up.constprop.26+0x9f/0x130
         cpu_up+0x7b/0xc0
         bringup_nonboot_cpus+0x4f/0x60
         smp_init+0x26/0x71
         kernel_init_freeable+0x110/0x258
         kernel_init+0xa/0x103
         ret_from_fork+0x1f/0x30

  -> #0 (cpu_hotplug_lock){++++}-{0:0}:
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         cpus_read_lock+0x39/0xb0
         alloc_workqueue+0x378/0x450
         __btrfs_alloc_workqueue+0x15d/0x200
         btrfs_alloc_workqueue+0x51/0x160
         scrub_workers_get+0x5a/0x170
         btrfs_scrub_dev+0x18c/0x630
         btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
         btrfs_ioctl+0x2799/0x30a0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&fs_info->scrub_lock);
                                 lock(&fs_devs->device_list_mutex);
                                 lock(&fs_info->scrub_lock);
    lock(cpu_hotplug_lock);

   *** DEADLOCK ***

  2 locks held by btrfs/229626:
   #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
   #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

  stack backtrace:
  CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
  Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
  Call Trace:
   dump_stack+0x78/0xa0
   check_noncircular+0x165/0x180
   __lock_acquire+0x1272/0x2310
   lock_acquire+0x9e/0x360
   ? alloc_workqueue+0x378/0x450
   cpus_read_lock+0x39/0xb0
   ? alloc_workqueue+0x378/0x450
   alloc_workqueue+0x378/0x450
   ? rcu_read_lock_sched_held+0x52/0x80
   __btrfs_alloc_workqueue+0x15d/0x200
   btrfs_alloc_workqueue+0x51/0x160
   scrub_workers_get+0x5a/0x170
   btrfs_scrub_dev+0x18c/0x630
   ? start_transaction+0xd1/0x5d0
   btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
   btrfs_ioctl+0x2799/0x30a0
   ? do_sigaction+0x102/0x250
   ? lockdep_hardirqs_on_prepare+0xca/0x160
   ? _raw_spin_unlock_irq+0x24/0x30
   ? trace_hardirqs_on+0x1c/0xe0
   ? _raw_spin_unlock_irq+0x24/0x30
   ? do_sigaction+0x102/0x250
   ? ksys_ioctl+0x83/0xc0
   ksys_ioctl+0x83/0xc0
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x50/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.

Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turn
needs the device_list_mutex, so it can lead to a deadlock. A different
problem for which this fix is a solution.

Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
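
The "allocate outside the lock, assign under the lock" idea can be sketched in user-space C with a pthread mutex. This is a simplified illustration and not the btrfs code: `struct workers`, the cut-down `struct fs_info`, and the plain integer refcount are all assumptions, and the real helper deals with several workqueues and a kernel refcount.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the scrub workqueues and fs_info. */
struct workers { int id; };

struct fs_info {
	pthread_mutex_t scrub_lock;
	struct workers *scrub_workers;
	int scrub_workers_refcnt;
};

/* Returns 0 on success, -1 if the allocation failed. */
static int scrub_workers_get(struct fs_info *fs_info)
{
	struct workers *w;

	assert(fs_info != NULL);

	/* Allocate with no lock held, so the allocation adds no lock dependency. */
	w = calloc(1, sizeof(*w));
	if (!w)
		return -1;

	/* Take the lock only to publish the result. */
	pthread_mutex_lock(&fs_info->scrub_lock);
	if (fs_info->scrub_workers_refcnt == 0) {
		fs_info->scrub_workers = w;
		w = NULL;
	}
	fs_info->scrub_workers_refcnt++;
	pthread_mutex_unlock(&fs_info->scrub_lock);

	/* Workers already existed: drop the spare copy (free(NULL) is a no-op). */
	free(w);
	return 0;
}

static void scrub_workers_put(struct fs_info *fs_info)
{
	struct workers *w = NULL;

	pthread_mutex_lock(&fs_info->scrub_lock);
	if (--fs_info->scrub_workers_refcnt == 0) {
		w = fs_info->scrub_workers;
		fs_info->scrub_workers = NULL;
	}
	pthread_mutex_unlock(&fs_info->scrub_lock);

	/* Free after unlocking, mirroring the allocation side. */
	free(w);
}
```

The cost of this shape is an occasional wasted allocation when two callers race, which is the trade made to keep the allocator's dependencies out from under the lock.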
…s metrics" test

Linux 5.9 introduced perf test case "Parse and process metrics" and
on s390 this test case always dumps core:

  [root@t35lp67 perf]# ./perf test -vvvv -F 67
  67: Parse and process metrics :
  --- start ---
  metric expr inst_retired.any / cpu_clk_unhalted.thread for IPC
  parsing metric: inst_retired.any / cpu_clk_unhalted.thread
  Segmentation fault (core dumped)
  [root@t35lp67 perf]#

I debugged this core dump and gdb shows this call chain:

  (gdb) where
  #0 0x000003ffabc3192a in __strnlen_c_1 () from /lib64/libc.so.6
  #1 0x000003ffabc293de in strcasestr () from /lib64/libc.so.6
  #2 0x0000000001102ba2 in match_metric(list=0x1e6ea20 "inst_retired.any", n=<optimized out>) at util/metricgroup.c:368
  #3 find_metric (map=<optimized out>, map=<optimized out>, metric=0x1e6ea20 "inst_retired.any") at util/metricgroup.c:765
  #4 __resolve_metric (ids=0x0, map=<optimized out>, metric_list=0x0, metric_no_group=<optimized out>, m=<optimized out>) at util/metricgroup.c:844
  #5 resolve_metric (ids=0x0, map=0x0, metric_list=0x0, metric_no_group=<optimized out>) at util/metricgroup.c:881
  #6 metricgroup__add_metric (metric=<optimized out>, metric_no_group=metric_no_group@entry=false, events=<optimized out>, events@entry=0x3ffd84fb878, metric_list=0x0, metric_list@entry=0x3ffd84fb868, map=0x0) at util/metricgroup.c:943
  #7 0x00000000011034ae in metricgroup__add_metric_list (map=0x13f9828 <map>, metric_list=0x3ffd84fb868, events=0x3ffd84fb878, metric_no_group=<optimized out>, list=<optimized out>) at util/metricgroup.c:988
  #8 parse_groups (perf_evlist=perf_evlist@entry=0x1e70260, str=str@entry=0x12f34b2 "IPC", metric_no_group=<optimized out>, metric_no_merge=<optimized out>, fake_pmu=fake_pmu@entry=0x1462f18 <perf_pmu.fake>, metric_events=0x3ffd84fba58, map=0x1) at util/metricgroup.c:1040
  #9 0x0000000001103eb2 in metricgroup__parse_groups_test( evlist=evlist@entry=0x1e70260, map=map@entry=0x13f9828 <map>, str=str@entry=0x12f34b2 "IPC", metric_no_group=metric_no_group@entry=false, metric_no_merge=metric_no_merge@entry=false, metric_events=0x3ffd84fba58) at util/metricgroup.c:1082
  #10 0x00000000010c84d8 in __compute_metric (ratio2=0x0, name2=0x0, ratio1=<synthetic pointer>, name1=0x12f34b2 "IPC", vals=0x3ffd84fbad8, name=0x12f34b2 "IPC") at tests/parse-metric.c:159
  #11 compute_metric (ratio=<synthetic pointer>, vals=0x3ffd84fbad8, name=0x12f34b2 "IPC") at tests/parse-metric.c:189
  #12 test_ipc () at tests/parse-metric.c:208
  ..... ..... omitted many more lines

This test case was added with
commit 218ca91 ("perf tests: Add parse metric test for frontend metric").

When I compile with make DEBUG=y it works fine and I do not get a core dump.

It turned out that the function call chain listed above worked on a struct
pmu_event array which requires a trailing element with zeroes, and that
element was missing. The macro map_for_each_event() loops over that array
and tests for members metric_expr/metric_name/metric_group being non-NULL.
Adding this element fixes the issue.

Output after:

  [root@t35lp46 perf]# ./perf test 67
  67: Parse and process metrics : Ok
  [root@t35lp46 perf]#

Committer notes:

As Ian remarks, this is not s390 specific:

<quote Ian>
  This also shows up with address sanitizer on all architectures
  (perhaps change the patch title) and perhaps add a "Fixes: <commit>"
  tag.

  =================================================================
  ==4718==ERROR: AddressSanitizer: global-buffer-overflow on address 0x55c93b4d59e8 at pc 0x55c93a1541e2 bp 0x7ffd24327c60 sp 0x7ffd24327c58
  READ of size 8 at 0x55c93b4d59e8 thread T0
    #0 0x55c93a1541e1 in find_metric tools/perf/util/metricgroup.c:764:2
    #1 0x55c93a153e6c in __resolve_metric tools/perf/util/metricgroup.c:844:9
    #2 0x55c93a152f18 in resolve_metric tools/perf/util/metricgroup.c:881:9
    #3 0x55c93a1528db in metricgroup__add_metric tools/perf/util/metricgroup.c:943:9
    #4 0x55c93a151996 in metricgroup__add_metric_list tools/perf/util/metricgroup.c:988:9
    #5 0x55c93a1511b9 in parse_groups tools/perf/util/metricgroup.c:1040:8
    #6 0x55c93a1513e1 in metricgroup__parse_groups_test tools/perf/util/metricgroup.c:1082:9
    #7 0x55c93a0108ae in __compute_metric tools/perf/tests/parse-metric.c:159:8
    #8 0x55c93a010744 in compute_metric tools/perf/tests/parse-metric.c:189:9
    #9 0x55c93a00f5ee in test_ipc tools/perf/tests/parse-metric.c:208:2
    #10 0x55c93a00f1e8 in test__parse_metric tools/perf/tests/parse-metric.c:345:2
    #11 0x55c939fd7202 in run_test tools/perf/tests/builtin-test.c:410:9
    #12 0x55c939fd6736 in test_and_print tools/perf/tests/builtin-test.c:440:9
    #13 0x55c939fd58c3 in __cmd_test tools/perf/tests/builtin-test.c:661:4
    #14 0x55c939fd4e02 in cmd_test tools/perf/tests/builtin-test.c:807:9
    #15 0x55c939e4763d in run_builtin tools/perf/perf.c:313:11
    #16 0x55c939e46475 in handle_internal_command tools/perf/perf.c:365:8
    #17 0x55c939e4737e in run_argv tools/perf/perf.c:409:2
    #18 0x55c939e45f7e in main tools/perf/perf.c:539:3

  0x55c93b4d59e8 is located 0 bytes to the right of global variable 'pme_test' defined in 'tools/perf/tests/parse-metric.c:17:25' (0x55c93b4d54a0) of size 1352
  SUMMARY: AddressSanitizer: global-buffer-overflow tools/perf/util/metricgroup.c:764:2 in find_metric
  Shadow bytes around the buggy address:
    0x0ab9a7692ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  =>0x0ab9a7692b30: 00 00 00 00 00 00 00 00 00 00 00 00 00[f9]f9 f9
    0x0ab9a7692b40: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
    0x0ab9a7692b50: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
    0x0ab9a7692b60: f9 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00
    0x0ab9a7692b70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b80: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
  Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable:           00
    Partially addressable: 01 02 03 04 05 06 07
    Heap left redzone:       fa
    Freed heap region:       fd
    Stack left redzone:      f1
    Stack mid redzone:       f2
    Stack right redzone:     f3
    Stack after return:      f5
    Stack use after scope:   f8
    Global redzone:          f9
    Global init order:       f6
    Poisoned by user:        f7
    Container overflow:      fc
    Array cookie:            ac
    Intra object redzone:    bb
    ASan internal:           fe
    Left alloca redzone:     ca
    Right alloca redzone:    cb
    Shadow gap:              cc
</quote>

I'm also adding the missing "Fixes" tag and setting just .name to NULL,
as doing it that way is more compact (the compiler will zero out
everything else) and the table iterators look for .name being NULL as
the sentinel marking the end of the table.

Fixes: 0a507af ("perf tests: Add parse metric test for ipc metric")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Link: http://lore.kernel.org/lkml/20200825071211.16959-1-tmricht@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
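
A minimal sketch of the sentinel-terminated table pattern follows. The struct is a cut-down, hypothetical version of `struct pmu_event`, the loop condition is an assumption modelled on the description above (stop when the name and metric members are all NULL), and the exact `strcmp()` match is a simplification of what perf does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down table entry; the real struct has more members. */
struct pmu_event {
	const char *name;
	const char *metric_expr;
	const char *metric_name;
	const char *metric_group;
};

static const struct pmu_event pme_test[] = {
	{
		.metric_expr = "inst_retired.any / cpu_clk_unhalted.thread",
		.metric_name = "IPC",
	},
	{
		.name = NULL,	/* sentinel: the compiler zeroes the other members */
	},
};

/* Walk the table until the all-NULL sentinel; returns NULL if not found. */
static const struct pmu_event *find_metric(const struct pmu_event *table,
					   const char *metric)
{
	const struct pmu_event *pe;

	assert(table != NULL && metric != NULL);

	for (pe = table;
	     pe->name || pe->metric_group || pe->metric_name;
	     pe++) {
		if (pe->metric_name && strcmp(pe->metric_name, metric) == 0)
			return pe;
	}
	return NULL;
}
```

Without the trailing entry, a lookup for a name that is not in the table keeps reading past the end of the array, which matches the global-buffer-overflow that ASan reports.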
Krzysztof Kozlowski says:

====================
nfc: s3fwrn5: Few cleanups

Changes since v2:
1. Fix dtschema ID after rename (patch 1/8).
2. Apply patch 9/9 (defconfig change).

Changes since v1:
1. Rename dtschema file and add additionalProperties:false, as Rob
   suggested,
2. Add Marek's tested-by,
3. New patches: #4, #5, #6, #7 and #9.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Commit

  b972fdb ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")

didn't clear all the information from the scanned system and, more
specifically, left ghes_hw.num_dimms to its previous value. On a
second load (CONFIG_DEBUG_TEST_DRIVER_REMOVE=y), the driver would use
the leftover num_dimms value which is not 0 and thus the 0 check in
enumerate_dimms() will get bypassed and it would go directly to the
pointer deref:

  d = &hw->dimms[hw->num_dimms];

which is, of course, NULL:

  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: 0002 [#1] PREEMPT SMP
  CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #7
  Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018
  RIP: 0010:enumerate_dimms.cold+0x7b/0x375

Reset the whole ghes_hw on driver unregister so that no stale values
are used on a second system scan.

Fixes: b972fdb ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")
Cc: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200911164817.GA19320@zn.tnic
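
The "reset the whole structure on unregister" fix can be illustrated with a small user-space sketch. The struct layout and the `system_scan()`/`hw_unregister()` helpers are hypothetical simplifications, not the driver's code; the point is only that clearing the pointer while leaving `num_dimms` stale would make the next scan start indexing from the old count.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct dimm_info { int handle; };

/* Hypothetical, simplified stand-in for the driver's ghes_hw state. */
static struct ghes_hw_detail {
	unsigned int num_dimms;
	struct dimm_info *dimms;
} ghes_hw;

/* Record n DIMMs, appending at index num_dimms. Returns 0 or -1. */
static int system_scan(unsigned int n)
{
	unsigned int i;

	assert(n > 0);

	ghes_hw.dimms = calloc(n, sizeof(*ghes_hw.dimms));
	if (!ghes_hw.dimms)
		return -1;

	for (i = 0; i < n; i++) {
		/* Only valid if num_dimms started from 0 for this scan. */
		struct dimm_info *d = &ghes_hw.dimms[ghes_hw.num_dimms];

		d->handle = (int)i;
		ghes_hw.num_dimms++;
	}
	return 0;
}

static void hw_unregister(void)
{
	free(ghes_hw.dimms);
	/* Reset every member, so no stale count survives to the next scan. */
	memset(&ghes_hw, 0, sizeof(ghes_hw));
}
```

Zeroing the whole structure rather than individual members also covers any field added later, at the cost of assuming that all-zero is a valid initial state.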
The aliases were never released, causing the following leaks:
Indirect leak of 1224 byte(s) in 9 object(s) allocated from:
#0 0x7feefb830628 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x107628)
#1 0x56332c8f1b62 in __perf_pmu__new_alias util/pmu.c:322
#2 0x56332c8f401f in pmu_add_cpu_aliases_map util/pmu.c:778
#3 0x56332c792ce9 in __test__pmu_event_aliases tests/pmu-events.c:295
#4 0x56332c792ce9 in test_aliases tests/pmu-events.c:367
#5 0x56332c76a09b in run_test tests/builtin-test.c:410
#6 0x56332c76a09b in test_and_print tests/builtin-test.c:440
#7 0x56332c76ce69 in __cmd_test tests/builtin-test.c:695
#8 0x56332c76ce69 in cmd_test tests/builtin-test.c:807
#9 0x56332c7d2214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
#10 0x56332c6701a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
#11 0x56332c6701a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
#12 0x56332c6701a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
#13 0x7feefb359cc9 in __libc_start_main ../csu/libc-start.c:308
Fixes: 956a783 ("perf test: Test pmu-events aliases")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: John Garry <john.garry@huawei.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-11-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
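
As a rough sketch of what releasing the aliases involves, the example below builds a singly linked list of heap-allocated entries and frees each node together with the string it owns. `struct alias`, `add_alias()` and `free_aliases()` are hypothetical names chosen for illustration; perf uses its own list type and alias structure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical alias entry that owns its name string. */
struct alias {
	struct alias *next;
	char *name;
};

/* Prepend a copy of `name` to the list. Returns 0 or -1 on allocation failure. */
static int add_alias(struct alias **head, const char *name)
{
	struct alias *a;
	size_t len;

	assert(head != NULL && name != NULL);

	a = malloc(sizeof(*a));
	if (!a)
		return -1;

	len = strlen(name) + 1;
	a->name = malloc(len);
	if (!a->name) {
		free(a);
		return -1;
	}
	memcpy(a->name, name, len);

	a->next = *head;
	*head = a;
	return 0;
}

/* Free every entry and its name; returns how many entries were released. */
static size_t free_aliases(struct alias **head)
{
	size_t n = 0;

	while (*head) {
		struct alias *a = *head;

		*head = a->next;	/* unlink before freeing */
		free(a->name);
		free(a);
		n++;
	}
	return n;
}
```

A test that creates such a list needs a matching `free_aliases()` call on every exit path, otherwise each entry shows up as an indirect leak like the one in the report above.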
The evsel->unit borrows a pointer from a pmu event or alias instead of
owning a string. But the tool event (duration_time) passes the result of
strdup(), which caused a leak.
It was found by ASAN during metric test:
Direct leak of 210 byte(s) in 70 object(s) allocated from:
#0 0x7fe366fca0b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5)
#1 0x559fbbcc6ea3 in add_event_tool util/parse-events.c:414
#2 0x559fbbcc6ea3 in parse_events_add_tool util/parse-events.c:1414
#3 0x559fbbd8474d in parse_events_parse util/parse-events.y:439
#4 0x559fbbcc95da in parse_events__scanner util/parse-events.c:2096
#5 0x559fbbcc95da in __parse_events util/parse-events.c:2141
#6 0x559fbbc28555 in check_parse_id tests/pmu-events.c:406
#7 0x559fbbc28555 in check_parse_id tests/pmu-events.c:393
#8 0x559fbbc28555 in check_parse_cpu tests/pmu-events.c:415
#9 0x559fbbc28555 in test_parsing tests/pmu-events.c:498
#10 0x559fbbc0109b in run_test tests/builtin-test.c:410
#11 0x559fbbc0109b in test_and_print tests/builtin-test.c:440
#12 0x559fbbc03e69 in __cmd_test tests/builtin-test.c:695
#13 0x559fbbc03e69 in cmd_test tests/builtin-test.c:807
#14 0x559fbbc691f4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
#15 0x559fbbb071a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
#16 0x559fbbb071a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
#17 0x559fbbb071a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
#18 0x7fe366b68cc9 in __libc_start_main ../csu/libc-start.c:308
Fixes: f0fbb11 ("perf stat: Implement duration_time as a proper event")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-6-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
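
The ownership rule behind this leak can be sketched as follows. The struct and both helpers are hypothetical and much smaller than perf's `struct evsel`; the sketch assumes that pointing `unit` at a string literal is an acceptable way to satisfy the "borrowed" contract, since a literal has static storage and never needs freeing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down evsel with one owned and one borrowed string. */
struct evsel {
	char *name;		/* owned: freed in evsel_delete() */
	const char *unit;	/* borrowed: never freed by the evsel */
};

static struct evsel *evsel_new_tool(const char *name)
{
	struct evsel *evsel;
	size_t len;

	assert(name != NULL);

	evsel = calloc(1, sizeof(*evsel));
	if (!evsel)
		return NULL;

	len = strlen(name) + 1;
	evsel->name = malloc(len);
	if (!evsel->name) {
		free(evsel);
		return NULL;
	}
	memcpy(evsel->name, name, len);

	/*
	 * Borrow a literal. A heap copy stored here would have no owner,
	 * because evsel_delete() deliberately does not free unit.
	 */
	evsel->unit = "ns";
	return evsel;
}

static void evsel_delete(struct evsel *evsel)
{
	if (!evsel)
		return;
	free(evsel->name);
	free(evsel);	/* unit is left alone on purpose */
}
```

Marking the borrowed member `const char *` is a cheap way to document the contract, since passing it to `free()` then needs an explicit cast.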
The test_generic_metric() function failed to release the entries in the
pctx. ASan reported the following leak (and more):
Direct leak of 128 byte(s) in 1 object(s) allocated from:
#0 0x7f4c9396980e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
#1 0x55f7e748cc14 in hashmap_grow (/home/namhyung/project/linux/tools/perf/perf+0x90cc14)
#2 0x55f7e748d497 in hashmap__insert (/home/namhyung/project/linux/tools/perf/perf+0x90d497)
#3 0x55f7e7341667 in hashmap__set /home/namhyung/project/linux/tools/perf/util/hashmap.h:111
#4 0x55f7e7341667 in expr__add_ref util/expr.c:120
#5 0x55f7e7292436 in prepare_metric util/stat-shadow.c:783
#6 0x55f7e729556d in test_generic_metric util/stat-shadow.c:858
#7 0x55f7e712390b in compute_single tests/parse-metric.c:128
#8 0x55f7e712390b in __compute_metric tests/parse-metric.c:180
#9 0x55f7e712446d in compute_metric tests/parse-metric.c:196
#10 0x55f7e712446d in test_dcache_l2 tests/parse-metric.c:295
#11 0x55f7e712446d in test__parse_metric tests/parse-metric.c:355
#12 0x55f7e70be09b in run_test tests/builtin-test.c:410
#13 0x55f7e70be09b in test_and_print tests/builtin-test.c:440
#14 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661
#15 0x55f7e70c101a in cmd_test tests/builtin-test.c:807
#16 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
#17 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
#18 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
#19 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
#20 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308
Fixes: 6d432c4 ("perf tools: Add test_generic_metric function")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-8-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
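The leak described above follows a common pattern: a function fills a local context with heap-allocated entries and returns without clearing it. Below is a minimal sketch of that pattern and its remedy. It is not the perf code: `demo_ctx`, `demo_ctx_add()`, `demo_ctx_clear()` and `demo_generic_metric()` are hypothetical stand-ins, and `live_entries` exists only so the effect can be observed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a context that owns heap-allocated entries. */
struct demo_ctx {
	char **ids;
	size_t nr;
};

static int live_entries; /* outstanding entries, tracked for the demo only */

static int demo_ctx_add(struct demo_ctx *ctx, const char *id)
{
	size_t len = strlen(id) + 1;
	char *copy = malloc(len);
	char **ids;

	if (!copy)
		return -1;
	memcpy(copy, id, len);

	ids = realloc(ctx->ids, (ctx->nr + 1) * sizeof(*ids));
	if (!ids) {
		free(copy);
		return -1;
	}
	ids[ctx->nr++] = copy;
	ctx->ids = ids;
	live_entries++;
	return 0;
}

static void demo_ctx_clear(struct demo_ctx *ctx)
{
	size_t i;

	for (i = 0; i < ctx->nr; i++) {
		free(ctx->ids[i]);
		live_entries--;
	}
	free(ctx->ids);
	ctx->ids = NULL;
	ctx->nr = 0;
}

/* Populate a local context, use it, then release it on every exit path. */
static int demo_generic_metric(void)
{
	struct demo_ctx ctx = { NULL, 0 };
	int ret = -1;

	if (demo_ctx_add(&ctx, "cycles") || demo_ctx_add(&ctx, "instructions"))
		goto out;
	ret = (int)ctx.nr; /* stand-in for evaluating the expression */
out:
	demo_ctx_clear(&ctx); /* omitting this call is what a leak checker reports */
	return ret;
}
```

Calling `demo_generic_metric()` leaves `live_entries` at zero; removing the `demo_ctx_clear()` call would leave both entries allocated after the function returns.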
metricgroup__add_metric() can find multiple matches for a metric group,
and any of them can fail. It can also fail partway through, for example in
resolve_metric(), even for a single metric.
In those cases, the intermediate list and ids are leaked, like:
Direct leak of 3 byte(s) in 1 object(s) allocated from:
#0 0x7f4c938f40b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5)
#1 0x55f7e71c1bef in __add_metric util/metricgroup.c:683
#2 0x55f7e71c31d0 in add_metric util/metricgroup.c:906
#3 0x55f7e71c3844 in metricgroup__add_metric util/metricgroup.c:940
#4 0x55f7e71c488d in metricgroup__add_metric_list util/metricgroup.c:993
#5 0x55f7e71c488d in parse_groups util/metricgroup.c:1045
#6 0x55f7e71c60a4 in metricgroup__parse_groups_test util/metricgroup.c:1087
#7 0x55f7e71235ae in __compute_metric tests/parse-metric.c:164
#8 0x55f7e7124650 in compute_metric tests/parse-metric.c:196
#9 0x55f7e7124650 in test_recursion_fail tests/parse-metric.c:318
#10 0x55f7e7124650 in test__parse_metric tests/parse-metric.c:356
#11 0x55f7e70be09b in run_test tests/builtin-test.c:410
#12 0x55f7e70be09b in test_and_print tests/builtin-test.c:440
#13 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661
#14 0x55f7e70c101a in cmd_test tests/builtin-test.c:807
#15 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
#16 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
#17 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
#18 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
#19 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308
Fixes: 83de0b7 ("perf metric: Collect referenced metrics in struct metric_ref_node")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-9-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
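The failure mode described above (a function that builds an intermediate list and can fail partway through) is usually handled by freeing the intermediate list on the error path and handing it to the caller only on success. A minimal sketch under that assumption follows. The names `demo_node`, `demo_list_free()` and `demo_add_group()` are hypothetical and do not mirror the metricgroup code; a negative id simulates a mid-way failure.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_node {
	int id;
	struct demo_node *next;
};

static int live_nodes; /* outstanding nodes, tracked for the demo only */

static void demo_list_free(struct demo_node *head)
{
	while (head) {
		struct demo_node *next = head->next;

		free(head);
		live_nodes--;
		head = next;
	}
}

/*
 * Add ids[0..n) in front of *out. A negative id simulates a failure in the
 * middle. Nodes are collected on a temporary list first, so a partial
 * result never reaches the caller and can be released in one place.
 */
static int demo_add_group(struct demo_node **out, const int *ids, size_t n)
{
	struct demo_node *tmp = NULL, **tail = &tmp;
	size_t i;

	for (i = 0; i < n; i++) {
		struct demo_node *node;

		if (ids[i] < 0)
			goto err;
		node = malloc(sizeof(*node));
		if (!node)
			goto err;
		live_nodes++;
		node->id = ids[i];
		node->next = NULL;
		*tail = node;
		tail = &node->next;
	}
	/* Success: splice the whole temporary list in front of the caller's. */
	*tail = *out;
	*out = tmp;
	return 0;
err:
	demo_list_free(tmp); /* without this, the nodes added so far leak */
	return -1;
}
```

A failed call leaves the caller's list and the node count exactly as they were before the call.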
The following leaks were detected by ASAN:
Indirect leak of 360 byte(s) in 9 object(s) allocated from:
#0 0x7fecc305180e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
#1 0x560578f6dce5 in perf_pmu__new_format util/pmu.c:1333
#2 0x560578f752fc in perf_pmu_parse util/pmu.y:59
#3 0x560578f6a8b7 in perf_pmu__format_parse util/pmu.c:73
#4 0x560578e07045 in test__pmu tests/pmu.c:155
#5 0x560578de109b in run_test tests/builtin-test.c:410
#6 0x560578de109b in test_and_print tests/builtin-test.c:440
#7 0x560578de401a in __cmd_test tests/builtin-test.c:661
#8 0x560578de401a in cmd_test tests/builtin-test.c:807
#9 0x560578e49354 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
#10 0x560578ce71a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
#11 0x560578ce71a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
#12 0x560578ce71a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
#13 0x7fecc2b7acc9 in __libc_start_main ../csu/libc-start.c:308
Fixes: cff7f95 ("perf tests: Move pmu tests into separate object")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-12-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Andrii Nakryiko says:
====================
This patch set introduces a new set of BTF APIs to libbpf that make it convenient to produce BTF types and strings. These APIs will allow libbpf to do more intrusive modifications of a program's BTF (by rewriting it, at least as of right now), which is necessary for the upcoming libbpf static linking. But they are complete and generic, so they can be adopted by anyone who needs to produce BTF type information.
One such example outside of libbpf is pahole, which was converted to these APIs completely (locally, pending landing of these changes in libbpf). The conversion reduces the amount of custom pahole code needed and brings nice savings in memory usage (about 370MB reduction at peak for my kernel configuration) and even in BTF deduplication time (one second reduction, 23.7s -> 22.7s). Memory savings are due to avoiding pahole's own copy of "uncompressed" raw BTF data. The time reduction comes from faster string search and deduplication by relying on a hashmap instead of the BST used by pahole's own code. Consequently, these APIs are already tested on real-world complicated kernel BTF, but there is also a pretty extensive selftest doing extra validations.
Selftests in patch #3 add a set of generic ASSERT_{EQ,STREQ,ERR,OK} macros that are useful for writing shorter and less repetitive selftests. I decided to keep them local to that selftest for now, but if they prove to be useful in more contexts we should move them to test_progs.h. A few more macros (e.g., inequality tests) are probably necessary to have a more complete set.
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
v2->v3:
- resending original patches #7-9 as patches #1-3 due to merge conflict;
v1->v2:
- fixed comments (John);
- renamed btf__append_xxx() into btf__add_xxx() (Alexei);
- added btf__find_str() in addition to btf__add_str();
- btf__new_empty() now sets kernel FD to -1 initially.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
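To illustrate the idea behind the ASSERT_{EQ,STREQ,ERR,OK} family mentioned above, here is a simplified sketch of what such check macros can look like. These are not the selftest's macros: the `DEMO_` names, the `demo_check()` helper and the `demo_failures` counter are assumptions made for this example, and the real definitions may differ in signature and reporting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static int demo_failures; /* number of failed checks so far */

/* Record and report a failed check; return whether the check passed. */
static bool demo_check(bool ok, const char *name, const char *what)
{
	if (!ok) {
		demo_failures++;
		fprintf(stderr, "FAIL: %s (%s)\n", name, what);
	}
	return ok;
}

/* Each macro evaluates to true on success, so it can guard later steps. */
#define DEMO_ASSERT_EQ(actual, expected, name) \
	demo_check((actual) == (expected), (name), #actual " == " #expected)
#define DEMO_ASSERT_STREQ(actual, expected, name) \
	demo_check(strcmp((actual), (expected)) == 0, (name), \
		   #actual " equals " #expected)
#define DEMO_ASSERT_OK(res, name) \
	demo_check((res) == 0, (name), #res " == 0")
#define DEMO_ASSERT_ERR(res, name) \
	demo_check((res) < 0, (name), #res " < 0")
```

A test body can then read as a flat list of checks, for example `if (!DEMO_ASSERT_OK(err, "setup")) return;`, instead of repeating the compare-print-count sequence at every call site.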
Ido Schimmel says:
====================
drop_monitor: Convert to use devlink tracepoint
Drop monitor is able to monitor both software and hardware originated drops. Software drops are monitored by having drop monitor register its probe on the 'kfree_skb' tracepoint. Hardware originated drops are monitored by having devlink call into drop monitor whenever it receives a dropped packet from the underlying hardware.
This patch set converts drop monitor to monitor both software and hardware originated drops in the same way - by registering its probe on the relevant tracepoint.
In addition to drop monitor being more consistent, it is now also possible to build drop monitor as module instead of as a builtin and still monitor hardware originated drops. Initially, CONFIG_NET_DEVLINK implied CONFIG_NET_DROP_MONITOR, but after commit def2fbf ("kconfig: allow symbols implied by y to become m") we can have CONFIG_NET_DEVLINK=y and CONFIG_NET_DROP_MONITOR=m and hardware originated drops will not be monitored.
Patch set overview:
Patch #1 adds a tracepoint in devlink for trap reports.
Patch #2 prepares probe functions in drop monitor for the new tracepoint.
Patch #3 converts drop monitor to use the new tracepoint.
Patches #4-#6 perform cleanups after the conversion.
Patch #7 adds a test case for drop monitor. Both software originated drops and hardware originated drops (using netdevsim) are tested.
Tested:
| CONFIG_NET_DEVLINK | CONFIG_NET_DROP_MONITOR | Build | SW drops | HW drops |
| -------------------|-------------------------|-------|----------|----------|
| y                  | y                       | v     | v        | v        |
| y                  | m                       | v     | v        | v        |
| y                  | n                       | v     | x        | x        |
| n                  | y                       | v     | v        | x        |
| n                  | m                       | v     | v        | x        |
| n                  | n                       | v     | x        | x        |
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With latest llvm trunk, bpf programs under the samples/bpf directory, if using CORE, may experience the following errors:
LLVM ERROR: Cannot select: intrinsic %llvm.preserve.struct.access.index
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: llc -march=bpf -filetype=obj -o samples/bpf/test_probe_write_user_kern.o
1. Running pass 'Function Pass Manager' on module '<stdin>'.
2. Running pass 'BPF DAG->DAG Pattern Instruction Selection' on function '@bpf_prog1'
#0 0x000000000183c26c llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/data/users/yhs/work/llvm-project/llvm/build.cur/install/bin/llc+0x183c26c)
...
#7 0x00000000017c375e (/data/users/yhs/work/llvm-project/llvm/build.cur/install/bin/llc+0x17c375e)
#8 0x00000000016a75c5 llvm::SelectionDAGISel::CannotYetSelect(llvm::SDNode*) (/data/users/yhs/work/llvm-project/llvm/build.cur/install/bin/llc+0x16a75c5)
#9 0x00000000016ab4f8 llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int) (/data/users/yhs/work/llvm-project/llvm/build.cur/install/bin/llc+0x16ab4f8)
...
Aborted (core dumped) | llc -march=bpf -filetype=obj -o samples/bpf/test_probe_write_user_kern.o
The reason is llvm change https://reviews.llvm.org/D87153, which moved the CORE relocation global generation from the beginning of target dependent optimization (llc) to the beginning of target independent optimization (opt).
Since samples/bpf programs do not use vmlinux.h and their clang compilation uses the native architecture, we need to adjust the arch triple at opt level to do CORE relocation global generation properly. Otherwise, the above error will appear.
This patch fixes the issue by introducing opt and llvm-dis into the compilation chain, which will do proper CORE relocation global generation as well as O2 level optimization.
Tested with llvm10, llvm11 and trunk/llvm12.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
There are two sites in the atm mpoa code that assume the fetched net_device object is of lec type. However, both of them only check that the device name starts with the "lec" pattern string. A malicious user can therefore hijack this by creating another device whose name starts with that pattern, causing type confusion. For example, creating a *team* interface with a lecX name, binding that interface and sending messages produces a crash like the one below:
[ 18.450000] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 18.452366] BUG: unable to handle page fault for address: ffff888005702a70
[ 18.454253] #PF: supervisor instruction fetch in kernel mode
[ 18.455058] #PF: error_code(0x0011) - permissions violation
[ 18.455366] PGD 3801067 P4D 3801067 PUD 3802067 PMD 80000000056000e3
[ 18.455725] Oops: 0011 [#1] PREEMPT SMP PTI
[ 18.455966] CPU: 0 PID: 130 Comm: trigger Not tainted 6.1.90 #7
[ 18.456921] RIP: 0010:0xffff888005702a70
[ 18.457151] Code: .....
[ 18.458168] RSP: 0018:ffffc90000677bf8 EFLAGS: 00010286
[ 18.458461] RAX: ffff888005702a70 RBX: ffff888005702000 RCX: 000000000000001b
[ 18.458850] RDX: ffffc90000677c10 RSI: ffff88800565e0a8 RDI: ffff888005702000
[ 18.459248] RBP: ffffc90000677c68 R08: 0000000000000000 R09: 0000000000000000
[ 18.459644] R10: 0000000000000000 R11: ffff888005702a70 R12: ffff88800556c000
[ 18.460033] R13: ffff888005964900 R14: ffff8880054b4000 R15: ffff8880054b5000
[ 18.460425] FS: 0000785e61b5a740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[ 18.460872] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.461183] CR2: ffff888005702a70 CR3: 00000000054c2000 CR4: 00000000000006f0
[ 18.461580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 18.461974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 18.462368] Call Trace:
[ 18.462518] <TASK>
[ 18.462645] ? __die_body+0x64/0xb0
[ 18.462856] ? page_fault_oops+0x353/0x3e0
[ 18.463092] ? exc_page_fault+0xaf/0xd0
[ 18.463322] ? asm_exc_page_fault+0x22/0x30
[ 18.463589] ? msg_from_mpoad+0x431/0x9d0
[ 18.463820] ? vcc_sendmsg+0x165/0x3b0
[ 18.464031] vcc_sendmsg+0x20a/0x3b0
[ 18.464238] ? wake_bit_function+0x80/0x80
[ 18.464511] __sys_sendto+0x38c/0x3a0
[ 18.464729] ? percpu_counter_add_batch+0x87/0xb0
[ 18.465002] __x64_sys_sendto+0x22/0x30
[ 18.465219] do_syscall_64+0x6c/0xa0
[ 18.465465] ? preempt_count_add+0x54/0xb0
[ 18.465697] ? up_read+0x37/0x80
[ 18.465883] ? do_user_addr_fault+0x25e/0x5b0
[ 18.466126] ? exit_to_user_mode_prepare+0x12/0xb0
[ 18.466435] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 18.466727] RIP: 0033:0x785e61be4407
[ 18.467948] RSP: 002b:00007ffe61ae2150 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
[ 18.468368] RAX: ffffffffffffffda RBX: 0000785e61b5a740 RCX: 0000785e61be4407
[ 18.468758] RDX: 000000000000019c RSI: 00007ffe61ae21c0 RDI: 0000000000000003
[ 18.469149] RBP: 00007ffe61ae2370 R08: 0000000000000000 R09: 0000000000000000
[ 18.469542] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[ 18.469936] R13: 00007ffe61ae2498 R14: 0000785e61d74000 R15: 000057bddcbabd98
There are several ways to validate the net_device object correctly. For example, xgbe_netdev_event() checks the `netdev_ops` field, and clip_device_event() checks the `type` field. Since the related variable `lec_netdev_ops` is not defined in the same file, introduce another type value, `ARPHRD_ATM_LANE`, for a simple and correct check.
By the way, this bug dates back to pre-git history (2.3.15), hence the Fixes tag uses the first git commit for tracking.
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Fixes: 1da177e ("Linux-2.6.12-rc2")
Signed-off-by: NipaLocal <nipa@local>
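The difference between the name check and the type check described above can be shown with a small user-space model. This is only a sketch: `demo_net_device`, the `DEMO_TYPE_*` values and both helper functions are hypothetical, and the numeric values are not the kernel's ARPHRD_* constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical type values; the real ARPHRD_* constants live in kernel headers. */
enum demo_dev_type {
	DEMO_TYPE_ETHER = 1,
	DEMO_TYPE_ATM_LANE = 2,
};

/* Simplified model of a network device: a user-chosen name and a driver-set type. */
struct demo_net_device {
	char name[16];
	enum demo_dev_type type;
};

/* Name-only check: any device named "lecX" passes, whatever it really is. */
static bool demo_is_lec_by_name(const struct demo_net_device *dev)
{
	return strncmp(dev->name, "lec", 3) == 0;
}

/* Type check: passes only for devices that were set up with the LANE type. */
static bool demo_is_lec_by_type(const struct demo_net_device *dev)
{
	return dev->type == DEMO_TYPE_ATM_LANE;
}
```

A device named `lec0` whose type is `DEMO_TYPE_ETHER` passes the name check but fails the type check, which is the gap the name-only test leaves open: the name is under the user's control, while the type is set by the driver that created the device.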
There are two sites in atm mpoa code that believe the fetched object net_device is of lec type. However, both of them do just name checking to ensure that the device name starts with "lec" pattern string. That is, malicious user can hijack this by creating another device starting with that pattern, thereby causing type confusion. For example, create a *team* interface with lecX name, bind that interface and send messages will get a crash like below: [ 18.450000] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 18.452366] BUG: unable to handle page fault for address: ffff888005702a70 [ 18.454253] #PF: supervisor instruction fetch in kernel mode [ 18.455058] #PF: error_code(0x0011) - permissions violation [ 18.455366] PGD 3801067 P4D 3801067 PUD 3802067 PMD 80000000056000e3 [ 18.455725] Oops: 0011 [kernel-patches#1] PREEMPT SMP PTI [ 18.455966] CPU: 0 PID: 130 Comm: trigger Not tainted 6.1.90 kernel-patches#7 [ 18.456921] RIP: 0010:0xffff888005702a70 [ 18.457151] Code: ..... [ 18.458168] RSP: 0018:ffffc90000677bf8 EFLAGS: 00010286 [ 18.458461] RAX: ffff888005702a70 RBX: ffff888005702000 RCX: 000000000000001b [ 18.458850] RDX: ffffc90000677c10 RSI: ffff88800565e0a8 RDI: ffff888005702000 [ 18.459248] RBP: ffffc90000677c68 R08: 0000000000000000 R09: 0000000000000000 [ 18.459644] R10: 0000000000000000 R11: ffff888005702a70 R12: ffff88800556c000 [ 18.460033] R13: ffff888005964900 R14: ffff8880054b4000 R15: ffff8880054b5000 [ 18.460425] FS: 0000785e61b5a740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 [ 18.460872] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 18.461183] CR2: ffff888005702a70 CR3: 00000000054c2000 CR4: 00000000000006f0 [ 18.461580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 18.461974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 18.462368] Call Trace: [ 18.462518] <TASK> [ 18.462645] ? __die_body+0x64/0xb0 [ 18.462856] ? page_fault_oops+0x353/0x3e0 [ 18.463092] ? 
exc_page_fault+0xaf/0xd0 [ 18.463322] ? asm_exc_page_fault+0x22/0x30 [ 18.463589] ? msg_from_mpoad+0x431/0x9d0 [ 18.463820] ? vcc_sendmsg+0x165/0x3b0 [ 18.464031] vcc_sendmsg+0x20a/0x3b0 [ 18.464238] ? wake_bit_function+0x80/0x80 [ 18.464511] __sys_sendto+0x38c/0x3a0 [ 18.464729] ? percpu_counter_add_batch+0x87/0xb0 [ 18.465002] __x64_sys_sendto+0x22/0x30 [ 18.465219] do_syscall_64+0x6c/0xa0 [ 18.465465] ? preempt_count_add+0x54/0xb0 [ 18.465697] ? up_read+0x37/0x80 [ 18.465883] ? do_user_addr_fault+0x25e/0x5b0 [ 18.466126] ? exit_to_user_mode_prepare+0x12/0xb0 [ 18.466435] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 18.466727] RIP: 0033:0x785e61be4407 [ 18.467948] RSP: 002b:00007ffe61ae2150 EFLAGS: 00000202 ORIG_RAX: 000000000000002c [ 18.468368] RAX: ffffffffffffffda RBX: 0000785e61b5a740 RCX: 0000785e61be4407 [ 18.468758] RDX: 000000000000019c RSI: 00007ffe61ae21c0 RDI: 0000000000000003 [ 18.469149] RBP: 00007ffe61ae2370 R08: 0000000000000000 R09: 0000000000000000 [ 18.469542] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [ 18.469936] R13: 00007ffe61ae2498 R14: 0000785e61d74000 R15: 000057bddcbabd98 Correctly validating the net_device object has several methods. For example, function xgbe_netdev_event() checks `netdev_ops` field, function clip_device_event() checks `type` field. Considering the related variable `lec_netdev_ops` is not defined in the same file, so introduce another type value `ARPHRD_ATM_LANE` for a simple and correct check. By the way, this bug dates back to pre-git history (2.3.15), hence use the first reference for tracking. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: 1da177e ("Linux-2.6.12-rc2") Signed-off-by: NipaLocal <nipa@local>
There are two sites in atm mpoa code that believe the fetched object net_device is of lec type. However, both of them do just name checking to ensure that the device name starts with "lec" pattern string. That is, malicious user can hijack this by creating another device starting with that pattern, thereby causing type confusion. For example, create a *team* interface with lecX name, bind that interface and send messages will get a crash like below: [ 18.450000] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 18.452366] BUG: unable to handle page fault for address: ffff888005702a70 [ 18.454253] #PF: supervisor instruction fetch in kernel mode [ 18.455058] #PF: error_code(0x0011) - permissions violation [ 18.455366] PGD 3801067 P4D 3801067 PUD 3802067 PMD 80000000056000e3 [ 18.455725] Oops: 0011 [kernel-patches#1] PREEMPT SMP PTI [ 18.455966] CPU: 0 PID: 130 Comm: trigger Not tainted 6.1.90 kernel-patches#7 [ 18.456921] RIP: 0010:0xffff888005702a70 [ 18.457151] Code: ..... [ 18.458168] RSP: 0018:ffffc90000677bf8 EFLAGS: 00010286 [ 18.458461] RAX: ffff888005702a70 RBX: ffff888005702000 RCX: 000000000000001b [ 18.458850] RDX: ffffc90000677c10 RSI: ffff88800565e0a8 RDI: ffff888005702000 [ 18.459248] RBP: ffffc90000677c68 R08: 0000000000000000 R09: 0000000000000000 [ 18.459644] R10: 0000000000000000 R11: ffff888005702a70 R12: ffff88800556c000 [ 18.460033] R13: ffff888005964900 R14: ffff8880054b4000 R15: ffff8880054b5000 [ 18.460425] FS: 0000785e61b5a740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 [ 18.460872] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 18.461183] CR2: ffff888005702a70 CR3: 00000000054c2000 CR4: 00000000000006f0 [ 18.461580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 18.461974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 18.462368] Call Trace: [ 18.462518] <TASK> [ 18.462645] ? __die_body+0x64/0xb0 [ 18.462856] ? page_fault_oops+0x353/0x3e0 [ 18.463092] ? 
exc_page_fault+0xaf/0xd0 [ 18.463322] ? asm_exc_page_fault+0x22/0x30 [ 18.463589] ? msg_from_mpoad+0x431/0x9d0 [ 18.463820] ? vcc_sendmsg+0x165/0x3b0 [ 18.464031] vcc_sendmsg+0x20a/0x3b0 [ 18.464238] ? wake_bit_function+0x80/0x80 [ 18.464511] __sys_sendto+0x38c/0x3a0 [ 18.464729] ? percpu_counter_add_batch+0x87/0xb0 [ 18.465002] __x64_sys_sendto+0x22/0x30 [ 18.465219] do_syscall_64+0x6c/0xa0 [ 18.465465] ? preempt_count_add+0x54/0xb0 [ 18.465697] ? up_read+0x37/0x80 [ 18.465883] ? do_user_addr_fault+0x25e/0x5b0 [ 18.466126] ? exit_to_user_mode_prepare+0x12/0xb0 [ 18.466435] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 18.466727] RIP: 0033:0x785e61be4407 [ 18.467948] RSP: 002b:00007ffe61ae2150 EFLAGS: 00000202 ORIG_RAX: 000000000000002c [ 18.468368] RAX: ffffffffffffffda RBX: 0000785e61b5a740 RCX: 0000785e61be4407 [ 18.468758] RDX: 000000000000019c RSI: 00007ffe61ae21c0 RDI: 0000000000000003 [ 18.469149] RBP: 00007ffe61ae2370 R08: 0000000000000000 R09: 0000000000000000 [ 18.469542] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [ 18.469936] R13: 00007ffe61ae2498 R14: 0000785e61d74000 R15: 000057bddcbabd98 Correctly validating the net_device object has several methods. For example, function xgbe_netdev_event() checks `netdev_ops` field, function clip_device_event() checks `type` field. Considering the related variable `lec_netdev_ops` is not defined in the same file, so introduce another type value `ARPHRD_ATM_LANE` for a simple and correct check. By the way, this bug dates back to pre-git history (2.3.15), hence use the first reference for tracking. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: 1da177e ("Linux-2.6.12-rc2") Signed-off-by: NipaLocal <nipa@local>
There are two sites in atm mpoa code that believe the fetched object net_device is of lec type. However, both of them do just name checking to ensure that the device name starts with "lec" pattern string. That is, malicious user can hijack this by creating another device starting with that pattern, thereby causing type confusion. For example, create a *team* interface with lecX name, bind that interface and send messages will get a crash like below: [ 18.450000] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 18.452366] BUG: unable to handle page fault for address: ffff888005702a70 [ 18.454253] #PF: supervisor instruction fetch in kernel mode [ 18.455058] #PF: error_code(0x0011) - permissions violation [ 18.455366] PGD 3801067 P4D 3801067 PUD 3802067 PMD 80000000056000e3 [ 18.455725] Oops: 0011 [kernel-patches#1] PREEMPT SMP PTI [ 18.455966] CPU: 0 PID: 130 Comm: trigger Not tainted 6.1.90 kernel-patches#7 [ 18.456921] RIP: 0010:0xffff888005702a70 [ 18.457151] Code: ..... [ 18.458168] RSP: 0018:ffffc90000677bf8 EFLAGS: 00010286 [ 18.458461] RAX: ffff888005702a70 RBX: ffff888005702000 RCX: 000000000000001b [ 18.458850] RDX: ffffc90000677c10 RSI: ffff88800565e0a8 RDI: ffff888005702000 [ 18.459248] RBP: ffffc90000677c68 R08: 0000000000000000 R09: 0000000000000000 [ 18.459644] R10: 0000000000000000 R11: ffff888005702a70 R12: ffff88800556c000 [ 18.460033] R13: ffff888005964900 R14: ffff8880054b4000 R15: ffff8880054b5000 [ 18.460425] FS: 0000785e61b5a740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 [ 18.460872] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 18.461183] CR2: ffff888005702a70 CR3: 00000000054c2000 CR4: 00000000000006f0 [ 18.461580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 18.461974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 18.462368] Call Trace: [ 18.462518] <TASK> [ 18.462645] ? __die_body+0x64/0xb0 [ 18.462856] ? page_fault_oops+0x353/0x3e0 [ 18.463092] ? 
exc_page_fault+0xaf/0xd0 [ 18.463322] ? asm_exc_page_fault+0x22/0x30 [ 18.463589] ? msg_from_mpoad+0x431/0x9d0 [ 18.463820] ? vcc_sendmsg+0x165/0x3b0 [ 18.464031] vcc_sendmsg+0x20a/0x3b0 [ 18.464238] ? wake_bit_function+0x80/0x80 [ 18.464511] __sys_sendto+0x38c/0x3a0 [ 18.464729] ? percpu_counter_add_batch+0x87/0xb0 [ 18.465002] __x64_sys_sendto+0x22/0x30 [ 18.465219] do_syscall_64+0x6c/0xa0 [ 18.465465] ? preempt_count_add+0x54/0xb0 [ 18.465697] ? up_read+0x37/0x80 [ 18.465883] ? do_user_addr_fault+0x25e/0x5b0 [ 18.466126] ? exit_to_user_mode_prepare+0x12/0xb0 [ 18.466435] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 18.466727] RIP: 0033:0x785e61be4407 [ 18.467948] RSP: 002b:00007ffe61ae2150 EFLAGS: 00000202 ORIG_RAX: 000000000000002c [ 18.468368] RAX: ffffffffffffffda RBX: 0000785e61b5a740 RCX: 0000785e61be4407 [ 18.468758] RDX: 000000000000019c RSI: 00007ffe61ae21c0 RDI: 0000000000000003 [ 18.469149] RBP: 00007ffe61ae2370 R08: 0000000000000000 R09: 0000000000000000 [ 18.469542] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [ 18.469936] R13: 00007ffe61ae2498 R14: 0000785e61d74000 R15: 000057bddcbabd98 Correctly validating the net_device object has several methods. For example, function xgbe_netdev_event() checks `netdev_ops` field, function clip_device_event() checks `type` field. Considering the related variable `lec_netdev_ops` is not defined in the same file, so introduce another type value `ARPHRD_ATM_LANE` for a simple and correct check. By the way, this bug dates back to pre-git history (2.3.15), hence use the first reference for tracking. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: 1da177e ("Linux-2.6.12-rc2") Signed-off-by: NipaLocal <nipa@local>
There are two sites in atm mpoa code that believe the fetched object net_device is of lec type. However, both of them do just name checking to ensure that the device name starts with "lec" pattern string. That is, malicious user can hijack this by creating another device starting with that pattern, thereby causing type confusion. For example, create a *team* interface with lecX name, bind that interface and send messages will get a crash like below: [ 18.450000] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 18.452366] BUG: unable to handle page fault for address: ffff888005702a70 [ 18.454253] #PF: supervisor instruction fetch in kernel mode [ 18.455058] #PF: error_code(0x0011) - permissions violation [ 18.455366] PGD 3801067 P4D 3801067 PUD 3802067 PMD 80000000056000e3 [ 18.455725] Oops: 0011 [kernel-patches#1] PREEMPT SMP PTI [ 18.455966] CPU: 0 PID: 130 Comm: trigger Not tainted 6.1.90 kernel-patches#7 [ 18.456921] RIP: 0010:0xffff888005702a70 [ 18.457151] Code: ..... [ 18.458168] RSP: 0018:ffffc90000677bf8 EFLAGS: 00010286 [ 18.458461] RAX: ffff888005702a70 RBX: ffff888005702000 RCX: 000000000000001b [ 18.458850] RDX: ffffc90000677c10 RSI: ffff88800565e0a8 RDI: ffff888005702000 [ 18.459248] RBP: ffffc90000677c68 R08: 0000000000000000 R09: 0000000000000000 [ 18.459644] R10: 0000000000000000 R11: ffff888005702a70 R12: ffff88800556c000 [ 18.460033] R13: ffff888005964900 R14: ffff8880054b4000 R15: ffff8880054b5000 [ 18.460425] FS: 0000785e61b5a740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 [ 18.460872] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 18.461183] CR2: ffff888005702a70 CR3: 00000000054c2000 CR4: 00000000000006f0 [ 18.461580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 18.461974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 18.462368] Call Trace: [ 18.462518] <TASK> [ 18.462645] ? __die_body+0x64/0xb0 [ 18.462856] ? page_fault_oops+0x353/0x3e0 [ 18.463092] ? 
exc_page_fault+0xaf/0xd0 [ 18.463322] ? asm_exc_page_fault+0x22/0x30 [ 18.463589] ? msg_from_mpoad+0x431/0x9d0 [ 18.463820] ? vcc_sendmsg+0x165/0x3b0 [ 18.464031] vcc_sendmsg+0x20a/0x3b0 [ 18.464238] ? wake_bit_function+0x80/0x80 [ 18.464511] __sys_sendto+0x38c/0x3a0 [ 18.464729] ? percpu_counter_add_batch+0x87/0xb0 [ 18.465002] __x64_sys_sendto+0x22/0x30 [ 18.465219] do_syscall_64+0x6c/0xa0 [ 18.465465] ? preempt_count_add+0x54/0xb0 [ 18.465697] ? up_read+0x37/0x80 [ 18.465883] ? do_user_addr_fault+0x25e/0x5b0 [ 18.466126] ? exit_to_user_mode_prepare+0x12/0xb0 [ 18.466435] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 18.466727] RIP: 0033:0x785e61be4407 [ 18.467948] RSP: 002b:00007ffe61ae2150 EFLAGS: 00000202 ORIG_RAX: 000000000000002c [ 18.468368] RAX: ffffffffffffffda RBX: 0000785e61b5a740 RCX: 0000785e61be4407 [ 18.468758] RDX: 000000000000019c RSI: 00007ffe61ae21c0 RDI: 0000000000000003 [ 18.469149] RBP: 00007ffe61ae2370 R08: 0000000000000000 R09: 0000000000000000 [ 18.469542] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [ 18.469936] R13: 00007ffe61ae2498 R14: 0000785e61d74000 R15: 000057bddcbabd98 Correctly validating the net_device object has several methods. For example, function xgbe_netdev_event() checks `netdev_ops` field, function clip_device_event() checks `type` field. Considering the related variable `lec_netdev_ops` is not defined in the same file, so introduce another type value `ARPHRD_ATM_LANE` for a simple and correct check. By the way, this bug dates back to pre-git history (2.3.15), hence use the first reference for tracking. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: 1da177e ("Linux-2.6.12-rc2") Signed-off-by: NipaLocal <nipa@local>
There are two sites in atm mpoa code that believe the fetched object net_device is of lec type. However, both of them do just name checking to ensure that the device name starts with "lec" pattern string. That is, malicious user can hijack this by creating another device starting with that pattern, thereby causing type confusion. For example, create a *team* interface with lecX name, bind that interface and send messages will get a crash like below: [ 18.450000] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 18.452366] BUG: unable to handle page fault for address: ffff888005702a70 [ 18.454253] #PF: supervisor instruction fetch in kernel mode [ 18.455058] #PF: error_code(0x0011) - permissions violation [ 18.455366] PGD 3801067 P4D 3801067 PUD 3802067 PMD 80000000056000e3 [ 18.455725] Oops: 0011 [kernel-patches#1] PREEMPT SMP PTI [ 18.455966] CPU: 0 PID: 130 Comm: trigger Not tainted 6.1.90 kernel-patches#7 [ 18.456921] RIP: 0010:0xffff888005702a70 [ 18.457151] Code: ..... [ 18.458168] RSP: 0018:ffffc90000677bf8 EFLAGS: 00010286 [ 18.458461] RAX: ffff888005702a70 RBX: ffff888005702000 RCX: 000000000000001b [ 18.458850] RDX: ffffc90000677c10 RSI: ffff88800565e0a8 RDI: ffff888005702000 [ 18.459248] RBP: ffffc90000677c68 R08: 0000000000000000 R09: 0000000000000000 [ 18.459644] R10: 0000000000000000 R11: ffff888005702a70 R12: ffff88800556c000 [ 18.460033] R13: ffff888005964900 R14: ffff8880054b4000 R15: ffff8880054b5000 [ 18.460425] FS: 0000785e61b5a740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 [ 18.460872] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 18.461183] CR2: ffff888005702a70 CR3: 00000000054c2000 CR4: 00000000000006f0 [ 18.461580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 18.461974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 18.462368] Call Trace: [ 18.462518] <TASK> [ 18.462645] ? __die_body+0x64/0xb0 [ 18.462856] ? page_fault_oops+0x353/0x3e0 [ 18.463092] ? 
exc_page_fault+0xaf/0xd0 [ 18.463322] ? asm_exc_page_fault+0x22/0x30 [ 18.463589] ? msg_from_mpoad+0x431/0x9d0 [ 18.463820] ? vcc_sendmsg+0x165/0x3b0 [ 18.464031] vcc_sendmsg+0x20a/0x3b0 [ 18.464238] ? wake_bit_function+0x80/0x80 [ 18.464511] __sys_sendto+0x38c/0x3a0 [ 18.464729] ? percpu_counter_add_batch+0x87/0xb0 [ 18.465002] __x64_sys_sendto+0x22/0x30 [ 18.465219] do_syscall_64+0x6c/0xa0 [ 18.465465] ? preempt_count_add+0x54/0xb0 [ 18.465697] ? up_read+0x37/0x80 [ 18.465883] ? do_user_addr_fault+0x25e/0x5b0 [ 18.466126] ? exit_to_user_mode_prepare+0x12/0xb0 [ 18.466435] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 18.466727] RIP: 0033:0x785e61be4407 [ 18.467948] RSP: 002b:00007ffe61ae2150 EFLAGS: 00000202 ORIG_RAX: 000000000000002c [ 18.468368] RAX: ffffffffffffffda RBX: 0000785e61b5a740 RCX: 0000785e61be4407 [ 18.468758] RDX: 000000000000019c RSI: 00007ffe61ae21c0 RDI: 0000000000000003 [ 18.469149] RBP: 00007ffe61ae2370 R08: 0000000000000000 R09: 0000000000000000 [ 18.469542] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [ 18.469936] R13: 00007ffe61ae2498 R14: 0000785e61d74000 R15: 000057bddcbabd98 Correctly validating the net_device object has several methods. For example, function xgbe_netdev_event() checks `netdev_ops` field, function clip_device_event() checks `type` field. Considering the related variable `lec_netdev_ops` is not defined in the same file, so introduce another type value `ARPHRD_ATM_LANE` for a simple and correct check. By the way, this bug dates back to pre-git history (2.3.15), hence use the first reference for tracking. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: 1da177e ("Linux-2.6.12-rc2") Signed-off-by: NipaLocal <nipa@local>
There are two sites in atm mpoa code that believe the fetched object net_device is of lec type. However, both of them do just name checking to ensure that the device name starts with "lec" pattern string. That is, malicious user can hijack this by creating another device starting with that pattern, thereby causing type confusion. For example, create a *team* interface with lecX name, bind that interface and send messages will get a crash like below: [ 18.450000] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 18.452366] BUG: unable to handle page fault for address: ffff888005702a70 [ 18.454253] #PF: supervisor instruction fetch in kernel mode [ 18.455058] #PF: error_code(0x0011) - permissions violation [ 18.455366] PGD 3801067 P4D 3801067 PUD 3802067 PMD 80000000056000e3 [ 18.455725] Oops: 0011 [kernel-patches#1] PREEMPT SMP PTI [ 18.455966] CPU: 0 PID: 130 Comm: trigger Not tainted 6.1.90 kernel-patches#7 [ 18.456921] RIP: 0010:0xffff888005702a70 [ 18.457151] Code: ..... [ 18.458168] RSP: 0018:ffffc90000677bf8 EFLAGS: 00010286 [ 18.458461] RAX: ffff888005702a70 RBX: ffff888005702000 RCX: 000000000000001b [ 18.458850] RDX: ffffc90000677c10 RSI: ffff88800565e0a8 RDI: ffff888005702000 [ 18.459248] RBP: ffffc90000677c68 R08: 0000000000000000 R09: 0000000000000000 [ 18.459644] R10: 0000000000000000 R11: ffff888005702a70 R12: ffff88800556c000 [ 18.460033] R13: ffff888005964900 R14: ffff8880054b4000 R15: ffff8880054b5000 [ 18.460425] FS: 0000785e61b5a740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 [ 18.460872] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 18.461183] CR2: ffff888005702a70 CR3: 00000000054c2000 CR4: 00000000000006f0 [ 18.461580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 18.461974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 18.462368] Call Trace: [ 18.462518] <TASK> [ 18.462645] ? __die_body+0x64/0xb0 [ 18.462856] ? page_fault_oops+0x353/0x3e0 [ 18.463092] ? 
exc_page_fault+0xaf/0xd0 [ 18.463322] ? asm_exc_page_fault+0x22/0x30 [ 18.463589] ? msg_from_mpoad+0x431/0x9d0 [ 18.463820] ? vcc_sendmsg+0x165/0x3b0 [ 18.464031] vcc_sendmsg+0x20a/0x3b0 [ 18.464238] ? wake_bit_function+0x80/0x80 [ 18.464511] __sys_sendto+0x38c/0x3a0 [ 18.464729] ? percpu_counter_add_batch+0x87/0xb0 [ 18.465002] __x64_sys_sendto+0x22/0x30 [ 18.465219] do_syscall_64+0x6c/0xa0 [ 18.465465] ? preempt_count_add+0x54/0xb0 [ 18.465697] ? up_read+0x37/0x80 [ 18.465883] ? do_user_addr_fault+0x25e/0x5b0 [ 18.466126] ? exit_to_user_mode_prepare+0x12/0xb0 [ 18.466435] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 18.466727] RIP: 0033:0x785e61be4407 [ 18.467948] RSP: 002b:00007ffe61ae2150 EFLAGS: 00000202 ORIG_RAX: 000000000000002c [ 18.468368] RAX: ffffffffffffffda RBX: 0000785e61b5a740 RCX: 0000785e61be4407 [ 18.468758] RDX: 000000000000019c RSI: 00007ffe61ae21c0 RDI: 0000000000000003 [ 18.469149] RBP: 00007ffe61ae2370 R08: 0000000000000000 R09: 0000000000000000 [ 18.469542] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [ 18.469936] R13: 00007ffe61ae2498 R14: 0000785e61d74000 R15: 000057bddcbabd98 Correctly validating the net_device object has several methods. For example, function xgbe_netdev_event() checks `netdev_ops` field, function clip_device_event() checks `type` field. Considering the related variable `lec_netdev_ops` is not defined in the same file, so introduce another type value `ARPHRD_ATM_LANE` for a simple and correct check. By the way, this bug dates back to pre-git history (2.3.15), hence use the first reference for tracking. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: 1da177e ("Linux-2.6.12-rc2") Signed-off-by: NipaLocal <nipa@local>
There are two sites in atm mpoa code that believe the fetched object net_device is of lec type. However, both of them do just name checking to ensure that the device name starts with "lec" pattern string. That is, malicious user can hijack this by creating another device starting with that pattern, thereby causing type confusion. For example, create a *team* interface with lecX name, bind that interface and send messages will get a crash like below: [ 18.450000] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 18.452366] BUG: unable to handle page fault for address: ffff888005702a70 [ 18.454253] #PF: supervisor instruction fetch in kernel mode [ 18.455058] #PF: error_code(0x0011) - permissions violation [ 18.455366] PGD 3801067 P4D 3801067 PUD 3802067 PMD 80000000056000e3 [ 18.455725] Oops: 0011 [kernel-patches#1] PREEMPT SMP PTI [ 18.455966] CPU: 0 PID: 130 Comm: trigger Not tainted 6.1.90 kernel-patches#7 [ 18.456921] RIP: 0010:0xffff888005702a70 [ 18.457151] Code: ..... [ 18.458168] RSP: 0018:ffffc90000677bf8 EFLAGS: 00010286 [ 18.458461] RAX: ffff888005702a70 RBX: ffff888005702000 RCX: 000000000000001b [ 18.458850] RDX: ffffc90000677c10 RSI: ffff88800565e0a8 RDI: ffff888005702000 [ 18.459248] RBP: ffffc90000677c68 R08: 0000000000000000 R09: 0000000000000000 [ 18.459644] R10: 0000000000000000 R11: ffff888005702a70 R12: ffff88800556c000 [ 18.460033] R13: ffff888005964900 R14: ffff8880054b4000 R15: ffff8880054b5000 [ 18.460425] FS: 0000785e61b5a740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 [ 18.460872] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 18.461183] CR2: ffff888005702a70 CR3: 00000000054c2000 CR4: 00000000000006f0 [ 18.461580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 18.461974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 18.462368] Call Trace: [ 18.462518] <TASK> [ 18.462645] ? __die_body+0x64/0xb0 [ 18.462856] ? page_fault_oops+0x353/0x3e0 [ 18.463092] ? 
exc_page_fault+0xaf/0xd0 [ 18.463322] ? asm_exc_page_fault+0x22/0x30 [ 18.463589] ? msg_from_mpoad+0x431/0x9d0 [ 18.463820] ? vcc_sendmsg+0x165/0x3b0 [ 18.464031] vcc_sendmsg+0x20a/0x3b0 [ 18.464238] ? wake_bit_function+0x80/0x80 [ 18.464511] __sys_sendto+0x38c/0x3a0 [ 18.464729] ? percpu_counter_add_batch+0x87/0xb0 [ 18.465002] __x64_sys_sendto+0x22/0x30 [ 18.465219] do_syscall_64+0x6c/0xa0 [ 18.465465] ? preempt_count_add+0x54/0xb0 [ 18.465697] ? up_read+0x37/0x80 [ 18.465883] ? do_user_addr_fault+0x25e/0x5b0 [ 18.466126] ? exit_to_user_mode_prepare+0x12/0xb0 [ 18.466435] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 18.466727] RIP: 0033:0x785e61be4407 [ 18.467948] RSP: 002b:00007ffe61ae2150 EFLAGS: 00000202 ORIG_RAX: 000000000000002c [ 18.468368] RAX: ffffffffffffffda RBX: 0000785e61b5a740 RCX: 0000785e61be4407 [ 18.468758] RDX: 000000000000019c RSI: 00007ffe61ae21c0 RDI: 0000000000000003 [ 18.469149] RBP: 00007ffe61ae2370 R08: 0000000000000000 R09: 0000000000000000 [ 18.469542] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [ 18.469936] R13: 00007ffe61ae2498 R14: 0000785e61d74000 R15: 000057bddcbabd98 Correctly validating the net_device object has several methods. For example, function xgbe_netdev_event() checks `netdev_ops` field, function clip_device_event() checks `type` field. Considering the related variable `lec_netdev_ops` is not defined in the same file, so introduce another type value `ARPHRD_ATM_LANE` for a simple and correct check. By the way, this bug dates back to pre-git history (2.3.15), hence use the first reference for tracking. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: 1da177e ("Linux-2.6.12-rc2") Signed-off-by: NipaLocal <nipa@local>
There are two sites in atm mpoa code that believe the fetched object net_device is of lec type. However, both of them do just name checking to ensure that the device name starts with "lec" pattern string. That is, malicious user can hijack this by creating another device starting with that pattern, thereby causing type confusion. For example, create a *team* interface with lecX name, bind that interface and send messages will get a crash like below: [ 18.450000] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 18.452366] BUG: unable to handle page fault for address: ffff888005702a70 [ 18.454253] #PF: supervisor instruction fetch in kernel mode [ 18.455058] #PF: error_code(0x0011) - permissions violation [ 18.455366] PGD 3801067 P4D 3801067 PUD 3802067 PMD 80000000056000e3 [ 18.455725] Oops: 0011 [kernel-patches#1] PREEMPT SMP PTI [ 18.455966] CPU: 0 PID: 130 Comm: trigger Not tainted 6.1.90 kernel-patches#7 [ 18.456921] RIP: 0010:0xffff888005702a70 [ 18.457151] Code: ..... [ 18.458168] RSP: 0018:ffffc90000677bf8 EFLAGS: 00010286 [ 18.458461] RAX: ffff888005702a70 RBX: ffff888005702000 RCX: 000000000000001b [ 18.458850] RDX: ffffc90000677c10 RSI: ffff88800565e0a8 RDI: ffff888005702000 [ 18.459248] RBP: ffffc90000677c68 R08: 0000000000000000 R09: 0000000000000000 [ 18.459644] R10: 0000000000000000 R11: ffff888005702a70 R12: ffff88800556c000 [ 18.460033] R13: ffff888005964900 R14: ffff8880054b4000 R15: ffff8880054b5000 [ 18.460425] FS: 0000785e61b5a740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 [ 18.460872] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 18.461183] CR2: ffff888005702a70 CR3: 00000000054c2000 CR4: 00000000000006f0 [ 18.461580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 18.461974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 18.462368] Call Trace: [ 18.462518] <TASK> [ 18.462645] ? __die_body+0x64/0xb0 [ 18.462856] ? page_fault_oops+0x353/0x3e0 [ 18.463092] ? 
exc_page_fault+0xaf/0xd0 [ 18.463322] ? asm_exc_page_fault+0x22/0x30 [ 18.463589] ? msg_from_mpoad+0x431/0x9d0 [ 18.463820] ? vcc_sendmsg+0x165/0x3b0 [ 18.464031] vcc_sendmsg+0x20a/0x3b0 [ 18.464238] ? wake_bit_function+0x80/0x80 [ 18.464511] __sys_sendto+0x38c/0x3a0 [ 18.464729] ? percpu_counter_add_batch+0x87/0xb0 [ 18.465002] __x64_sys_sendto+0x22/0x30 [ 18.465219] do_syscall_64+0x6c/0xa0 [ 18.465465] ? preempt_count_add+0x54/0xb0 [ 18.465697] ? up_read+0x37/0x80 [ 18.465883] ? do_user_addr_fault+0x25e/0x5b0 [ 18.466126] ? exit_to_user_mode_prepare+0x12/0xb0 [ 18.466435] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 18.466727] RIP: 0033:0x785e61be4407 [ 18.467948] RSP: 002b:00007ffe61ae2150 EFLAGS: 00000202 ORIG_RAX: 000000000000002c [ 18.468368] RAX: ffffffffffffffda RBX: 0000785e61b5a740 RCX: 0000785e61be4407 [ 18.468758] RDX: 000000000000019c RSI: 00007ffe61ae21c0 RDI: 0000000000000003 [ 18.469149] RBP: 00007ffe61ae2370 R08: 0000000000000000 R09: 0000000000000000 [ 18.469542] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [ 18.469936] R13: 00007ffe61ae2498 R14: 0000785e61d74000 R15: 000057bddcbabd98 Correctly validating the net_device object has several methods. For example, function xgbe_netdev_event() checks `netdev_ops` field, function clip_device_event() checks `type` field. Considering the related variable `lec_netdev_ops` is not defined in the same file, so introduce another type value `ARPHRD_ATM_LANE` for a simple and correct check. By the way, this bug dates back to pre-git history (2.3.15), hence use the first reference for tracking. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: 1da177e ("Linux-2.6.12-rc2") Signed-off-by: NipaLocal <nipa@local>
Two sites in the atm mpoa code assume that the fetched net_device object is of lec type. However, both only check that the device name starts with the "lec" pattern string. A malicious user can therefore create another device whose name starts with that pattern and cause type confusion. For example, creating a *team* interface named lecX, binding that interface, and sending messages produces a crash like the one below:
[ 18.450000] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 18.452366] BUG: unable to handle page fault for address: ffff888005702a70
[ 18.454253] #PF: supervisor instruction fetch in kernel mode
[ 18.455058] #PF: error_code(0x0011) - permissions violation
[ 18.455366] PGD 3801067 P4D 3801067 PUD 3802067 PMD 80000000056000e3
[ 18.455725] Oops: 0011 [#1] PREEMPT SMP PTI
[ 18.455966] CPU: 0 PID: 130 Comm: trigger Not tainted 6.1.90 #7
[ 18.456921] RIP: 0010:0xffff888005702a70
[ 18.457151] Code: .....
[ 18.458168] RSP: 0018:ffffc90000677bf8 EFLAGS: 00010286
[ 18.458461] RAX: ffff888005702a70 RBX: ffff888005702000 RCX: 000000000000001b
[ 18.458850] RDX: ffffc90000677c10 RSI: ffff88800565e0a8 RDI: ffff888005702000
[ 18.459248] RBP: ffffc90000677c68 R08: 0000000000000000 R09: 0000000000000000
[ 18.459644] R10: 0000000000000000 R11: ffff888005702a70 R12: ffff88800556c000
[ 18.460033] R13: ffff888005964900 R14: ffff8880054b4000 R15: ffff8880054b5000
[ 18.460425] FS: 0000785e61b5a740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[ 18.460872] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.461183] CR2: ffff888005702a70 CR3: 00000000054c2000 CR4: 00000000000006f0
[ 18.461580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 18.461974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 18.462368] Call Trace:
[ 18.462518] <TASK>
[ 18.462645] ? __die_body+0x64/0xb0
[ 18.462856] ? page_fault_oops+0x353/0x3e0
[ 18.463092] ? exc_page_fault+0xaf/0xd0
[ 18.463322] ? asm_exc_page_fault+0x22/0x30
[ 18.463589] ? msg_from_mpoad+0x431/0x9d0
[ 18.463820] ? vcc_sendmsg+0x165/0x3b0
[ 18.464031] vcc_sendmsg+0x20a/0x3b0
[ 18.464238] ? wake_bit_function+0x80/0x80
[ 18.464511] __sys_sendto+0x38c/0x3a0
[ 18.464729] ? percpu_counter_add_batch+0x87/0xb0
[ 18.465002] __x64_sys_sendto+0x22/0x30
[ 18.465219] do_syscall_64+0x6c/0xa0
[ 18.465465] ? preempt_count_add+0x54/0xb0
[ 18.465697] ? up_read+0x37/0x80
[ 18.465883] ? do_user_addr_fault+0x25e/0x5b0
[ 18.466126] ? exit_to_user_mode_prepare+0x12/0xb0
[ 18.466435] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 18.466727] RIP: 0033:0x785e61be4407
[ 18.467948] RSP: 002b:00007ffe61ae2150 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
[ 18.468368] RAX: ffffffffffffffda RBX: 0000785e61b5a740 RCX: 0000785e61be4407
[ 18.468758] RDX: 000000000000019c RSI: 00007ffe61ae21c0 RDI: 0000000000000003
[ 18.469149] RBP: 00007ffe61ae2370 R08: 0000000000000000 R09: 0000000000000000
[ 18.469542] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[ 18.469936] R13: 00007ffe61ae2498 R14: 0000785e61d74000 R15: 000057bddcbabd98
There are several ways to validate the net_device object correctly. For example, xgbe_netdev_event() checks the `netdev_ops` field, and clip_device_event() checks the `type` field. Because the related variable `lec_netdev_ops` is not defined in the same file, introduce another type value, `ARPHRD_ATM_LANE`, for a simple and correct check.
This bug dates back to pre-git history (2.3.15), hence the Fixes tag uses the first git commit for tracking.
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Fixes: 1da177e ("Linux-2.6.12-rc2")
Signed-off-by: NipaLocal <nipa@local>
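To make the difference between the two kinds of check concrete, here is a minimal user-space sketch. It is not the kernel patch: `struct demo_net_device`, the `DEMO_ARPHRD_*` values, and both helper names are hypothetical stand-ins chosen for illustration. It only shows why a name-prefix test accepts an impostor device while a type test does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for kernel definitions; the values are illustrative. */
#define DEMO_ARPHRD_ETHER    1
#define DEMO_ARPHRD_ATM_LANE 2

struct demo_net_device {
    char name[16];
    unsigned short type;
};

/* Name-prefix check: any device whose name starts with "lec" passes. */
static bool looks_like_lec_by_name(const struct demo_net_device *dev)
{
    assert(dev != NULL);
    return strncmp(dev->name, "lec", 3) == 0;
}

/* Type check: in this sketch the type is set at creation and does not
 * depend on the name the user picked for the device. */
static bool is_lec_by_type(const struct demo_net_device *dev)
{
    assert(dev != NULL);
    return dev->type == DEMO_ARPHRD_ATM_LANE;
}
```

In this sketch, a device named "lec0" with an Ethernet-like type passes the first helper and fails the second, which is the confusion the commit message describes.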
A crash in conntrack was reported while trying to unlink the conntrack
entry from the hash bucket list:
[exception RIP: __nf_ct_delete_from_lists+172]
[..]
 #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
 #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
 #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
[..]
The nf_conn struct is marked as allocated from slab but appears to be in
a partially initialised state:
ct hlist pointer is garbage; looks like the ct hash value
(hence crash).
ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
ct->timeout is 30000 (=30s), which is unexpected.
Everything else looks like normal udp conntrack entry. If we ignore
ct->status and pretend it's 0, the entry matches those that are newly
allocated but not yet inserted into the hash:
- ct hlist pointers are overloaded and store/cache the raw tuple hash
- ct->timeout matches the relative time expected for a new udp flow
rather than the absolute 'jiffies' value.
If it were not for the presence of IPS_CONFIRMED,
__nf_conntrack_find_get() would have skipped the entry.
Theory is that we did hit following race:

cpu x                   cpu y                   cpu z
 found entry E          found entry E
 E is expired           <preemption>
 nf_ct_delete()
 return E to rcu slab
                                                init_conntrack
                                                E is re-inited,
                                                ct->status set to 0
                                                reply tuplehash hnnode.pprev
                                                stores hash value.

cpu y found E right before it was deleted on cpu x.
E is now re-inited on cpu z. cpu y was preempted before
checking for expiry and/or confirm bit.

                                                ->refcnt set to 1
                                                E now owned by skb
                                                ->timeout set to 30000

If cpu y were to resume now, it would observe E as
expired but would skip E due to missing CONFIRMED bit.

                                                nf_conntrack_confirm gets called
                                                sets: ct->status |= CONFIRMED
                                                This is wrong: E is not yet added
                                                to hashtable.

cpu y resumes, it observes E as expired but CONFIRMED:
                        <resumes>
                        nf_ct_expired()
                         -> yes (ct->timeout is 30s)
                        confirmed bit set.

cpu y will try to delete E from the hashtable:
                        nf_ct_delete() -> set DYING bit
                        __nf_ct_delete_from_lists

Even this scenario doesn't guarantee a crash:
cpu z still holds the table bucket lock(s) so y blocks:

                        wait for spinlock held by z

                                                CONFIRMED is set but there is no
                                                guarantee ct will be added to hash:
                                                "chaintoolong" or "clash resolution"
                                                logic both skip the insert step.
                                                reply hnnode.pprev still stores the
                                                hash value.

                                                unlocks spinlock
                                                return NF_DROP
                        <unblocks, then
                         crashes on hlist_nulls_del_rcu pprev>
In case CPU z does insert the entry into the hashtable, cpu y will unlink
E again right away but no crash occurs.
Without 'cpu y' race, 'garbage' hlist is of no consequence:
ct refcnt remains at 1, eventually skb will be free'd and E gets
destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.
To resolve this, move the IPS_CONFIRMED assignment after the table
insertion but before the unlock.
Pablo points out that the confirm-bit-store could be reordered to happen
before hlist add resp. the timeout fixup, so switch to set_bit and
before_atomic memory barrier to prevent this.
It doesn't matter if other CPUs can observe a newly inserted entry right
before the CONFIRMED bit was set:
Such event cannot be distinguished from above "E is the old incarnation"
case: the entry will be skipped.
Also change nf_ct_should_gc() to first check the confirmed bit.
The gc sequence is:
1. Check if entry has expired, if not skip to next entry
2. Obtain a reference to the expired entry.
3. Call nf_ct_should_gc() to double-check step 1.
nf_ct_should_gc() is thus called only for entries that already failed an
expiry check. After this patch, once the confirmed bit check passes
ct->timeout has been altered to reflect the absolute 'best before' date
instead of a relative time. Step 3 will therefore not remove the entry.
Without this change to nf_ct_should_gc() we could still get this sequence:
1. Check if entry has expired.
2. Obtain a reference.
3. Call nf_ct_should_gc() to double-check step 1:
4 - entry is still observed as expired
5 - meanwhile, ct->timeout is corrected to absolute value on other CPU
and confirm bit gets set
6 - confirm bit is seen
7 - valid entry is removed again
First do check 6), then 4) so the gc expiry check always picks up either
confirmed bit unset (entry gets skipped) or expiry re-check failure for
re-inited conntrack objects.
This change cannot be backported to releases before 5.19. Without
commit 8a75a2c ("netfilter: conntrack: remove unconfirmed list"), the
|= IPS_CONFIRMED line cannot be moved without further changes.
Cc: Razvan Cojocaru <rzvncj@gmail.com>
Link: https://lore.kernel.org/netfilter-devel/20250627142758.25664-1-fw@strlen.de/
Link: https://lore.kernel.org/netfilter-devel/4239da15-83ff-4ca4-939d-faef283471bb@gmail.com/
Fixes: 1397af5 ("netfilter: conntrack: remove the percpu dying list")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: NipaLocal <nipa@local>
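The ordering argued for in the conntrack message above (finish the entry first, publish the confirmed bit last, and have the gc path test the confirmed bit before the timeout) can be sketched in user space with C11 atomics. This is only an analogue under stated assumptions: `struct demo_entry`, `demo_confirm()`, and `demo_should_gc()` are hypothetical names, and a release `atomic_fetch_or` stands in for the kernel's set_bit plus before_atomic barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CONFIRMED 0x1u

struct demo_entry {
    _Atomic unsigned int status;
    uint32_t timeout;   /* relative until confirmed, absolute afterwards */
    bool in_table;      /* stands in for "linked into the hash table" */
};

/* Writer: fix up the timeout and insert, then set the confirmed bit last.
 * The release ordering keeps the earlier stores from moving after it. */
static void demo_confirm(struct demo_entry *e, uint32_t now)
{
    e->timeout += now;      /* relative -> absolute */
    e->in_table = true;     /* hypothetical hash insert */
    atomic_fetch_or_explicit(&e->status, DEMO_CONFIRMED, memory_order_release);
}

/* Reader: test the confirmed bit first; only then is timeout meaningful. */
static bool demo_should_gc(struct demo_entry *e, uint32_t now)
{
    unsigned int s = atomic_load_explicit(&e->status, memory_order_acquire);

    if (!(s & DEMO_CONFIRMED))
        return false;       /* unconfirmed entries are skipped */
    return (int32_t)(e->timeout - now) <= 0;
}
```

With this order, a reader that sees the confirmed bit also sees the absolute timeout, so a freshly confirmed entry is not mistaken for an expired one.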
The hfsplus_bnode_read() method can trigger the issue:
[ 174.852007][ T9784] ==================================================================
[ 174.852709][ T9784] BUG: KASAN: slab-out-of-bounds in hfsplus_bnode_read+0x2f4/0x360
[ 174.853412][ T9784] Read of size 8 at addr ffff88810b5fc6c0 by task repro/9784
[ 174.854059][ T9784]
[ 174.854272][ T9784] CPU: 1 UID: 0 PID: 9784 Comm: repro Not tainted 6.16.0-rc3 #7 PREEMPT(full)
[ 174.854281][ T9784] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 174.854286][ T9784] Call Trace:
[ 174.854289][ T9784] <TASK>
[ 174.854292][ T9784] dump_stack_lvl+0x10e/0x1f0
[ 174.854305][ T9784] print_report+0xd0/0x660
[ 174.854315][ T9784] ? __virt_addr_valid+0x81/0x610
[ 174.854323][ T9784] ? __phys_addr+0xe8/0x180
[ 174.854330][ T9784] ? hfsplus_bnode_read+0x2f4/0x360
[ 174.854337][ T9784] kasan_report+0xc6/0x100
[ 174.854346][ T9784] ? hfsplus_bnode_read+0x2f4/0x360
[ 174.854354][ T9784] hfsplus_bnode_read+0x2f4/0x360
[ 174.854362][ T9784] hfsplus_bnode_dump+0x2ec/0x380
[ 174.854370][ T9784] ? __pfx_hfsplus_bnode_dump+0x10/0x10
[ 174.854377][ T9784] ? hfsplus_bnode_write_u16+0x83/0xb0
[ 174.854385][ T9784] ? srcu_gp_start+0xd0/0x310
[ 174.854393][ T9784] ? __mark_inode_dirty+0x29e/0xe40
[ 174.854402][ T9784] hfsplus_brec_remove+0x3d2/0x4e0
[ 174.854411][ T9784] __hfsplus_delete_attr+0x290/0x3a0
[ 174.854419][ T9784] ? __pfx_hfs_find_1st_rec_by_cnid+0x10/0x10
[ 174.854427][ T9784] ? __pfx___hfsplus_delete_attr+0x10/0x10
[ 174.854436][ T9784] ? __asan_memset+0x23/0x50
[ 174.854450][ T9784] hfsplus_delete_all_attrs+0x262/0x320
[ 174.854459][ T9784] ? __pfx_hfsplus_delete_all_attrs+0x10/0x10
[ 174.854469][ T9784] ? rcu_is_watching+0x12/0xc0
[ 174.854476][ T9784] ? __mark_inode_dirty+0x29e/0xe40
[ 174.854483][ T9784] hfsplus_delete_cat+0x845/0xde0
[ 174.854493][ T9784] ? __pfx_hfsplus_delete_cat+0x10/0x10
[ 174.854507][ T9784] hfsplus_unlink+0x1ca/0x7c0
[ 174.854516][ T9784] ? __pfx_hfsplus_unlink+0x10/0x10
[ 174.854525][ T9784] ? down_write+0x148/0x200
[ 174.854532][ T9784] ? __pfx_down_write+0x10/0x10
[ 174.854540][ T9784] vfs_unlink+0x2fe/0x9b0
[ 174.854549][ T9784] do_unlinkat+0x490/0x670
[ 174.854557][ T9784] ? __pfx_do_unlinkat+0x10/0x10
[ 174.854565][ T9784] ? __might_fault+0xbc/0x130
[ 174.854576][ T9784] ? getname_flags.part.0+0x1c5/0x550
[ 174.854584][ T9784] __x64_sys_unlink+0xc5/0x110
[ 174.854592][ T9784] do_syscall_64+0xc9/0x480
[ 174.854600][ T9784] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 174.854608][ T9784] RIP: 0033:0x7f6fdf4c3167
[ 174.854614][ T9784] Code: f0 ff ff 73 01 c3 48 8b 0d 26 0d 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 08
[ 174.854622][ T9784] RSP: 002b:00007ffcb948bca8 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[ 174.854630][ T9784] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f6fdf4c3167
[ 174.854636][ T9784] RDX: 00007ffcb948bcc0 RSI: 00007ffcb948bcc0 RDI: 00007ffcb948bd50
[ 174.854641][ T9784] RBP: 00007ffcb948cd90 R08: 0000000000000001 R09: 00007ffcb948bb40
[ 174.854645][ T9784] R10: 00007f6fdf564fc0 R11: 0000000000000206 R12: 0000561e1bc9c2d0
[ 174.854650][ T9784] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 174.854658][ T9784] </TASK>
[ 174.854661][ T9784]
[ 174.879281][ T9784] Allocated by task 9784:
[ 174.879664][ T9784] kasan_save_stack+0x20/0x40
[ 174.880082][ T9784] kasan_save_track+0x14/0x30
[ 174.880500][ T9784] __kasan_kmalloc+0xaa/0xb0
[ 174.880908][ T9784] __kmalloc_noprof+0x205/0x550
[ 174.881337][ T9784] __hfs_bnode_create+0x107/0x890
[ 174.881779][ T9784] hfsplus_bnode_find+0x2d0/0xd10
[ 174.882222][ T9784] hfsplus_brec_find+0x2b0/0x520
[ 174.882659][ T9784] hfsplus_delete_all_attrs+0x23b/0x320
[ 174.883144][ T9784] hfsplus_delete_cat+0x845/0xde0
[ 174.883595][ T9784] hfsplus_rmdir+0x106/0x1b0
[ 174.884004][ T9784] vfs_rmdir+0x206/0x690
[ 174.884379][ T9784] do_rmdir+0x2b7/0x390
[ 174.884751][ T9784] __x64_sys_rmdir+0xc5/0x110
[ 174.885167][ T9784] do_syscall_64+0xc9/0x480
[ 174.885568][ T9784] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 174.886083][ T9784]
[ 174.886293][ T9784] The buggy address belongs to the object at ffff88810b5fc600
[ 174.886293][ T9784] which belongs to the cache kmalloc-192 of size 192
[ 174.887507][ T9784] The buggy address is located 40 bytes to the right of
[ 174.887507][ T9784] allocated 152-byte region [ffff88810b5fc600, ffff88810b5fc698)
[ 174.888766][ T9784]
[ 174.888976][ T9784] The buggy address belongs to the physical page:
[ 174.889533][ T9784] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10b5fc
[ 174.890295][ T9784] flags: 0x57ff00000000000(node=1|zone=2|lastcpupid=0x7ff)
[ 174.890927][ T9784] page_type: f5(slab)
[ 174.891284][ T9784] raw: 057ff00000000000 ffff88801b4423c0 ffffea000426dc80 dead000000000002
[ 174.892032][ T9784] raw: 0000000000000000 0000000080100010 00000000f5000000 0000000000000000
[ 174.892774][ T9784] page dumped because: kasan: bad access detected
[ 174.893327][ T9784] page_owner tracks the page as allocated
[ 174.893825][ T9784] page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52c00(GFP_NOIO|__GFP_NOWARN|__GFP_NO1
[ 174.895373][ T9784] post_alloc_hook+0x1c0/0x230
[ 174.895801][ T9784] get_page_from_freelist+0xdeb/0x3b30
[ 174.896284][ T9784] __alloc_frozen_pages_noprof+0x25c/0x2460
[ 174.896810][ T9784] alloc_pages_mpol+0x1fb/0x550
[ 174.897242][ T9784] new_slab+0x23b/0x340
[ 174.897614][ T9784] ___slab_alloc+0xd81/0x1960
[ 174.898028][ T9784] __slab_alloc.isra.0+0x56/0xb0
[ 174.898468][ T9784] __kmalloc_noprof+0x2b0/0x550
[ 174.898896][ T9784] usb_alloc_urb+0x73/0xa0
[ 174.899289][ T9784] usb_control_msg+0x1cb/0x4a0
[ 174.899718][ T9784] usb_get_string+0xab/0x1a0
[ 174.900133][ T9784] usb_string_sub+0x107/0x3c0
[ 174.900549][ T9784] usb_string+0x307/0x670
[ 174.900933][ T9784] usb_cache_string+0x80/0x150
[ 174.901355][ T9784] usb_new_device+0x1d0/0x19d0
[ 174.901786][ T9784] register_root_hub+0x299/0x730
[ 174.902231][ T9784] page last free pid 10 tgid 10 stack trace:
[ 174.902757][ T9784] __free_frozen_pages+0x80c/0x1250
[ 174.903217][ T9784] vfree.part.0+0x12b/0xab0
[ 174.903645][ T9784] delayed_vfree_work+0x93/0xd0
[ 174.904073][ T9784] process_one_work+0x9b5/0x1b80
[ 174.904519][ T9784] worker_thread+0x630/0xe60
[ 174.904927][ T9784] kthread+0x3a8/0x770
[ 174.905291][ T9784] ret_from_fork+0x517/0x6e0
[ 174.905709][ T9784] ret_from_fork_asm+0x1a/0x30
[ 174.906128][ T9784]
[ 174.906338][ T9784] Memory state around the buggy address:
[ 174.906828][ T9784] ffff88810b5fc580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 174.907528][ T9784] ffff88810b5fc600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 174.908222][ T9784] >ffff88810b5fc680: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 174.908917][ T9784] ^
[ 174.909481][ T9784] ffff88810b5fc700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 174.910432][ T9784] ffff88810b5fc780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 174.911401][ T9784] ==================================================================
The issue occurs because the code does not check that the requested offset and length are correct. As a result, an incorrect offset and/or length can cause an access outside the allocated memory.
This patch introduces the is_bnode_offset_valid() method, which checks the requested offset value, and the check_and_correct_requested_length() method, which checks the requested length and corrects it if necessary. These methods are used in hfsplus_bnode_read(), hfsplus_bnode_write(), hfsplus_bnode_clear(), hfsplus_bnode_copy(), and hfsplus_bnode_move() to prevent out-of-bounds access and the resulting crash.
Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
Reported-by: Jiaji Qin <jjtan24@m.fudan.edu.cn>
Reported-by: Shuoran Bai <baishuoran@hrbeu.edu.cn>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/20250703214804.244077-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
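The two checks named in the hfsplus message above can be pictured with a small user-space sketch. It does not reproduce the kernel helpers: `struct demo_node`, `demo_offset_valid()`, and `demo_clamp_len()` are hypothetical, and the sketch assumes the only bound is a single node size. It shows one plausible shape: reject an offset outside the node, then shrink the length so that offset plus length stays inside it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node descriptor; only the size matters for this sketch. */
struct demo_node {
    size_t size;
};

/* An offset is valid only if it points inside the node. */
static bool demo_offset_valid(const struct demo_node *node, size_t off)
{
    assert(node != NULL);
    return off < node->size;
}

/* Return the length that can be accessed safely starting at off:
 * 0 for an invalid offset, otherwise len clamped to the remaining room. */
static size_t demo_clamp_len(const struct demo_node *node, size_t off, size_t len)
{
    size_t room;

    if (!demo_offset_valid(node, off))
        return 0;
    room = node->size - off;
    return len > room ? room : len;
}
```

A caller would run the offset check first, and then copy only the clamped number of bytes.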
perf script tests fail with a segmentation fault as shown below:
92: perf script tests:
--- start ---
test child forked, pid 103769
DB test
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.012 MB /tmp/perf-test-script.7rbftEpOzX/perf.data (9 samples) ]
/usr/libexec/perf-core/tests/shell/script.sh: line 35:
103780 Segmentation fault (core dumped)
perf script -i "${perfdatafile}" -s "${db_test}"
--- Cleaning up ---
---- end(-1) ----
92: perf script tests : FAILED!
The backtrace pointed to:
#0 0x0000000010247dd0 in maps.machine ()
#1 0x00000000101d178c in db_export.sample ()
#2 0x00000000103412c8 in python_process_event ()
#3 0x000000001004eb28 in process_sample_event ()
#4 0x000000001024fcd0 in machines.deliver_event ()
#5 0x000000001025005c in perf_session.deliver_event ()
#6 0x00000000102568b0 in __ordered_events__flush.part.0 ()
#7 0x0000000010251618 in perf_session.process_events ()
#8 0x0000000010053620 in cmd_script ()
#9 0x00000000100b5a28 in run_builtin ()
#10 0x00000000100b5f94 in handle_internal_command ()
#11 0x0000000010011114 in main ()
Further investigation reveals that this occurs in `perf script tests`
because it uses the `db_test.py` script, which sets `perf_db_export_mode = True`.
With `perf_db_export_mode` enabled, if a sample originates from a hypervisor,
perf doesn't set maps for the "[H]" sample. Consequently, `al->maps` is still NULL
when `maps__machine(al->maps)` is called from `db_export__sample`.
Since al->maps can be NULL for hypervisor samples, use thread->maps instead,
because a machine should exist even for a hypervisor sample.
If there is no machine for some reason, return -1 to avoid the segmentation fault.
Reported-by: Disha Goel <disgoel@linux.ibm.com>
Signed-off-by: Aditya Bodkhe <aditya.b1@linux.ibm.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Disha Goel <disgoel@linux.ibm.com>
Link: https://lore.kernel.org/r/20250429065132.36839-1-adityab1@linux.ibm.com
Suggested-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
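The NULL handling described in the db-export fix above can be sketched as follows. This is a simplified model under stated assumptions: `machine_model`, `maps_model`, `thread_model`, and `resolve_machine()` are hypothetical names made up for this example, not perf's actual structures or functions.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced models of the objects involved. */
struct machine_model { int id; };
struct maps_model { struct machine_model *machine; };
struct thread_model { struct maps_model *maps; };

/*
 * Look the machine up through the thread's maps rather than through
 * the per-sample maps pointer, which may be NULL for hypervisor samples.
 * Returns 0 and stores the machine in *out on success, -1 otherwise.
 */
static int resolve_machine(const struct thread_model *thread,
			   struct machine_model **out)
{
	struct machine_model *machine = NULL;

	if (thread && thread->maps)
		machine = thread->maps->machine;

	if (!machine)
		return -1;	/* bail out instead of dereferencing NULL */

	*out = machine;
	return 0;
}
```

Returning an error code keeps the failure visible to the caller, where a NULL dereference would crash.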
Without the change, `perf` hangs on character devices. On my system,
running a system-wide sampler for a few seconds is enough to reproduce
the hang:
$ perf record -a -g --call-graph=dwarf
$ perf report
# hung
`strace` shows that the hang happens while reading a character device,
`/dev/dri/renderD128`:
$ strace -y -f -p 2780484
strace: Process 2780484 attached
pread64(101</dev/dri/renderD128>, strace: Process 2780484 detached
Its call trace descends into `elfutils`:
$ gdb -p 2780484
(gdb) bt
#0 0x00007f5e508f04b7 in __libc_pread64 (fd=101, buf=0x7fff9df7edb0, count=0, offset=0)
at ../sysdeps/unix/sysv/linux/pread64.c:25
#1 0x00007f5e52b79515 in read_file () from /<<NIX>>/elfutils-0.192/lib/libelf.so.1
#2 0x00007f5e52b25666 in libdw_open_elf () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#3 0x00007f5e52b25907 in __libdw_open_file () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#4 0x00007f5e52b120a9 in dwfl_report_elf@@ELFUTILS_0.156 ()
from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#5 0x000000000068bf20 in __report_module (al=al@entry=0x7fff9df80010, ip=ip@entry=139803237033216, ui=ui@entry=0x5369b5e0)
at util/dso.h:537
#6 0x000000000068c3d1 in report_module (ip=139803237033216, ui=0x5369b5e0) at util/unwind-libdw.c:114
#7 frame_callback (state=0x535aef10, arg=0x5369b5e0) at util/unwind-libdw.c:242
#8 0x00007f5e52b261d3 in dwfl_thread_getframes () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#9 0x00007f5e52b25bdb in get_one_thread_cb () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#10 0x00007f5e52b25faa in dwfl_getthreads () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#11 0x00007f5e52b26514 in dwfl_getthread_frames () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#12 0x000000000068c6ce in unwind__get_entries (cb=cb@entry=0x5d4620 <unwind_entry>, arg=arg@entry=0x10cd5fa0,
thread=thread@entry=0x1076a290, data=data@entry=0x7fff9df80540, max_stack=max_stack@entry=127,
best_effort=best_effort@entry=false) at util/thread.h:152
#13 0x00000000005dae95 in thread__resolve_callchain_unwind (evsel=0x106006d0, thread=0x1076a290, cursor=0x10cd5fa0,
sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2939
#14 thread__resolve_callchain_unwind (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, sample=0x7fff9df80540,
max_stack=127, symbols=true) at util/machine.c:2920
#15 __thread__resolve_callchain (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, evsel@entry=0x7fff9df80440,
sample=0x7fff9df80540, parent=parent@entry=0x7fff9df804a0, root_al=root_al@entry=0x7fff9df80440, max_stack=127, symbols=true)
at util/machine.c:2970
#16 0x00000000005d0cb2 in thread__resolve_callchain (thread=<optimized out>, cursor=<optimized out>, evsel=0x7fff9df80440,
sample=<optimized out>, parent=0x7fff9df804a0, root_al=0x7fff9df80440, max_stack=127) at util/machine.h:198
#17 sample__resolve_callchain (sample=<optimized out>, cursor=<optimized out>, parent=parent@entry=0x7fff9df804a0,
evsel=evsel@entry=0x106006d0, al=al@entry=0x7fff9df80440, max_stack=max_stack@entry=127) at util/callchain.c:1127
#18 0x0000000000617e08 in hist_entry_iter__add (iter=iter@entry=0x7fff9df80480, al=al@entry=0x7fff9df80440, max_stack_depth=127,
arg=arg@entry=0x7fff9df81ae0) at util/hist.c:1255
#19 0x000000000045d2d0 in process_sample_event (tool=0x7fff9df81ae0, event=<optimized out>, sample=0x7fff9df80540,
evsel=0x106006d0, machine=<optimized out>) at builtin-report.c:334
#20 0x00000000005e3bb1 in perf_session__deliver_event (session=0x105ff2c0, event=0x7f5c7d735ca0, tool=0x7fff9df81ae0,
file_offset=2914716832, file_path=0x105ffbf0 "perf.data") at util/session.c:1367
#21 0x00000000005e8d93 in do_flush (oe=0x105ffa50, show_progress=false) at util/ordered-events.c:245
#22 __ordered_events__flush (oe=0x105ffa50, how=OE_FLUSH__ROUND, timestamp=<optimized out>) at util/ordered-events.c:324
#23 0x00000000005e1f64 in perf_session__process_user_event (session=0x105ff2c0, event=0x7f5c7d752b18, file_offset=2914835224,
file_path=0x105ffbf0 "perf.data") at util/session.c:1419
#24 0x00000000005e47c7 in reader__read_event (rd=rd@entry=0x7fff9df81260, session=session@entry=0x105ff2c0,
prog=prog@entry=0x7fff9df81220) at util/session.c:2132
#25 0x00000000005e4b37 in reader__process_events (rd=0x7fff9df81260, session=0x105ff2c0, prog=0x7fff9df81220)
at util/session.c:2181
#26 __perf_session__process_events (session=0x105ff2c0) at util/session.c:2226
#27 perf_session__process_events (session=session@entry=0x105ff2c0) at util/session.c:2390
#28 0x0000000000460add in __cmd_report (rep=0x7fff9df81ae0) at builtin-report.c:1076
#29 cmd_report (argc=<optimized out>, argv=<optimized out>) at builtin-report.c:1827
#30 0x00000000004c5a40 in run_builtin (p=p@entry=0xd8f7f8 <commands+312>, argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0)
at perf.c:351
#31 0x00000000004c5d63 in handle_internal_command (argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:404
#32 0x0000000000442de3 in run_argv (argcp=<synthetic pointer>, argv=<synthetic pointer>) at perf.c:448
#33 main (argc=<optimized out>, argv=0x7fff9df844b0) at perf.c:556
The hang happens because nothing in `perf` or `elfutils` checks whether a
mapped file is easily readable.
The change conservatively skips all non-regular files.
Signed-off-by: Sergei Trofimovich <slyich@gmail.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250505174419.2814857-1-slyich@gmail.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
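One way to express "skip all non-regular files" is a stat()-based check, sketched below. The helper name `is_regular_file()` is hypothetical, and this is not necessarily how the perf change itself is implemented. It only illustrates the idea with standard POSIX calls.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Return true only when path names a regular file. Character devices,
 * directories, FIFOs and paths that cannot be stat()ed all yield false,
 * so the caller can skip them before handing the path to an ELF reader.
 */
static bool is_regular_file(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return false;

	return S_ISREG(st.st_mode);
}
```

A caller would test a mapping's backing path with this helper and skip the unwinder's ELF lookup when it returns false.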
Symbolize stack traces by creating a live machine. Add this
functionality to dump_stack and switch dump_stack users to use
it. Switch TUI to use it. Add stack traces to the child test function
which can be useful to diagnose blocked code.
Example output:
```
$ perf test -vv PERF_RECORD_
...
7: PERF_RECORD_* events & perf_sample fields:
7: PERF_RECORD_* events & perf_sample fields : Running (1 active)
^C
Signal (2) while running tests.
Terminating tests with the same signal
Internal test harness failure. Completing any started tests:
: 7: PERF_RECORD_* events & perf_sample fields:
---- unexpected signal (2) ----
#0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0
#1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#2 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64
#3 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72
#4 0x7fc12fef1393 in __nanosleep nanosleep.c:26
#5 0x7fc12ff02d68 in __sleep sleep.c:55
#6 0x55788c63196b in test__PERF_RECORD perf-record.c:0
#7 0x55788c620fb0 in run_test_child builtin-test.c:0
#8 0x55788c5bd18d in start_command run-command.c:127
#9 0x55788c621ef3 in __cmd_test builtin-test.c:0
#10 0x55788c6225bf in cmd_test ??:0
#11 0x55788c5afbd0 in run_builtin perf.c:0
#12 0x55788c5afeeb in handle_internal_command perf.c:0
#13 0x55788c52b383 in main ??:0
#14 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74
#15 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
#16 0x55788c52b9d1 in _start ??:0
---- unexpected signal (2) ----
#0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0
#1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#2 0x7fc12fea3a14 in pthread_sigmask@GLIBC_2.2.5 pthread_sigmask.c:45
#3 0x7fc12fe49fd9 in __GI___sigprocmask sigprocmask.c:26
#4 0x7fc12ff2601b in __longjmp_chk longjmp.c:36
#5 0x55788c6210c0 in print_test_result.isra.0 builtin-test.c:0
#6 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#7 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64
#8 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72
#9 0x7fc12fef1393 in __nanosleep nanosleep.c:26
#10 0x7fc12ff02d68 in __sleep sleep.c:55
#11 0x55788c63196b in test__PERF_RECORD perf-record.c:0
#12 0x55788c620fb0 in run_test_child builtin-test.c:0
#13 0x55788c5bd18d in start_command run-command.c:127
#14 0x55788c621ef3 in __cmd_test builtin-test.c:0
#15 0x55788c6225bf in cmd_test ??:0
#16 0x55788c5afbd0 in run_builtin perf.c:0
#17 0x55788c5afeeb in handle_internal_command perf.c:0
#18 0x55788c52b383 in main ??:0
#19 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74
#20 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
#21 0x55788c52b9d1 in _start ??:0
7: PERF_RECORD_* events & perf_sample fields : Skip (permissions)
```
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250624210500.2121303-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Calling perf top with branch filters enabled on Intel CPUs
with branch counters logging (a.k.a. LBR event logging [1]) support
results in a segfault.
$ perf top -e '{cpu_core/cpu-cycles/,cpu_core/event=0xc6,umask=0x3,frontend=0x11,name=frontend_retired_dsb_miss/}' -j any,counter
...
Thread 27 "perf" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffafff76c0 (LWP 949003)]
perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653
653 *width = env->cpu_pmu_caps ? env->br_cntr_width :
(gdb) bt
#0 perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653
#1 0x00000000005b1599 in symbol__account_br_cntr (branch=0x7fffcc3db580, evsel=0xfea2d0, offset=12, br_cntr=8) at util/annotate.c:345
#2 0x00000000005b17fb in symbol__account_cycles (addr=5658172, start=5658160, sym=0x7fffcc0ee420, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:389
#3 0x00000000005b1976 in addr_map_symbol__account_cycles (ams=0x7fffcd7b01d0, start=0x7fffcd7b02b0, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:422
#4 0x000000000068d57f in hist__account_cycles (bs=0x110d288, al=0x7fffafff6540, sample=0x7fffafff6760, nonany_branch_mode=false, total_cycles=0x0, evsel=0xfea2d0) at util/hist.c:2850
#5 0x0000000000446216 in hist_iter__top_callback (iter=0x7fffafff6590, al=0x7fffafff6540, single=true, arg=0x7fffffff9e00) at builtin-top.c:737
#6 0x0000000000689787 in hist_entry_iter__add (iter=0x7fffafff6590, al=0x7fffafff6540, max_stack_depth=127, arg=0x7fffffff9e00) at util/hist.c:1359
#7 0x0000000000446710 in perf_event__process_sample (tool=0x7fffffff9e00, event=0x110d250, evsel=0xfea2d0, sample=0x7fffafff6760, machine=0x108c968) at builtin-top.c:845
#8 0x0000000000447735 in deliver_event (qe=0x7fffffffa120, qevent=0x10fc200) at builtin-top.c:1211
#9 0x000000000064ccae in do_flush (oe=0x7fffffffa120, show_progress=false) at util/ordered-events.c:245
#10 0x000000000064d005 in __ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP, timestamp=0) at util/ordered-events.c:324
#11 0x000000000064d0ef in ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP) at util/ordered-events.c:342
#12 0x00000000004472a9 in process_thread (arg=0x7fffffff9e00) at builtin-top.c:1120
#13 0x00007ffff6e7dba8 in start_thread (arg=<optimized out>) at pthread_create.c:448
#14 0x00007ffff6f01b8c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
The cause is that perf_env__find_br_cntr_info tries to access a
null pointer pmu_caps in the perf_env struct. A similar issue exists
for homogeneous core systems which use the cpu_pmu_caps structure.
Fix this by populating cpu_pmu_caps and pmu_caps structures with
values from sysfs when calling perf top with branch stack sampling
enabled.
[1] LBR event logging was introduced here:
https://lore.kernel.org/all/20231025201626.3000228-5-kan.liang@linux.intel.com/
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://lore.kernel.org/r/20250612163659.1357950-2-thomas.falcon@intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
syzbot [1] reported the following:
R10: 0000000000000100 R11: 0000000000000206 R12: 00007ffe17473450
R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520
</TASK>
---[ end trace 0000000000000000 ]---
==================================================================
BUG: KASAN: use-after-free in __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62
Read of size 8 at addr ffff88812d962278 by task syz-executor/564
CPU: 1 PID: 564 Comm: syz-executor Tainted: G W 6.1.129-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
<TASK>
__dump_stack+0x21/0x24 lib/dump_stack.c:88
dump_stack_lvl+0xee/0x158 lib/dump_stack.c:106
print_address_description+0x71/0x210 mm/kasan/report.c:316
print_report+0x4a/0x60 mm/kasan/report.c:427
kasan_report+0x122/0x150 mm/kasan/report.c:531
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report_generic.c:351
__list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62
__list_del_entry include/linux/list.h:134 [inline]
list_del_init include/linux/list.h:206 [inline]
f2fs_inode_synced+0xf7/0x2e0 fs/f2fs/super.c:1531
f2fs_update_inode+0x74/0x1c40 fs/f2fs/inode.c:585
f2fs_update_inode_page+0x137/0x170 fs/f2fs/inode.c:703
f2fs_write_inode+0x4ec/0x770 fs/f2fs/inode.c:731
write_inode fs/fs-writeback.c:1460 [inline]
__writeback_single_inode+0x4a0/0xab0 fs/fs-writeback.c:1677
writeback_single_inode+0x221/0x8b0 fs/fs-writeback.c:1733
sync_inode_metadata+0xb6/0x110 fs/fs-writeback.c:2789
f2fs_sync_inode_meta+0x16d/0x2a0 fs/f2fs/checkpoint.c:1159
block_operations fs/f2fs/checkpoint.c:1269 [inline]
f2fs_write_checkpoint+0xca3/0x2100 fs/f2fs/checkpoint.c:1658
kill_f2fs_super+0x231/0x390 fs/f2fs/super.c:4668
deactivate_locked_super+0x98/0x100 fs/super.c:332
deactivate_super+0xaf/0xe0 fs/super.c:363
cleanup_mnt+0x45f/0x4e0 fs/namespace.c:1186
__cleanup_mnt+0x19/0x20 fs/namespace.c:1193
task_work_run+0x1c6/0x230 kernel/task_work.c:203
exit_task_work include/linux/task_work.h:39 [inline]
do_exit+0x9fb/0x2410 kernel/exit.c:871
do_group_exit+0x210/0x2d0 kernel/exit.c:1021
__do_sys_exit_group kernel/exit.c:1032 [inline]
__se_sys_exit_group kernel/exit.c:1030 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1030
x64_sys_call+0x7b4/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x68/0xd2
RIP: 0033:0x7f28b1b8e169
Code: Unable to access opcode bytes at 0x7f28b1b8e13f.
RSP: 002b:00007ffe174710a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f28b1c10879 RCX: 00007f28b1b8e169
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 0000000000000002 R08: 00007ffe1746ee47 R09: 00007ffe17472360
R10: 0000000000000009 R11: 0000000000000246 R12: 00007ffe17472360
R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520
</TASK>
Allocated by task 569:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4b/0x70 mm/kasan/common.c:52
kasan_save_alloc_info+0x25/0x30 mm/kasan/generic.c:505
__kasan_slab_alloc+0x72/0x80 mm/kasan/common.c:328
kasan_slab_alloc include/linux/kasan.h:201 [inline]
slab_post_alloc_hook+0x4f/0x2c0 mm/slab.h:737
slab_alloc_node mm/slub.c:3398 [inline]
slab_alloc mm/slub.c:3406 [inline]
__kmem_cache_alloc_lru mm/slub.c:3413 [inline]
kmem_cache_alloc_lru+0x104/0x220 mm/slub.c:3429
alloc_inode_sb include/linux/fs.h:3245 [inline]
f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419
alloc_inode fs/inode.c:261 [inline]
iget_locked+0x186/0x880 fs/inode.c:1373
f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483
f2fs_lookup+0x366/0xab0 fs/f2fs/namei.c:487
__lookup_slow+0x2a3/0x3d0 fs/namei.c:1690
lookup_slow+0x57/0x70 fs/namei.c:1707
walk_component+0x2e6/0x410 fs/namei.c:1998
lookup_last fs/namei.c:2455 [inline]
path_lookupat+0x180/0x490 fs/namei.c:2479
filename_lookup+0x1f0/0x500 fs/namei.c:2508
vfs_statx+0x10b/0x660 fs/stat.c:229
vfs_fstatat fs/stat.c:267 [inline]
vfs_lstat include/linux/fs.h:3424 [inline]
__do_sys_newlstat fs/stat.c:423 [inline]
__se_sys_newlstat+0xd5/0x350 fs/stat.c:417
__x64_sys_newlstat+0x5b/0x70 fs/stat.c:417
x64_sys_call+0x393/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:7
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x68/0xd2
Freed by task 13:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4b/0x70 mm/kasan/common.c:52
kasan_save_free_info+0x31/0x50 mm/kasan/generic.c:516
____kasan_slab_free+0x132/0x180 mm/kasan/common.c:236
__kasan_slab_free+0x11/0x20 mm/kasan/common.c:244
kasan_slab_free include/linux/kasan.h:177 [inline]
slab_free_hook mm/slub.c:1724 [inline]
slab_free_freelist_hook+0xc2/0x190 mm/slub.c:1750
slab_free mm/slub.c:3661 [inline]
kmem_cache_free+0x12d/0x2a0 mm/slub.c:3683
f2fs_free_inode+0x24/0x30 fs/f2fs/super.c:1562
i_callback+0x4c/0x70 fs/inode.c:250
rcu_do_batch+0x503/0xb80 kernel/rcu/tree.c:2297
rcu_core+0x5a2/0xe70 kernel/rcu/tree.c:2557
rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2574
handle_softirqs+0x178/0x500 kernel/softirq.c:578
run_ksoftirqd+0x28/0x30 kernel/softirq.c:945
smpboot_thread_fn+0x45a/0x8c0 kernel/smpboot.c:164
kthread+0x270/0x310 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
Last potentially related work creation:
kasan_save_stack+0x3a/0x60 mm/kasan/common.c:45
__kasan_record_aux_stack+0xb6/0xc0 mm/kasan/generic.c:486
kasan_record_aux_stack_noalloc+0xb/0x10 mm/kasan/generic.c:496
call_rcu+0xd4/0xf70 kernel/rcu/tree.c:2845
destroy_inode fs/inode.c:316 [inline]
evict+0x7da/0x870 fs/inode.c:720
iput_final fs/inode.c:1834 [inline]
iput+0x62b/0x830 fs/inode.c:1860
do_unlinkat+0x356/0x540 fs/namei.c:4397
__do_sys_unlink fs/namei.c:4438 [inline]
__se_sys_unlink fs/namei.c:4436 [inline]
__x64_sys_unlink+0x49/0x50 fs/namei.c:4436
x64_sys_call+0x958/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:88
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x68/0xd2
The buggy address belongs to the object at ffff88812d961f20
which belongs to the cache f2fs_inode_cache of size 1200
The buggy address is located 856 bytes inside of
1200-byte region [ffff88812d961f20, ffff88812d9623d0)
The buggy address belongs to the physical page:
page:ffffea0004b65800 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12d960
head:ffffea0004b65800 order:2 compound_mapcount:0 compound_pincount:0
flags: 0x4000000000010200(slab|head|zone=1)
raw: 4000000000010200 0000000000000000 dead000000000122 ffff88810a94c500
raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 2, migratetype Reclaimable, gfp_mask 0x1d2050(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_RECLAIMABLE), pid 569, tgid 568 (syz.2.16), ts 55943246141, free_ts 0
set_page_owner include/linux/page_owner.h:31 [inline]
post_alloc_hook+0x1d0/0x1f0 mm/page_alloc.c:2532
prep_new_page mm/page_alloc.c:2539 [inline]
get_page_from_freelist+0x2e63/0x2ef0 mm/page_alloc.c:4328
__alloc_pages+0x235/0x4b0 mm/page_alloc.c:5605
alloc_slab_page include/linux/gfp.h:-1 [inline]
allocate_slab mm/slub.c:1939 [inline]
new_slab+0xec/0x4b0 mm/slub.c:1992
___slab_alloc+0x6f6/0xb50 mm/slub.c:3180
__slab_alloc+0x5e/0xa0 mm/slub.c:3279
slab_alloc_node mm/slub.c:3364 [inline]
slab_alloc mm/slub.c:3406 [inline]
__kmem_cache_alloc_lru mm/slub.c:3413 [inline]
kmem_cache_alloc_lru+0x13f/0x220 mm/slub.c:3429
alloc_inode_sb include/linux/fs.h:3245 [inline]
f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419
alloc_inode fs/inode.c:261 [inline]
iget_locked+0x186/0x880 fs/inode.c:1373
f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483
f2fs_fill_super+0x3ad7/0x6bb0 fs/f2fs/super.c:4293
mount_bdev+0x2ae/0x3e0 fs/super.c:1443
f2fs_mount+0x34/0x40 fs/f2fs/super.c:4642
legacy_get_tree+0xea/0x190 fs/fs_context.c:632
vfs_get_tree+0x89/0x260 fs/super.c:1573
do_new_mount+0x25a/0xa20 fs/namespace.c:3056
page_owner free stack trace missing
Memory state around the buggy address:
ffff88812d962100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88812d962180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88812d962200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88812d962280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88812d962300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
[1] https://syzkaller.appspot.com/x/report.txt?x=13448368580000
This bug can be reproduced with the reproducer [2]. Once the
CONFIG_F2FS_CHECK_FS config is enabled, the reproducer triggers the panic
below, so the direct cause of this bug is the same as the one fixed by
patch [3].
kernel BUG at fs/f2fs/inode.c:857!
RIP: 0010:f2fs_evict_inode+0x1204/0x1a20
Call Trace:
<TASK>
evict+0x32a/0x7a0
do_unlinkat+0x37b/0x5b0
__x64_sys_unlink+0xad/0x100
do_syscall_64+0x5a/0xb0
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0010:f2fs_evict_inode+0x1204/0x1a20
[2] https://syzkaller.appspot.com/x/repro.c?x=17495ccc580000
[3] https://lore.kernel.org/linux-f2fs-devel/20250702120321.1080759-1-chao@kernel.org
Tracepoints before panic:
f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file1
f2fs_unlink_exit: dev = (7,0), ino = 7, ret = 0
f2fs_evict_inode: dev = (7,0), ino = 7, pino = 3, i_mode = 0x81ed, i_size = 10, i_nlink = 0, i_blocks = 0, i_advise = 0x0
f2fs_truncate_node: dev = (7,0), ino = 7, nid = 8, block_address = 0x3c05
f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file3
f2fs_unlink_exit: dev = (7,0), ino = 8, ret = 0
f2fs_evict_inode: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 9000, i_nlink = 0, i_blocks = 24, i_advise = 0x4
f2fs_truncate: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 0, i_nlink = 0, i_blocks = 24, i_advise = 0x4
f2fs_truncate_blocks_enter: dev = (7,0), ino = 8, i_size = 0, i_blocks = 24, start file offset = 0
f2fs_truncate_blocks_exit: dev = (7,0), ino = 8, ret = -2
The root cause is: in the fuzzed image, dnode #8 belongs to inode #7,
so after inode #7 is evicted, dnode #8 is dropped.
However, there is a dirent that has ino #8, so once we unlink file3,
both f2fs_truncate() and f2fs_update_inode_page() in f2fs_evict_inode()
fail because node #8 cannot be loaded. As a result, we miss the call to
f2fs_inode_synced() that clears the inode dirty status.
Let's fix this by calling f2fs_inode_synced() in the error path of
f2fs_evict_inode().
PS: As I verified, the reproducer [2] can trigger this bug in v6.1.129,
but it fails to do so in v6.16-rc4, because the testcase stops once
f2fs detects other corruption:
F2FS-fs (loop0): inconsistent node block, node_type:2, nid:8, node_footer[nid:8,ino:8,ofs:0,cpver:5013063228981249506,blkaddr:15366]
F2FS-fs (loop0): f2fs_lookup: inode (ino=9) has zero i_nlink
Fixes: 0f18b46 ("f2fs: flush inode metadata when checkpoint is doing")
Closes: https://syzkaller.appspot.com/x/report.txt?x=13448368580000
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Ido Schimmel says:

====================
ipv4: icmp: Fix source IP derivation in presence of VRFs

Align IPv4 with IPv6 and in the presence of VRFs generate ICMP error messages with a source IP that is derived from the receiving interface and not from its VRF master. This is especially important when the error messages are "Time Exceeded" messages as it means that utilities like traceroute will show an incorrect packet path.

Patches #1-#2 are preparations.
Patch #3 is the actual change.
Patches #4-#7 make small improvements in the existing traceroute test.
Patch #8 extends the traceroute test with VRF test cases for both IPv4 and IPv6.

Changes since v1 [1]:
* Rebase.

[1] https://lore.kernel.org/netdev/20250901083027.183468-1-idosch@nvidia.com/
====================

Link: https://patch.msgid.link/20250908073238.119240-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Machata says:

====================
bridge: Allow keeping local FDB entries only on VLAN 0

The bridge FDB contains one local entry per port per VLAN, for the MAC of the port in question, and likewise for the bridge itself. This allows bridge to locally receive and punt "up" any packets whose destination MAC address matches that of one of the bridge interfaces or of the bridge itself.

The number of these local "service" FDB entries grows linearly with number of bridge-global VLAN memberships, but that in turn will tend to grow quadratically with number of ports and per-port VLAN memberships. While that does not cause issues during forwarding lookups, it does make dumps impractically slow.

As an example, with 100 interfaces, each on 4K VLANs, a full dump of FDB that just contains these 400K local entries, takes 6.5s. That's _without_ considering iproute2 formatting overhead, this is just how long it takes to walk the FDB (repeatedly), serialize it into netlink messages, and parse the messages back in userspace. This is to illustrate that with growing number of ports and VLANs, the time required to dump this repetitive information blows up.

Arguably 4K VLANs per interface is not a very realistic configuration, but then modern switches can instead have several hundred interfaces, and we have fielded requests for >1K VLAN memberships per port among customers.

FDB entries are currently all kept on a single linked list, and then dumping uses this linked list to walk all entries and dump them in order. When the message buffer is full, the iteration is cut short, and later restarted. Of course, to restart the iteration, it's first necessary to walk the already-dumped front part of the list before starting dumping again. So one possibility is to organize the FDB entries in a different structure more amenable to walk restarts.

One option is to walk the hash table directly. The advantage is that no auxiliary structure needs to be introduced. With a rough sketch of this approach, the above scenario gets dumped in not quite 3 s, saving over 50 % of time. However hash table iteration requires maintaining an active cursor that must be collected when the dump is aborted. It looks like that would require changes in the NDO protocol to allow to run this cleanup. Moreover, on hash table resize the iteration is simply restarted. FDB dumps are currently not guaranteed to correspond to any one particular state: entries can be missed, or be duplicated. But with hash table iteration we would get that plus the much less graceful resize behavior, where swaths of FDB are duplicated.

Another option is to maintain the FDB entries in a red-black tree. We have a PoC of this approach on hand, and the above scenario is dumped in about 2.5 s. Still not as snappy as we'd like it, but better than the hash table. However the savings come at the expense of a more expensive insertion, and require locking during dumps, which blocks insertion.

The upside of these approaches is that they provide benefits whatever the FDB contents. But it does not seem like either of these is workable. However we intend to clean up the RB tree PoC and present it for consideration later on in case the trade-offs are considered acceptable.

Yet another option might be to use in-kernel FDB filtering, and to filter the local entries when dumping. Unfortunately, this does not help all that much either, because the linked-list walk still needs to happen. Also, with the obvious filtering interface built around ndm_flags / ndm_state filtering, one can't just exclude pure local entries in one query. One needs to dump all non-local entries first, and then to get permanent entries in another run filter local & added_by_user. I.e. one needs to pay the iteration overhead twice, and then integrate the result in userspace. To get significant savings, one would need a very specific knob like "dump, but skip/only include local entries".

But if we are adding local-specific knobs, maybe let's have an option to just not duplicate them in the first place. All this FDB duplication is there merely to make things snappy during forwarding. But high-radix switches with thousands of VLANs typically do not process much traffic in the SW datapath at all, but rather offload vast majority of it. So we could exchange some of the runtime performance for a neater FDB.

To that end, in this patchset, introduce a new bridge option, BR_BOOLOPT_FDB_LOCAL_VLAN_0, which when enabled, has local FDB entries installed only on VLAN 0, instead of duplicating them across all VLANs. Then to maintain the local termination behavior, on FDB miss, the bridge does a second lookup on VLAN 0.

Enabling this option changes the bridge behavior in expected ways. Since the entries are only kept on VLAN 0, FDB get, flush and dump will not perceive them on non-0 VLANs. And deleting the VLAN 0 entry affects forwarding on all VLANs.

This patchset is loosely based on a privately circulated patch by Nikolay Aleksandrov.

The patchset progresses as follows:

- Patch #1 introduces a bridge option to enable the above feature. Then patches #2 to #5 gradually patch the bridge to do the right thing when the option is enabled. Finally patch #6 adds the UAPI knob and the code for when the feature is enabled or disabled.
- Patches #7, #8 and #9 contain fixes and improvements to selftest libraries
- Patch #10 contains a new selftest
====================

Link: https://patch.msgid.link/cover.1757004393.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
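The "second lookup on VLAN 0" behaviour from the bridge cover letter above can be modelled in a few lines. This is a toy sketch, assuming a flat array in place of the bridge's real hash table. `fdb_entry_model`, `fdb_find()`, `fdb_lookup()`, and the `local_vlan_0` parameter are hypothetical names for illustration and do not reflect the actual bridge code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical FDB entry: a MAC address bound to a VLAN ID. */
struct fdb_entry_model {
	uint8_t mac[6];
	uint16_t vid;
};

/* Exact-match lookup on (mac, vid); a linear scan is enough for a model. */
static const struct fdb_entry_model *
fdb_find(const struct fdb_entry_model *tbl, size_t n,
	 const uint8_t mac[6], uint16_t vid)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].vid == vid && memcmp(tbl[i].mac, mac, 6) == 0)
			return &tbl[i];
	return NULL;
}

/*
 * Lookup with the optional fallback: when the option is enabled and the
 * per-VLAN lookup misses, retry on VLAN 0, where the local entries live.
 */
static const struct fdb_entry_model *
fdb_lookup(const struct fdb_entry_model *tbl, size_t n,
	   const uint8_t mac[6], uint16_t vid, bool local_vlan_0)
{
	const struct fdb_entry_model *e = fdb_find(tbl, n, mac, vid);

	if (!e && local_vlan_0 && vid != 0)
		e = fdb_find(tbl, n, mac, 0);

	return e;
}
```

In this model, a MAC stored only on VLAN 0 is found from any VLAN once the option is on. The cost is one extra lookup per miss, which is the runtime-for-neatness trade-off the cover letter describes.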
…CAN XL step 3/3"
Vincent Mailhol <mailhol@kernel.org> says:
In November last year, I sent an RFC to introduce CAN XL [1]. That
RFC, despite positive feedback, was put on hold due to some unanswered
questions concerning the PWM encoding [2].
While stuck, some small preparation work was done in parallel in [3]
by refactoring the struct can_priv and doing some trivial clean-up and
renaming. Initially, [3] received zero feedback but was eventually
merged after splitting it into smaller parts and resending it.
Finally, in July this year, we clarified the remaining mysteries about
PWM calculation, thus unlocking the series. Summer being a bit busy
because of some personal matters brings us to now.
After doing all the refactoring and adding all the CAN XL features,
the final result is more than 30 patches, definitely too many for a
single series. So I am splitting the remaining changes into three:
- can: rework the CAN MTU logic [4]
- can: netlink: preparation before introduction of CAN XL (this series)
- CAN XL (will come right after the two preparation series get merged)
And thus, this series continues and finishes the preparation work done
in [3] and [4]. It contains all the refactoring needed to smoothly
introduce CAN XL. The goal is to:
- split the functions in smaller pieces: CAN XL will introduce a
fair amount of code. And some functions which are already fairly
long (86 lines for can_validate(), 215 lines for can_changelink())
would grow to disproportionate sizes if the CAN XL logic were to
be inlined in those functions.
- repurpose the existing code to handle both CAN FD and CAN XL: a
huge part of CAN XL simply reuses the CAN FD logic. All the
existing CAN FD logic is made more generic to handle both CAN FD
and XL.
In more detail:
- Patch #1 moves struct data_bittiming_params from dev.h to
bittiming.h and patch #2 makes can_get_relative_tdco() FD agnostic
before also moving it to bittiming.h.
- Patch #3 adds some comments to netlink.h tagging which IFLA
symbols are FD specific.
- Patches #4 to #6 are refactoring can_validate() and
can_validate_bittiming().
- Patches #7 to #11 are refactoring can_changelink() and
can_tdc_changelink().
- Patches #12 and #13 are refactoring can_get_size() and
can_tdc_get_size().
- Patches #14 to #17 are refactoring can_fill_info() and
can_tdc_fill_info().
- Patch #18 makes can_calc_tdco() FD agnostic.
- Patch #19 adds can_get_ctrlmode_str() which converts control mode
flags into strings. This is done in preparation for patch #20.
- Patch #20 is the final patch and improves the user experience by
providing detailed error messages whenever invalid parameters are
provided. All those error messages came in handy when debugging
the upcoming CAN XL patches.
Aside from the last patch, the other changes do not impact any
existing functionality.
The follow-up series which introduces CAN XL is nearly complete but
will be sent only once this one is approved: one thing at a time, I do
not want to overwhelm people (including myself).
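To illustrate how a flag-to-string helper can feed detailed error messages, as patches #19 and #20 are described as doing, here is a small user-space sketch. Everything in it is an assumption made for illustration: the `DEMO_MODE_*` values, the `demo_*` function names and the message wording are hypothetical, and the signature of the real can_get_ctrlmode_str() is not shown in this cover letter.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative flag values only; not copied from the kernel UAPI. */
#define DEMO_MODE_LOOPBACK   (1u << 0)
#define DEMO_MODE_LISTENONLY (1u << 1)
#define DEMO_MODE_FD         (1u << 2)

/* Hypothetical helper: map exactly one flag bit to a readable name. */
static const char *demo_ctrlmode_str(uint32_t flag)
{
	switch (flag) {
	case DEMO_MODE_LOOPBACK:   return "loopback";
	case DEMO_MODE_LISTENONLY: return "listen-only";
	case DEMO_MODE_FD:         return "fd";
	default:                   return "<unknown>";
	}
}

/*
 * Hypothetical validation step: reject requested flags that the device
 * does not support and name the first offending flag in buf.
 * Returns 0 on success, -1 on error.
 */
static int demo_validate_ctrlmode(uint32_t requested, uint32_t supported,
				  char *buf, size_t len)
{
	uint32_t bad = requested & ~supported;
	uint32_t first;

	if (!bad)
		return 0;

	/* Isolate the lowest set bit among the unsupported flags. */
	first = bad & (~bad + 1u);
	snprintf(buf, len, "%s mode is not supported",
		 demo_ctrlmode_str(first));
	return -1;
}
```

Keeping the name lookup separate from the validation lets every error path reuse the same strings instead of repeating them in each message.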
[1] https://lore.kernel.org/linux-can/20241110155902.72807-16-mailhol.vincent@wanadoo.fr/
[2] https://lore.kernel.org/linux-can/c4771c16-c578-4a6d-baee-918fe276dbe9@wanadoo.fr/
[3] https://lore.kernel.org/linux-can/20241110155902.72807-16-mailhol.vincent@wanadoo.fr/
[4] https://lore.kernel.org/linux-can/20250923-can-fix-mtu-v2-0-984f9868db69@kernel.org/
Link: https://patch.msgid.link/20250923-canxl-netlink-prep-v4-0-e720d28f66fe@kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Before disabling SR-IOV via config space accesses to the parent PF, sriov_disable() first removes the PCI devices representing the VFs.
Since commit 9d16947 ("PCI: Add global pci_lock_rescan_remove()") such removal operations are serialized against concurrent remove and rescan using the pci_rescan_remove_lock. No such locking was ever added in sriov_disable() however. In particular when commit 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device removal into sriov_del_vfs() there was still no locking around the pci_iov_remove_virtfn() calls.
On s390 the lack of serialization in sriov_disable() may cause double remove and list corruption with the below (amended) trace being observed:
PSW: 0704c00180000000 0000000c914e4b38 (klist_put+56)
GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001
      00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480
      0000000000000001 0000000000000000 0000000000000000 0000000180692828
      00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8
#0 [3800313fb20] device_del at c9158ad5c
#1 [3800313fb88] pci_remove_bus_device at c915105ba
#2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198
#3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0
#4 [3800313fc60] zpci_bus_remove_device at c90fb6104
#5 [3800313fca0] __zpci_event_availability at c90fb3dca
#6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2
#7 [3800313fd60] crw_collect_info at c91905822
#8 [3800313fe10] kthread at c90feb390
#9 [3800313fe68] __ret_from_fork at c90f6aa64
#10 [3800313fe98] ret_from_fork at c9194f3f2
This is because in addition to sriov_disable() removing the VFs, the platform also generates hot-unplug events for the VFs. This is the reverse operation to the hotplug events generated by sriov_enable() and handled via pdev->no_vf_scan. And while the event processing takes pci_rescan_remove_lock and checks whether the struct pci_dev still exists, the lack of synchronization makes this checking racy.
Other races may also be possible, of course, though given that this lack of locking persisted so long, observable races seem very rare. Even on s390 the list corruption was only observed with certain devices since the platform events are only triggered by config accesses after the removal, so as long as the removal finished synchronously they would not race. Either way the locking is missing so fix this by adding it to the sriov_del_vfs() helper.
Likewise, PCI rescan-remove locking is also missing in sriov_add_vfs(), including for the error case where pci_stop_and_remove_bus_device() is called without the PCI rescan-remove lock being held. Even in the non-error case, adding new PCI devices and buses should be serialized via the PCI rescan-remove lock. Add the necessary locking.
Fixes: 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()")
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Farhan Ali <alifm@linux.ibm.com>
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250826-pci_fix_sriov_disable-v1-1-2d0bc938f2a3@linux.ibm.com
When s_start() fails to allocate memory for set_event_iter, it returns NULL before acquiring event_mutex. However, the corresponding s_stop() function always tries to unlock the mutex, causing a lock imbalance warning:
WARNING: bad unlock balance detected!
6.17.0-rc7-00175-g2b2e0c04f78c #7 Not tainted
-------------------------------------
syz.0.85611/376514 is trying to release lock (event_mutex) at:
[<ffffffff8dafc7a4>] traverse.part.0.constprop.0+0x2c4/0x650 fs/seq_file.c:131
but there are no more locks to release!
The issue was introduced by commit b355247 ("tracing: Cache ':mod:' events for modules not loaded yet") which added the kzalloc() allocation before the mutex lock, creating a path where s_start() could return without locking the mutex while s_stop() would still try to unlock it.
Fix this by unconditionally acquiring the mutex immediately after allocation, regardless of whether the allocation succeeded.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250929113238.3722055-1-sashal@kernel.org
Fixes: b355247 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
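The shape of the fix described above (take the lock right after the allocation attempt, whether or not it succeeded, so that the stop callback's unlock is always balanced) can be sketched in user space with a pthread mutex. This is a simplified illustration and not the tracing code: the `demo_*` names are hypothetical, the `fail_alloc` parameter only simulates an allocation failure, the depth counter exists only so the balance can be observed, and freeing the iterator in the stop function is a simplification of this sketch.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static int demo_lock_depth;	/* observable lock balance, for illustration */

struct demo_iter {
	int pos;
};

/*
 * Start callback: allocate first, then lock unconditionally.
 * Returning NULL with the lock held is fine because the stop
 * callback is always invoked afterwards and always unlocks.
 */
static void *demo_start(int fail_alloc)
{
	struct demo_iter *iter = fail_alloc ? NULL : calloc(1, sizeof(*iter));

	pthread_mutex_lock(&demo_mutex);
	demo_lock_depth++;

	return iter;	/* NULL on allocation failure */
}

/* Stop callback: always unlocks, mirroring the unconditional lock above. */
static void demo_stop(void *iter)
{
	demo_lock_depth--;
	pthread_mutex_unlock(&demo_mutex);
	free(iter);	/* free(NULL) is a no-op */
}
```

Had `demo_start()` returned early on a failed allocation, before taking the lock, `demo_stop()` would have released a mutex that was never acquired, which is the imbalance the warning reports.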
Pull request for series with
subject: tools: bpftool: support creating and dumping outer maps
version: 1
url: https://patchwork.ozlabs.org/project/netdev/list/?series=199591