forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 1
[CI run] bpf: Set flow flag to allow any source IP in bpf_tunnel_key #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This commit extends the ip_tunnel_key struct with a new field for the flow flags, to pass them to the route lookups. This new field will be populated and used in subsequent commits. Signed-off-by: Paul Chaignon <paul@isovalent.com>
Use the new ip_tunnel_key field with the flow flags in the route lookups for the encapsulated packet. This will be used by the bpf_skb_set_tunnel_key helper in a subsequent commit. Signed-off-by: Paul Chaignon <paul@isovalent.com>
Use the new ip_tunnel_key field with the flow flags in the route lookups for the encapsulated packet. This will be used by the bpf_skb_set_tunnel_key helper in the subsequent commit. Signed-off-by: Paul Chaignon <paul@isovalent.com>
Commit 26101f5 ("bpf: Add source ip in "struct bpf_tunnel_key"") added support for getting and setting the outer source IP of encapsulated packets via the bpf_skb_{get,set}_tunnel_key BPF helper. This change allows BPF programs to set any IP address as the source, including for example the IP address of a container running on the same host. In that last case, however, the encapsulated packets are dropped when looking up the route because the source IP address isn't assigned to any interface on the host. To avoid this, we need to set the FLOWI_FLAG_ANYSRC flag. Fixes: 26101f5 ("bpf: Add source ip in "struct bpf_tunnel_key"") Signed-off-by: Paul Chaignon <paul@isovalent.com>
be4c339
to
72f75ae
Compare
pchaigno
pushed a commit
that referenced
this pull request
Jul 25, 2022
…tion Each cset (css_set) is pinned by its tasks. When we're moving tasks around across csets for a migration, we need to hold the source and destination csets to ensure that they don't go away while we're moving tasks about. This is done by linking cset->mg_preload_node on either the mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the same cset->mg_preload_node for both the src and dst lists was deemed okay as a cset can't be both the source and destination at the same time. Unfortunately, this overloading becomes problematic when multiple tasks are involved in a migration and some of them are identity noop migrations while others are actually moving across cgroups. For example, this can happen with the following sequence on cgroup1: #1> mkdir -p /sys/fs/cgroup/misc/a/b #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS & #4> PID=$! #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs the process including the group leader back into a. In this final migration, non-leader threads would be doing identity migration while the group leader is doing an actual one. After #3, let's say the whole process was in cset A, and that after #4, the leader moves to cset B. Then, during #6, the following happens: 1. cgroup_migrate_add_src() is called on B for the leader. 2. cgroup_migrate_add_src() is called on A for the other threads. 3. cgroup_migrate_prepare_dst() is called. It scans the src list. 4. It notices that B wants to migrate to A, so it tries to A to the dst list but realizes that its ->mg_preload_node is already busy. 5. and then it notices A wants to migrate to A as it's an identity migration, it culls it by list_del_init()'ing its ->mg_preload_node and putting references accordingly. 6. The rest of migration takes place with B on the src list but nothing on the dst list. This means that A isn't held while migration is in progress. If all tasks leave A before the migration finishes and the incoming task pins it, the cset will be destroyed leading to use-after-free. This is caused by overloading cset->mg_preload_node for both src and dst preload lists. We wanted to exclude the cset from the src list but ended up inadvertently excluding it from the dst list too. This patch fixes the issue by separating out cset->mg_preload_node into ->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst preloadings don't interfere with each other. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Mukesh Ojha <quic_mojha@quicinc.com> Reported-by: shisiyuan <shisiyuan19870131@gmail.com> Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com Link: https://www.spinics.net/lists/cgroups/msg33313.html Fixes: f817de9 ("cgroup: prepare migration path for unified hierarchy") Cc: stable@vger.kernel.org # v3.16+
pchaigno
pushed a commit
that referenced
this pull request
Jul 25, 2022
…0.3.7 On the GC 10.3.7 platform the initial MEC release version #3 can support atomic operation,so need correct and set its MEC atomic support version to #3. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Aaron Liu <aaron.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 5.18.x
pchaigno
pushed a commit
that referenced
this pull request
Jul 25, 2022
On powerpc, 'perf trace' is crashing with a SIGSEGV when trying to process a perf.data file created with 'perf trace record -p': #0 0x00000001225b8988 in syscall_arg__scnprintf_augmented_string <snip> at builtin-trace.c:1492 #1 syscall_arg__scnprintf_filename <snip> at builtin-trace.c:1492 #2 syscall_arg__scnprintf_filename <snip> at builtin-trace.c:1486 #3 0x00000001225bdd9c in syscall_arg_fmt__scnprintf_val <snip> at builtin-trace.c:1973 #4 syscall__scnprintf_args <snip> at builtin-trace.c:2041 #5 0x00000001225bff04 in trace__sys_enter <snip> at builtin-trace.c:2319 That points to the below code in tools/perf/builtin-trace.c: /* * If this is raw_syscalls.sys_enter, then it always comes with the 6 possible * arguments, even if the syscall being handled, say "openat", uses only 4 arguments * this breaks syscall__augmented_args() check for augmented args, as we calculate * syscall->args_size using each syscalls:sys_enter_NAME tracefs format file, * so when handling, say the openat syscall, we end up getting 6 args for the * raw_syscalls:sys_enter event, when we expected just 4, we end up mistakenly * thinking that the extra 2 u64 args are the augmented filename, so just check * here and avoid using augmented syscalls when the evsel is the raw_syscalls one. */ if (evsel != trace->syscalls.events.sys_enter) augmented_args = syscall__augmented_args(sc, sample, &augmented_args_size, trace->raw_augmented_syscalls_args_size); As the comment points out, we should not be trying to augment the args for raw_syscalls. However, when processing a perf.data file, we are not initializing those properly. Fix the same. Reported-by: Claudio Carvalho <cclaudio@linux.ibm.com> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/lkml/20220707090900.572584-1-naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
pchaigno
pushed a commit
that referenced
this pull request
Jul 25, 2022
Add a lock_class_key per devlink instance to avoid DEADLOCK warning by lockdep, while locking more than one devlink instance in driver code, for example in opening VFs flow. Kernel log: [ 101.433802] ============================================ [ 101.433803] WARNING: possible recursive locking detected [ 101.433810] 5.19.0-rc1+ torvalds#35 Not tainted [ 101.433812] -------------------------------------------- [ 101.433813] bash/892 is trying to acquire lock: [ 101.433815] ffff888127bfc2f8 (&devlink->lock){+.+.}-{3:3}, at: probe_one+0x3c/0x690 [mlx5_core] [ 101.433909] but task is already holding lock: [ 101.433910] ffff888118f4c2f8 (&devlink->lock){+.+.}-{3:3}, at: mlx5_core_sriov_configure+0x62/0x280 [mlx5_core] [ 101.433989] other info that might help us debug this: [ 101.433990] Possible unsafe locking scenario: [ 101.433991] CPU0 [ 101.433991] ---- [ 101.433992] lock(&devlink->lock); [ 101.433993] lock(&devlink->lock); [ 101.433995] *** DEADLOCK *** [ 101.433996] May be due to missing lock nesting notation [ 101.433996] 6 locks held by bash/892: [ 101.433998] #0: ffff88810eb50448 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0xf3/0x1d0 [ 101.434009] #1: ffff888114777c88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x20d/0x520 [ 101.434017] #2: ffff888102b58660 (kn->active#231){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x230/0x520 [ 101.434023] #3: ffff888102d70198 (&dev->mutex){....}-{3:3}, at: sriov_numvfs_store+0x132/0x310 [ 101.434031] #4: ffff888118f4c2f8 (&devlink->lock){+.+.}-{3:3}, at: mlx5_core_sriov_configure+0x62/0x280 [mlx5_core] [ 101.434108] #5: ffff88812adce198 (&dev->mutex){....}-{3:3}, at: __device_attach+0x76/0x430 [ 101.434116] stack backtrace: [ 101.434118] CPU: 5 PID: 892 Comm: bash Not tainted 5.19.0-rc1+ torvalds#35 [ 101.434120] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 101.434130] Call Trace: [ 101.434133] <TASK> [ 101.434135] dump_stack_lvl+0x57/0x7d [ 101.434145] __lock_acquire.cold+0x1df/0x3e7 [ 101.434151] ? register_lock_class+0x1880/0x1880 [ 101.434157] lock_acquire+0x1c1/0x550 [ 101.434160] ? probe_one+0x3c/0x690 [mlx5_core] [ 101.434229] ? lockdep_hardirqs_on_prepare+0x400/0x400 [ 101.434232] ? __xa_alloc+0x1ed/0x2d0 [ 101.434236] ? ksys_write+0xf3/0x1d0 [ 101.434239] __mutex_lock+0x12c/0x14b0 [ 101.434243] ? probe_one+0x3c/0x690 [mlx5_core] [ 101.434312] ? probe_one+0x3c/0x690 [mlx5_core] [ 101.434380] ? devlink_alloc_ns+0x11b/0x910 [ 101.434385] ? mutex_lock_io_nested+0x1320/0x1320 [ 101.434388] ? lockdep_init_map_type+0x21a/0x7d0 [ 101.434391] ? lockdep_init_map_type+0x21a/0x7d0 [ 101.434393] ? __init_swait_queue_head+0x70/0xd0 [ 101.434397] probe_one+0x3c/0x690 [mlx5_core] [ 101.434467] pci_device_probe+0x1b4/0x480 [ 101.434471] really_probe+0x1e0/0xaa0 [ 101.434474] __driver_probe_device+0x219/0x480 [ 101.434478] driver_probe_device+0x49/0x130 [ 101.434481] __device_attach_driver+0x1b8/0x280 [ 101.434484] ? driver_allows_async_probing+0x140/0x140 [ 101.434487] bus_for_each_drv+0x123/0x1a0 [ 101.434489] ? bus_for_each_dev+0x1a0/0x1a0 [ 101.434491] ? lockdep_hardirqs_on_prepare+0x286/0x400 [ 101.434494] ? trace_hardirqs_on+0x2d/0x100 [ 101.434498] __device_attach+0x1a3/0x430 [ 101.434501] ? device_driver_attach+0x1e0/0x1e0 [ 101.434503] ? pci_bridge_d3_possible+0x1e0/0x1e0 [ 101.434506] ? pci_create_resource_files+0xeb/0x190 [ 101.434511] pci_bus_add_device+0x6c/0xa0 [ 101.434514] pci_iov_add_virtfn+0x9e4/0xe00 [ 101.434517] ? trace_hardirqs_on+0x2d/0x100 [ 101.434521] sriov_enable+0x64a/0xca0 [ 101.434524] ? pcibios_sriov_disable+0x10/0x10 [ 101.434528] mlx5_core_sriov_configure+0xab/0x280 [mlx5_core] [ 101.434602] sriov_numvfs_store+0x20a/0x310 [ 101.434605] ? sriov_totalvfs_show+0xc0/0xc0 [ 101.434608] ? sysfs_file_ops+0x170/0x170 [ 101.434611] ? sysfs_file_ops+0x117/0x170 [ 101.434614] ? sysfs_file_ops+0x170/0x170 [ 101.434616] kernfs_fop_write_iter+0x348/0x520 [ 101.434619] new_sync_write+0x2e5/0x520 [ 101.434621] ? new_sync_read+0x520/0x520 [ 101.434624] ? lock_acquire+0x1c1/0x550 [ 101.434626] ? lockdep_hardirqs_on_prepare+0x400/0x400 [ 101.434630] vfs_write+0x5cb/0x8d0 [ 101.434633] ksys_write+0xf3/0x1d0 [ 101.434635] ? __x64_sys_read+0xb0/0xb0 [ 101.434638] ? lockdep_hardirqs_on_prepare+0x286/0x400 [ 101.434640] ? syscall_enter_from_user_mode+0x1d/0x50 [ 101.434643] do_syscall_64+0x3d/0x90 [ 101.434647] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 101.434650] RIP: 0033:0x7f5ff536b2f7 [ 101.434658] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 101.434661] RSP: 002b:00007ffd9ea85d58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 101.434664] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5ff536b2f7 [ 101.434666] RDX: 0000000000000002 RSI: 000055c4c279e230 RDI: 0000000000000001 [ 101.434668] RBP: 000055c4c279e230 R08: 000000000000000a R09: 0000000000000001 [ 101.434669] R10: 000055c4c283cbf0 R11: 0000000000000246 R12: 0000000000000002 [ 101.434670] R13: 00007f5ff543d500 R14: 0000000000000002 R15: 00007f5ff543d700 [ 101.434673] </TASK> Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pchaigno
pushed a commit
that referenced
this pull request
Jul 25, 2022
Biju Das says: ==================== Add support for RZ/N1 SJA1000 CAN controller This patch series aims to add support for RZ/N1 SJA1000 CAN controller. The SJA1000 CAN controller on RZ/N1 SoC has some differences compared to others like it has no clock divider register (CDR) support and it has no HW loopback (HW doesn't see tx messages on rx), so introduced a new compatible 'renesas,rzn1-sja1000' to handle these differences. v3->v4: * Updated bindings as per coding style used in example-schema. * Entire entry in properties compatible declared as enum. Also Descriptions do not bring any information,so removed it from compatible description. * Used decimal values in nxp,tx-output-mode enums. * Fixed indentaions in binding examples. * Removed clock-names from bindings, as it is single clock. * Optimized the code as per Vincent's suggestion. * Updated clock handling as per bindings. v2->v3: * Added reg-io-width is a required property for technologic,sja1000 & renesas,rzn1-sja1000 * Removed enum type from nxp,tx-output-config and updated the description for combination of TX0 and TX1. * Updated the example for technologic,sja1000 v1->v2: * Moved $ref: can-controller.yaml# to top along with if conditional to avoid multiple mapping issues with the if conditional in the subsequent patch. * Added an example for RZ/N1D SJA1000 usage. * Updated commit description for patch#2,#3 and #6 * Removed the quirk macro SJA1000_NO_HW_LOOPBACK_QUIRK * Added prefix SJA1000_QUIRK_* for quirk macro. * Replaced of_device_get_match_data->device_get_match_data. * Added error handling on clk error path * Started using "devm_clk_get_optional_enabled" for clk get,prepare and enable. Ref: [1] https://lore.kernel.org/linux-renesas-soc/20220701162320.102165-1-biju.das.jz@bp.renesas.com/T/#t ==================== Link: https://lore.kernel.org/all/20220710115248.190280-1-biju.das.jz@bp.renesas.com [mkl: applying patches 1...5 only, as 6 depends devm_clk_get_optional_enabled(), which is not in net-next/master, yet] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
aspsk
pushed a commit
that referenced
this pull request
Nov 30, 2022
…le recovery Commit 26f3a02 ("ath11k: allocate smaller chunks of memory for firmware") and commit f6f9296 ("ath11k: qmi: try to allocate a big block of DMA memory first") change ath11k to allocate the memory chunks for target twice while wlan load. It fails for the 1st time because of large memory and then changed to allocate many small chunks for the 2nd time sometimes as below log. 1st time failed: [10411.640620] ath11k_pci 0000:05:00.0: qmi firmware request memory request [10411.640625] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 6881280 [10411.640630] ath11k_pci 0000:05:00.0: qmi mem seg type 4 size 3784704 [10411.640658] ath11k_pci 0000:05:00.0: qmi dma allocation failed (6881280 B type 1), will try later with small size [10411.640671] ath11k_pci 0000:05:00.0: qmi delays mem_request 2 [10411.640677] ath11k_pci 0000:05:00.0: qmi respond memory request delayed 1 2nd time success: [10411.642004] ath11k_pci 0000:05:00.0: qmi firmware request memory request [10411.642008] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 524288 [10411.642012] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 524288 [10411.642014] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 524288 [10411.642016] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 524288 [10411.642018] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 524288 [10411.642020] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 524288 [10411.642022] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 524288 [10411.642024] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 524288 [10411.642027] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 524288 [10411.642029] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 524288 [10411.642031] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 458752 [10411.642033] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 131072 [10411.642035] ath11k_pci 0000:05:00.0: qmi mem seg type 4 size 524288 [10411.642037] ath11k_pci 0000:05:00.0: qmi mem seg type 4 size 524288 [10411.642039] ath11k_pci 0000:05:00.0: qmi mem seg type 4 size 524288 [10411.642041] ath11k_pci 0000:05:00.0: qmi mem seg type 4 size 524288 [10411.642043] ath11k_pci 0000:05:00.0: qmi mem seg type 4 size 524288 [10411.642045] ath11k_pci 0000:05:00.0: qmi mem seg type 4 size 524288 [10411.642047] ath11k_pci 0000:05:00.0: qmi mem seg type 4 size 491520 [10411.642049] ath11k_pci 0000:05:00.0: qmi mem seg type 1 size 524288 And then commit 5962f37 ("ath11k: Reuse the available memory after firmware reload") skip the ath11k_qmi_free_resource() which frees the memory chunks while recovery, after that, when run recovery test on WCN6855, a warning happened every time as below and finally leads fail for recovery. [ 159.570318] BUG: Bad page state in process kworker/u16:5 pfn:33300 [ 159.570320] page:0000000096ffdbb9 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x33300 [ 159.570324] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff) [ 159.570329] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000 [ 159.570332] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 [ 159.570334] page dumped because: nonzero _refcount [ 159.570440] firewire_ohci syscopyarea sysfillrect psmouse sdhci_pci ahci sysimgblt firewire_core fb_sys_fops libahci crc_itu_t cqhci drm sdhci e1000e wmi video [ 159.570460] CPU: 2 PID: 217 Comm: kworker/u16:5 Kdump: loaded Tainted: G B 5.19.0-rc1-wt-ath+ #3 [ 159.570465] Hardware name: LENOVO 418065C/418065C, BIOS 83ET63WW (1.33 ) 07/29/2011 [ 159.570467] Workqueue: qmi_msg_handler qmi_data_ready_work [qmi_helpers] [ 159.570475] Call Trace: [ 159.570476] <TASK> [ 159.570478] dump_stack_lvl+0x49/0x5f [ 159.570486] dump_stack+0x10/0x12 [ 159.570493] bad_page+0xab/0xf0 [ 159.570502] check_free_page_bad+0x66/0x70 [ 159.570511] __free_pages_ok+0x530/0x9a0 [ 159.570517] ? __dev_printk+0x58/0x6b [ 159.570525] ? _dev_printk+0x56/0x72 [ 159.570534] ? qmi_decode+0x119/0x470 [qmi_helpers] [ 159.570543] __free_pages+0x91/0xd0 [ 159.570548] dma_free_contiguous+0x50/0x60 [ 159.570556] dma_direct_free+0xe5/0x140 [ 159.570564] dma_free_attrs+0x35/0x50 [ 159.570570] ath11k_qmi_msg_mem_request_cb+0x2ae/0x3c0 [ath11k] [ 159.570620] qmi_invoke_handler+0xac/0xe0 [qmi_helpers] [ 159.570630] qmi_handle_message+0x6d/0x180 [qmi_helpers] [ 159.570643] qmi_data_ready_work+0x2ca/0x440 [qmi_helpers] [ 159.570656] process_one_work+0x227/0x440 [ 159.570667] worker_thread+0x31/0x3d0 [ 159.570676] ? process_one_work+0x440/0x440 [ 159.570685] kthread+0xfe/0x130 [ 159.570692] ? kthread_complete_and_exit+0x20/0x20 [ 159.570701] ret_from_fork+0x22/0x30 [ 159.570712] </TASK> The reason is because when wlan start to recovery, the type, size and count is not same for the 1st and 2nd QMI_WLFW_REQUEST_MEM_IND message, Then it leads the parameter size is not correct for the dma_free_coherent(). For the chunk[1], the actual dma size is 524288 which allocate in the 2nd time of the initial wlan load phase, and the size which pass to dma_free_coherent() is 3784704 which is got in the 1st time of recovery phase, then warning above happened. Change to use prev_size of struct target_mem_chunk for the paramter of dma_free_coherent() since prev_size is the real size of last load/recovery. Also change to check both type and size of struct target_mem_chunk to reuse the memory to avoid mismatch buffer size for target. Then the warning disappear and recovery success. When the 1st QMI_WLFW_REQUEST_MEM_IND for recovery arrived, the trunk[0] is freed in ath11k_qmi_alloc_target_mem_chunk() and then dma_alloc_coherent() failed caused by large size, and then trunk[1] is freed in ath11k_qmi_free_target_mem_chunk(), the left 18 trunks will be reuse for the 2nd QMI_WLFW_REQUEST_MEM_IND message. Tested-on: WCN6855 hw2.0 PCI WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3 Fixes: 5962f37 ("ath11k: Reuse the available memory after firmware reload") Signed-off-by: Wen Gong <quic_wgong@quicinc.com> Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com> Link: https://lore.kernel.org/r/20220928073832.16251-1-quic_wgong@quicinc.com
aspsk
pushed a commit
that referenced
this pull request
Nov 30, 2022
Syzbot reported the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Not tainted ------------------------------------------------------ syz-executor307/3029 is trying to acquire lock: ffff0000c02525d8 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x54/0xb4 mm/memory.c:5576 but task is already holding lock: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (btrfs-root-00){++++}-{3:3}: down_read_nested+0x64/0x84 kernel/locking/rwsem.c:1624 __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 btrfs_search_slot_get_root+0x74/0x338 fs/btrfs/ctree.c:1637 btrfs_search_slot+0x1b0/0xfd8 fs/btrfs/ctree.c:1944 btrfs_update_root+0x6c/0x5a0 fs/btrfs/root-tree.c:132 commit_fs_roots+0x1f0/0x33c fs/btrfs/transaction.c:1459 btrfs_commit_transaction+0x89c/0x12d8 fs/btrfs/transaction.c:2343 flush_space+0x66c/0x738 fs/btrfs/space-info.c:786 btrfs_async_reclaim_metadata_space+0x43c/0x4e0 fs/btrfs/space-info.c:1059 process_one_work+0x2d8/0x504 kernel/workqueue.c:2289 worker_thread+0x340/0x610 kernel/workqueue.c:2436 kthread+0x12c/0x158 kernel/kthread.c:376 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860 -> #2 (&fs_info->reloc_mutex){+.+.}-{3:3}: __mutex_lock_common+0xd4/0xca8 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x38/0x44 kernel/locking/mutex.c:799 btrfs_record_root_in_trans fs/btrfs/transaction.c:516 [inline] start_transaction+0x248/0x944 fs/btrfs/transaction.c:752 btrfs_start_transaction+0x34/0x44 fs/btrfs/transaction.c:781 btrfs_create_common+0xf0/0x1b4 fs/btrfs/inode.c:6651 btrfs_create+0x8c/0xb0 fs/btrfs/inode.c:6697 lookup_open fs/namei.c:3413 [inline] open_last_lookups fs/namei.c:3481 [inline] path_openat+0x804/0x11c4 fs/namei.c:3688 do_filp_open+0xdc/0x1b8 fs/namei.c:3718 do_sys_openat2+0xb8/0x22c fs/open.c:1313 do_sys_open fs/open.c:1329 [inline] __do_sys_openat fs/open.c:1345 [inline] __se_sys_openat fs/open.c:1340 [inline] __arm64_sys_openat+0xb0/0xe0 fs/open.c:1340 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #1 (sb_internal#2){.+.+}-{0:0}: percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1826 [inline] sb_start_intwrite include/linux/fs.h:1948 [inline] start_transaction+0x360/0x944 fs/btrfs/transaction.c:683 btrfs_join_transaction+0x30/0x40 fs/btrfs/transaction.c:795 btrfs_dirty_inode+0x50/0x140 fs/btrfs/inode.c:6103 btrfs_update_time+0x1c0/0x1e8 fs/btrfs/inode.c:6145 inode_update_time fs/inode.c:1872 [inline] touch_atime+0x1f0/0x4a8 fs/inode.c:1945 file_accessed include/linux/fs.h:2516 [inline] btrfs_file_mmap+0x50/0x88 fs/btrfs/file.c:2407 call_mmap include/linux/fs.h:2192 [inline] mmap_region+0x7fc/0xc14 mm/mmap.c:1752 do_mmap+0x644/0x97c mm/mmap.c:1540 vm_mmap_pgoff+0xe8/0x1d0 mm/util.c:552 ksys_mmap_pgoff+0x1cc/0x278 mm/mmap.c:1586 __do_sys_mmap arch/arm64/kernel/sys.c:28 [inline] __se_sys_mmap arch/arm64/kernel/sys.c:21 [inline] __arm64_sys_mmap+0x58/0x6c arch/arm64/kernel/sys.c:21 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #0 (&mm->mmap_lock){++++}-{3:3}: check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 other info that might help us debug this: Chain exists of: &mm->mmap_lock --> &fs_info->reloc_mutex --> btrfs-root-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-root-00); lock(&fs_info->reloc_mutex); lock(btrfs-root-00); lock(&mm->mmap_lock); *** DEADLOCK *** 1 lock held by syz-executor307/3029: #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 stack backtrace: CPU: 0 PID: 3029 Comm: syz-executor307 Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022 Call trace: dump_backtrace+0x1c4/0x1f0 arch/arm64/kernel/stacktrace.c:156 show_stack+0x2c/0x54 arch/arm64/kernel/stacktrace.c:163 __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x104/0x16c lib/dump_stack.c:106 dump_stack+0x1c/0x58 lib/dump_stack.c:113 print_circular_bug+0x2c4/0x2c8 kernel/locking/lockdep.c:2053 check_noncircular+0x14c/0x154 kernel/locking/lockdep.c:2175 check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 We do generally the right thing here, copying the references into a temporary buffer, however we are still holding the path when we do copy_to_user from the temporary buffer. Fix this by freeing the path before we copy to user space. Reported-by: syzbot+4ef9e52e464c6ff47d9d@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
aspsk
pushed a commit
that referenced
this pull request
Nov 30, 2022
ldev->lock is used to serialize lag change operations. Since multiport eswtich functionality was added, we now change the mode dynamically. However, acquiring ldev->lock is not allowed as it could possibly lead to a deadlock as reported by the lockdep mechanism. [ 836.154963] WARNING: possible circular locking dependency detected [ 836.155850] 5.19.0-rc5_net_56b7df2 #1 Not tainted [ 836.156549] ------------------------------------------------------ [ 836.157418] handler1/12198 is trying to acquire lock: [ 836.158178] ffff888187d52b58 (&ldev->lock){+.+.}-{3:3}, at: mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.159575] [ 836.159575] but task is already holding lock: [ 836.160474] ffff8881d4de2930 (&block->cb_lock){++++}-{3:3}, at: tc_setup_cb_add+0x5b/0x200 [ 836.161669] which lock already depends on the new lock. [ 836.162905] [ 836.162905] the existing dependency chain (in reverse order) is: [ 836.164008] -> #3 (&block->cb_lock){++++}-{3:3}: [ 836.164946] down_write+0x25/0x60 [ 836.165548] tcf_block_get_ext+0x1c6/0x5d0 [ 836.166253] ingress_init+0x74/0xa0 [sch_ingress] [ 836.167028] qdisc_create.constprop.0+0x130/0x5e0 [ 836.167805] tc_modify_qdisc+0x481/0x9f0 [ 836.168490] rtnetlink_rcv_msg+0x16e/0x5a0 [ 836.169189] netlink_rcv_skb+0x4e/0xf0 [ 836.169861] netlink_unicast+0x190/0x250 [ 836.170543] netlink_sendmsg+0x243/0x4b0 [ 836.171226] sock_sendmsg+0x33/0x40 [ 836.171860] ____sys_sendmsg+0x1d1/0x1f0 [ 836.172535] ___sys_sendmsg+0xab/0xf0 [ 836.173183] __sys_sendmsg+0x51/0x90 [ 836.173836] do_syscall_64+0x3d/0x90 [ 836.174471] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 836.175282] [ 836.175282] -> #2 (rtnl_mutex){+.+.}-{3:3}: [ 836.176190] __mutex_lock+0x6b/0xf80 [ 836.176830] register_netdevice_notifier+0x21/0x120 [ 836.177631] rtnetlink_init+0x2d/0x1e9 [ 836.178289] netlink_proto_init+0x163/0x179 [ 836.178994] do_one_initcall+0x63/0x300 [ 836.179672] kernel_init_freeable+0x2cb/0x31b [ 836.180403] kernel_init+0x17/0x140 [ 836.181035] ret_from_fork+0x1f/0x30 [ 836.181687] -> #1 (pernet_ops_rwsem){+.+.}-{3:3}: [ 836.182628] down_write+0x25/0x60 [ 836.183235] unregister_netdevice_notifier+0x1c/0xb0 [ 836.184029] mlx5_ib_roce_cleanup+0x94/0x120 [mlx5_ib] [ 836.184855] __mlx5_ib_remove+0x35/0x60 [mlx5_ib] [ 836.185637] mlx5_eswitch_unregister_vport_reps+0x22f/0x440 [mlx5_core] [ 836.186698] auxiliary_bus_remove+0x18/0x30 [ 836.187409] device_release_driver_internal+0x1f6/0x270 [ 836.188253] bus_remove_device+0xef/0x160 [ 836.188939] device_del+0x18b/0x3f0 [ 836.189562] mlx5_rescan_drivers_locked+0xd6/0x2d0 [mlx5_core] [ 836.190516] mlx5_lag_remove_devices+0x69/0xe0 [mlx5_core] [ 836.191414] mlx5_do_bond_work+0x441/0x620 [mlx5_core] [ 836.192278] process_one_work+0x25c/0x590 [ 836.192963] worker_thread+0x4f/0x3d0 [ 836.193609] kthread+0xcb/0xf0 [ 836.194189] ret_from_fork+0x1f/0x30 [ 836.194826] -> #0 (&ldev->lock){+.+.}-{3:3}: [ 836.195734] __lock_acquire+0x15b8/0x2a10 [ 836.196426] lock_acquire+0xce/0x2d0 [ 836.197057] __mutex_lock+0x6b/0xf80 [ 836.197708] mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.198575] tc_act_parse_mirred+0x25b/0x800 [mlx5_core] [ 836.199467] parse_tc_actions+0x168/0x5a0 [mlx5_core] [ 836.200340] __mlx5e_add_fdb_flow+0x263/0x480 [mlx5_core] [ 836.201241] mlx5e_configure_flower+0x8a0/0x1820 [mlx5_core] [ 836.202187] tc_setup_cb_add+0xd7/0x200 [ 836.202856] fl_hw_replace_filter+0x14c/0x1f0 [cls_flower] [ 836.203739] fl_change+0xbbe/0x1730 [cls_flower] [ 836.204501] tc_new_tfilter+0x407/0xd90 [ 836.205168] rtnetlink_rcv_msg+0x406/0x5a0 [ 836.205877] netlink_rcv_skb+0x4e/0xf0 [ 836.206535] netlink_unicast+0x190/0x250 [ 836.207217] netlink_sendmsg+0x243/0x4b0 [ 836.207915] sock_sendmsg+0x33/0x40 [ 836.208538] ____sys_sendmsg+0x1d1/0x1f0 [ 836.209219] ___sys_sendmsg+0xab/0xf0 [ 836.209878] __sys_sendmsg+0x51/0x90 [ 836.210510] do_syscall_64+0x3d/0x90 [ 836.211137] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 836.211954] other info that might help us debug this: [ 836.213174] Chain exists of: [ 836.213174] &ldev->lock --> rtnl_mutex --> &block->cb_lock 836.214650] Possible unsafe locking scenario: [ 836.214650] [ 836.215574] CPU0 CPU1 [ 836.216255] ---- ---- [ 836.216943] lock(&block->cb_lock); [ 836.217518] lock(rtnl_mutex); [ 836.218348] lock(&block->cb_lock); [ 836.219212] lock(&ldev->lock); [ 836.219758] [ 836.219758] *** DEADLOCK *** [ 836.219758] [ 836.220747] 2 locks held by handler1/12198: [ 836.221390] #0: ffff8881d4de2930 (&block->cb_lock){++++}-{3:3}, at: tc_setup_cb_add+0x5b/0x200 [ 836.222646] #1: ffff88810c9a92c0 (&esw->mode_lock){++++}-{3:3}, at: mlx5_esw_hold+0x39/0x50 [mlx5_core] [ 836.224063] stack backtrace: [ 836.224799] CPU: 6 PID: 12198 Comm: handler1 Not tainted 5.19.0-rc5_net_56b7df2 #1 [ 836.225923] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 836.227476] Call Trace: [ 836.227929] <TASK> [ 836.228332] dump_stack_lvl+0x57/0x7d [ 836.228924] check_noncircular+0x104/0x120 [ 836.229562] __lock_acquire+0x15b8/0x2a10 [ 836.230201] lock_acquire+0xce/0x2d0 [ 836.230776] ? mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.231614] ? find_held_lock+0x2b/0x80 [ 836.232221] __mutex_lock+0x6b/0xf80 [ 836.232799] ? mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.233636] ? mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.234451] ? xa_load+0xc3/0x190 [ 836.234995] mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.235803] tc_act_parse_mirred+0x25b/0x800 [mlx5_core] [ 836.236636] ? tc_act_can_offload_mirred+0x135/0x210 [mlx5_core] [ 836.237550] parse_tc_actions+0x168/0x5a0 [mlx5_core] [ 836.238364] __mlx5e_add_fdb_flow+0x263/0x480 [mlx5_core] [ 836.239202] mlx5e_configure_flower+0x8a0/0x1820 [mlx5_core] [ 836.240076] ? lock_acquire+0xce/0x2d0 [ 836.240668] ? tc_setup_cb_add+0x5b/0x200 [ 836.241294] tc_setup_cb_add+0xd7/0x200 [ 836.241917] fl_hw_replace_filter+0x14c/0x1f0 [cls_flower] [ 836.242709] fl_change+0xbbe/0x1730 [cls_flower] [ 836.243408] tc_new_tfilter+0x407/0xd90 [ 836.244043] ? tc_del_tfilter+0x880/0x880 [ 836.244672] rtnetlink_rcv_msg+0x406/0x5a0 [ 836.245310] ? netlink_deliver_tap+0x7a/0x4b0 [ 836.245991] ? if_nlmsg_stats_size+0x2b0/0x2b0 [ 836.246675] netlink_rcv_skb+0x4e/0xf0 [ 836.258046] netlink_unicast+0x190/0x250 [ 836.258669] netlink_sendmsg+0x243/0x4b0 [ 836.259288] sock_sendmsg+0x33/0x40 [ 836.259857] ____sys_sendmsg+0x1d1/0x1f0 [ 836.260473] ___sys_sendmsg+0xab/0xf0 [ 836.261064] ? lock_acquire+0xce/0x2d0 [ 836.261669] ? find_held_lock+0x2b/0x80 [ 836.262272] ? __fget_files+0xb9/0x190 [ 836.262871] ? __fget_files+0xd3/0x190 [ 836.263462] __sys_sendmsg+0x51/0x90 [ 836.264064] do_syscall_64+0x3d/0x90 [ 836.264652] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 836.265425] RIP: 0033:0x7fdbe5e2677d [ 836.266012] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 ba ee ff ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 ee ee ff ff 48 [ 836.268485] RSP: 002b:00007fdbe48a75a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e [ 836.269598] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fdbe5e2677d [ 836.270576] RDX: 0000000000000000 RSI: 00007fdbe48a7640 RDI: 000000000000003c [ 836.271565] RBP: 00007fdbe48a8368 R08: 0000000000000000 R09: 0000000000000000 [ 836.272546] R10: 00007fdbe48a84b0 R11: 0000000000000293 R12: 0000557bd17dc860 [ 836.273527] R13: 0000000000000000 R14: 0000557bd17dc860 R15: 00007fdbe48a7640 [ 836.274521] </TASK> To avoid using mode holding ldev->lock in the configure flow, we queue a work to the lag workqueue and cease wait on a completion object. In addition, we remove the lock from mlx5_lag_do_mirred() since it is not really protecting anything. It should be noted that an actual deadlock has not been observed. Signed-off-by: Eli Cohen <elic@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
aspsk
pushed a commit
that referenced
this pull request
Nov 30, 2022
Matt reported a splat at msk close time: BUG: sleeping function called from invalid context at net/mptcp/protocol.c:2877 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 155, name: packetdrill preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 4 locks held by packetdrill/155: #0: ffff888001536990 (&sb->s_type->i_mutex_key#6){+.+.}-{3:3}, at: __sock_release (net/socket.c:650) #1: ffff88800b498130 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close (net/mptcp/protocol.c:2973) #2: ffff88800b49a130 (sk_lock-AF_INET/1){+.+.}-{0:0}, at: __mptcp_close_ssk (net/mptcp/protocol.c:2363) #3: ffff88800b49a0b0 (slock-AF_INET){+...}-{2:2}, at: __lock_sock_fast (include/net/sock.h:1820) Preemption disabled at: 0x0 CPU: 1 PID: 155 Comm: packetdrill Not tainted 6.1.0-rc5 torvalds#365 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4)) __might_resched.cold (kernel/sched/core.c:9891) __mptcp_destroy_sock (include/linux/kernel.h:110) __mptcp_close (net/mptcp/protocol.c:2959) mptcp_subflow_queue_clean (include/net/sock.h:1777) __mptcp_close_ssk (net/mptcp/protocol.c:2363) mptcp_destroy_common (net/mptcp/protocol.c:3170) mptcp_destroy (include/net/sock.h:1495) __mptcp_destroy_sock (net/mptcp/protocol.c:2886) __mptcp_close (net/mptcp/protocol.c:2959) mptcp_close (net/mptcp/protocol.c:2974) inet_release (net/ipv4/af_inet.c:432) __sock_release (net/socket.c:651) sock_close (net/socket.c:1367) __fput (fs/file_table.c:320) task_work_run (kernel/task_work.c:181 (discriminator 1)) exit_to_user_mode_prepare (include/linux/resume_user_mode.h:49) syscall_exit_to_user_mode (kernel/entry/common.c:130) do_syscall_64 (arch/x86/entry/common.c:87) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) We can't call mptcp_close under the 'fast' socket lock variant, replace it with a sock_lock_nested() as the relevant code is already under the listening msk socket lock protection. Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Closes: multipath-tcp/mptcp_net-next#316 Fixes: 30e51b9 ("mptcp: fix unreleased socket in accept queue") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Apr 30, 2023
Andrii Nakryiko says: ==================== This patch set moves bpf_for(), bpf_for_each(), and bpf_repeat() macros from selftests-internal bpf_misc.h header to libbpf-provided bpf_helpers.h header. To do this in a way to allow users to feature-detect and guard such bpf_for()/bpf_for_each() uses on old kernels we also extend libbpf to improve unresolved kfunc calls handling and reporting. This lets us mark bpf_iter_num_{new,next,destroy}() declarations as __weak, and thus not fail program loading outright if such kfuncs are missing on the host kernel. Patches #1 and #2 do some simple clean ups and logging improvements. Patch #3 adds kfunc call poisoning and log fixup logic and is the hear of this patch set, effectively. Patch #4 adds selftest for this logic. Patches #4 and #5 move bpf_for()/bpf_for_each()/bpf_repeat() into bpf_helpers.h header and mark kfuncs as __weak to allow users to feature-detect and guard their uses. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
lmb
pushed a commit
that referenced
this pull request
May 23, 2023
When the dwc3 device is runtime suspended, various required clocks are in disabled state and it is not guaranteed that access to any registers would work. Depending on the SoC glue, a register read could be as benign as returning 0 or be fatal enough to hang the system. In order to prevent such scenarios of fatal errors, make sure to resume dwc3 then allow the function to proceed. Fixes: 72246da ("usb: Introduce DesignWare USB3 DRD Driver") Cc: stable@vger.kernel.org #3.2: 30332ee: debugfs: regset32: Add Runtime PM support Signed-off-by: Udipto Goswami <quic_ugoswami@quicinc.com> Reviewed-by: Johan Hovold <johan+linaro@kernel.org> Tested-by: Johan Hovold <johan+linaro@kernel.org> Acked-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com> Link: https://lore.kernel.org/r/20230509144836.6803-1-quic_ugoswami@quicinc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
lmb
pushed a commit
that referenced
this pull request
May 23, 2023
In the function ieee80211_tx_dequeue() there is a particular locking sequence: begin: spin_lock(&local->queue_stop_reason_lock); q_stopped = local->queue_stop_reasons[q]; spin_unlock(&local->queue_stop_reason_lock); However small the chance (increased by ftracetest), an asynchronous interrupt can occur in between of spin_lock() and spin_unlock(), and the interrupt routine will attempt to lock the same &local->queue_stop_reason_lock again. This will cause a costly reset of the CPU and the wifi device or an altogether hang in the single CPU and single core scenario. The only remaining spin_lock(&local->queue_stop_reason_lock) that did not disable interrupts was patched, which should prevent any deadlocks on the same CPU/core and the same wifi device. This is the probable trace of the deadlock: kernel: ================================ kernel: WARNING: inconsistent lock state kernel: 6.3.0-rc6-mt-20230401-00001-gf86822a1170f #4 Tainted: G W kernel: -------------------------------- kernel: inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. kernel: kworker/5:0/25656 [HC0[0]:SC0[0]:HE1:SE1] takes: kernel: ffff9d6190779478 (&local->queue_stop_reason_lock){+.?.}-{2:2}, at: return_to_handler+0x0/0x40 kernel: {IN-SOFTIRQ-W} state was registered at: kernel: lock_acquire+0xc7/0x2d0 kernel: _raw_spin_lock+0x36/0x50 kernel: ieee80211_tx_dequeue+0xb4/0x1330 [mac80211] kernel: iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm] kernel: iwl_mvm_mac_wake_tx_queue+0x2d/0xd0 [iwlmvm] kernel: ieee80211_queue_skb+0x450/0x730 [mac80211] kernel: __ieee80211_xmit_fast.constprop.66+0x834/0xa50 [mac80211] kernel: __ieee80211_subif_start_xmit+0x217/0x530 [mac80211] kernel: ieee80211_subif_start_xmit+0x60/0x580 [mac80211] kernel: dev_hard_start_xmit+0xb5/0x260 kernel: __dev_queue_xmit+0xdbe/0x1200 kernel: neigh_resolve_output+0x166/0x260 kernel: ip_finish_output2+0x216/0xb80 kernel: __ip_finish_output+0x2a4/0x4d0 kernel: ip_finish_output+0x2d/0xd0 kernel: ip_output+0x82/0x2b0 kernel: ip_local_out+0xec/0x110 kernel: igmpv3_sendpack+0x5c/0x90 kernel: igmp_ifc_timer_expire+0x26e/0x4e0 kernel: call_timer_fn+0xa5/0x230 kernel: run_timer_softirq+0x27f/0x550 kernel: __do_softirq+0xb4/0x3a4 kernel: irq_exit_rcu+0x9b/0xc0 kernel: sysvec_apic_timer_interrupt+0x80/0xa0 kernel: asm_sysvec_apic_timer_interrupt+0x1f/0x30 kernel: _raw_spin_unlock_irqrestore+0x3f/0x70 kernel: free_to_partial_list+0x3d6/0x590 kernel: __slab_free+0x1b7/0x310 kernel: kmem_cache_free+0x52d/0x550 kernel: putname+0x5d/0x70 kernel: do_sys_openat2+0x1d7/0x310 kernel: do_sys_open+0x51/0x80 kernel: __x64_sys_openat+0x24/0x30 kernel: do_syscall_64+0x5c/0x90 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc kernel: irq event stamp: 5120729 kernel: hardirqs last enabled at (5120729): [<ffffffff9d149936>] trace_graph_return+0xd6/0x120 kernel: hardirqs last disabled at (5120728): [<ffffffff9d149950>] trace_graph_return+0xf0/0x120 kernel: softirqs last enabled at (5069900): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40 kernel: softirqs last disabled at (5067555): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40 kernel: other info that might help us debug this: kernel: Possible unsafe locking scenario: kernel: CPU0 kernel: ---- kernel: lock(&local->queue_stop_reason_lock); kernel: <Interrupt> kernel: lock(&local->queue_stop_reason_lock); kernel: *** DEADLOCK *** kernel: 8 locks held by kworker/5:0/25656: kernel: #0: ffff9d618009d138 ((wq_completion)events_freezable){+.+.}-{0:0}, at: process_one_work+0x1ca/0x530 kernel: #1: ffffb1ef4637fe68 ((work_completion)(&local->restart_work)){+.+.}-{0:0}, at: process_one_work+0x1ce/0x530 kernel: #2: ffffffff9f166548 (rtnl_mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #3: ffff9d6190778728 (&rdev->wiphy.mtx){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #4: ffff9d619077b480 (&mvm->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #5: ffff9d61907bacd8 (&trans_pcie->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #6: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_queue_state_change+0x59/0x3a0 [iwlmvm] kernel: #7: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm] kernel: stack backtrace: kernel: CPU: 5 PID: 25656 Comm: kworker/5:0 Tainted: G W 6.3.0-rc6-mt-20230401-00001-gf86822a1170f #4 kernel: Hardware name: LENOVO 82H8/LNVNB161216, BIOS GGCN51WW 11/16/2022 kernel: Workqueue: events_freezable ieee80211_restart_work [mac80211] kernel: Call Trace: kernel: <TASK> kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: dump_stack_lvl+0x5f/0xa0 kernel: dump_stack+0x14/0x20 kernel: print_usage_bug.part.46+0x208/0x2a0 kernel: mark_lock.part.47+0x605/0x630 kernel: ? sched_clock+0xd/0x20 kernel: ? trace_clock_local+0x14/0x30 kernel: ? __rb_reserve_next+0x5f/0x490 kernel: ? _raw_spin_lock+0x1b/0x50 kernel: __lock_acquire+0x464/0x1990 kernel: ? mark_held_locks+0x4e/0x80 kernel: lock_acquire+0xc7/0x2d0 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ? ftrace_return_to_handler+0x8b/0x100 kernel: ? preempt_count_add+0x4/0x70 kernel: _raw_spin_lock+0x36/0x50 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_tx_dequeue+0xb4/0x1330 [mac80211] kernel: ? prepare_ftrace_return+0xc5/0x190 kernel: ? ftrace_graph_func+0x16/0x20 kernel: ? 0xffffffffc02ab0b1 kernel: ? lock_acquire+0xc7/0x2d0 kernel: ? iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm] kernel: ? ieee80211_tx_dequeue+0x9/0x1330 [mac80211] kernel: ? __rcu_read_lock+0x4/0x40 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_queue_state_change+0x311/0x3a0 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_wake_sw_queue+0x17/0x20 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_unmap+0x1c9/0x1f0 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_free+0x55/0x130 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_tx_free+0x63/0x80 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: _iwl_trans_pcie_gen2_stop_device+0x3f3/0x5b0 [iwlwifi] kernel: ? _iwl_trans_pcie_gen2_stop_device+0x9/0x5b0 [iwlwifi] kernel: ? mutex_lock_nested+0x4/0x30 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_trans_pcie_gen2_stop_device+0x5f/0x90 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_stop_device+0x78/0xd0 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: __iwl_mvm_mac_start+0x114/0x210 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_mac_start+0x76/0x150 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: drv_start+0x79/0x180 [mac80211] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_reconfig+0x1523/0x1ce0 [mac80211] kernel: ? synchronize_net+0x4/0x50 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_restart_work+0x108/0x170 [mac80211] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: process_one_work+0x250/0x530 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: worker_thread+0x48/0x3a0 kernel: ? __pfx_worker_thread+0x10/0x10 kernel: kthread+0x10f/0x140 kernel: ? __pfx_kthread+0x10/0x10 kernel: ret_from_fork+0x29/0x50 kernel: </TASK> Fixes: 4444bc2 ("wifi: mac80211: Proper mark iTXQs for resumption") Link: https://lore.kernel.org/all/1f58a0d1-d2b9-d851-73c3-93fcc607501c@alu.unizg.hr/ Reported-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr> Cc: Gregory Greenman <gregory.greenman@intel.com> Cc: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/all/cdc80531-f25f-6f9d-b15f-25e16130b53a@alu.unizg.hr/ Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Alexander Wetzel <alexander@wetzel-home.de> Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: tag, or it goes automatically? Link: https://lore.kernel.org/r/20230425164005.25272-1-mirsad.todorovac@alu.unizg.hr Signed-off-by: Johannes Berg <johannes.berg@intel.com>
lmb
pushed a commit
that referenced
this pull request
May 23, 2023
Florian Fainelli says: ==================== Support for Wake-on-LAN for Broadcom PHYs This patch series adds support for Wake-on-LAN to the Broadcom PHY driver. Specifically the BCM54210E/B50212E are capable of supporting Wake-on-LAN using an external pin typically wired up to a system's GPIO. These PHY operate a programmable Ethernet MAC destination address comparator which will fire up an interrupt whenever a match is received. Because of that, it was necessary to introduce patch #1 which allows the PHY driver's ->suspend() routine to be called unconditionally. This is necessary in our case because we need a hook point into the device suspend/resume flow to enable the wake-up interrupt as late as possible. Patch #2 adds support for the Broadcom PHY library and driver for Wake-on-LAN proper with the WAKE_UCAST, WAKE_MCAST, WAKE_BCAST, WAKE_MAGIC and WAKE_MAGICSECURE. Note that WAKE_FILTER is supportable, however this will require further discussions and be submitted as a RFC series later on. Patch #3 updates the GENET driver to defer to the PHY for Wake-on-LAN if the PHY supports it, thus allowing the MAC to be powered down to conserve power. Changes in v3: - collected Reviewed-by tags - explicitly use return 0 in bcm54xx_phy_probe() (Paolo) Changes in v2: - introduce PHY_ALWAYS_CALL_SUSPEND and only have the Broadcom PHY driver set this flag to minimize changes to the suspend flow to only drivers that need it - corrected possibly uninitialized variable in bcm54xx_set_wakeup_irq (Simon) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
lmb
pushed a commit
that referenced
this pull request
Jun 20, 2023
The cited commit adds a compeletion to remove dependency on rtnl lock. But it causes a deadlock for multiple encapsulations: crash> bt ffff8aece8a64000 PID: 1514557 TASK: ffff8aece8a64000 CPU: 3 COMMAND: "tc" #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418 #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898 #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8 #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core] #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core] #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core] #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core] #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core] #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core] #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core] #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8 torvalds#13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower] torvalds#14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower] torvalds#15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047 torvalds#16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31 torvalds#17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853 torvalds#18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835 torvalds#19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27 torvalds#20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245 torvalds#21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482 torvalds#22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a torvalds#23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2 torvalds#24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2 torvalds#25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f torvalds#26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8 torvalds#27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c crash> bt 0xffff8aeb07544000 PID: 1110766 TASK: ffff8aeb07544000 CPU: 0 COMMAND: "kworker/u20:9" #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418 #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88 #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core] #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core] #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core] #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012 #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f After the first encap is attached, flow will be added to encap entry's flows list. If neigh update is running at this time, the following encaps of the flow can't hold the encap_tbl_lock and sleep. If neigh update thread is waiting for that flow's init_done, deadlock happens. Fix it by holding lock outside of the for loop. If neigh update is running, prevent encap flows from offloading. Since the lock is held outside of the for loop, concurrent creation of encap entries is not allowed. So remove unnecessary wait_for_completion call for res_ready. Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
lmb
pushed a commit
that referenced
this pull request
Jun 20, 2023
…kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.4, take #3 - Fix the reported address of a watchpoint forwarded to userspace - Fix the freeing of the root of stage-2 page tables - Stop creating spurious PMU events to perform detection of the default PMU and use the existing PMU list instead.
lmb
pushed a commit
that referenced
this pull request
Jun 20, 2023
Currently, the per cpu upcall counters are allocated after the vport is created and inserted into the system. This could lead to the datapath accessing the counters before they are allocated resulting in a kernel Oops. Here is an example: PID: 59693 TASK: ffff0005f4f51500 CPU: 0 COMMAND: "ovs-vswitchd" #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4 #1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc #2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60 #3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58 #4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388 #5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c #6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68 #7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch] ... PID: 58682 TASK: ffff0005b2f0bf00 CPU: 0 COMMAND: "kworker/0:3" #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758 #1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994 #2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8 #3 [ffff80000a5d3120] die at ffffb70f0628234c #4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8 #5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4 #6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4 #7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710 #8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74 #9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac #10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24 #11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc #12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch] torvalds#13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch] torvalds#14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch] torvalds#15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch] torvalds#16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch] torvalds#17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90 We moved the per cpu upcall counter allocation to the existing vport alloc and free functions to solve this. Fixes: 95637d9 ("net: openvswitch: release vport resources on failure") Fixes: 1933ea3 ("net: openvswitch: Add support to count upcall packets") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
lmb
pushed a commit
that referenced
this pull request
Jun 20, 2023
Eduard Zingerman says: ==================== Update regsafe() to use check_ids() for scalar values. Otherwise the following unsafe pattern is accepted by verifier: 1: r9 = ... some pointer with range X ... 2: r6 = ... unbound scalar ID=a ... 3: r7 = ... unbound scalar ID=b ... 4: if (r6 > r7) goto +1 5: r6 = r7 6: if (r6 > X) goto ... --- checkpoint --- 7: r9 += r7 8: *(u64 *)r9 = Y This example is unsafe because not all execution paths verify r7 range. Because of the jump at (4) the verifier would arrive at (6) in two states: I. r6{.id=b}, r7{.id=b} via path 1-6; II. r6{.id=a}, r7{.id=b} via path 1-4, 6. Currently regsafe() does not call check_ids() for scalar registers, thus from POV of regsafe() states (I) and (II) are identical. The change is split in two parts: - patches #1,2: update for mark_chain_precision() to propagate precision marks through scalar IDs. - patches #3,4: update for regsafe() to use a special version of check_ids() for precise scalar values. Changelog: - V5 -> V6: - check_ids() is modified to disallow mapping different 'old_id' to the same 'cur_id', check_scalar_ids() simplified (Andrii); - idset_push() updated to return -EFAULT instead of -1 (Andrii); - comments fixed in check_ids_in_regsafe() test case (Maxim Mikityanskiy); - fixed memset warning in states_equal() reported in [4]. - V4 -> V5 (all changes are based on feedback for V4 from Andrii): - mark_precise_scalar_ids() error code is updated to EFAULT; - bpf_verifier_env::idmap_scratch field type is changed to struct bpf_idmap to encapsulate temporary ID generation counter; - regsafe() is updated to call scalar_regs_exact() only for env->explore_alu_limits case (this had no measurable impact on verification duration when tested using veristat). - V3 -> V4: - check_ids() in regsafe() is replaced by check_scalar_ids(), as discussed with Andrii in [3], Note: I did not transfer Andrii's ack for patch #3 from V3 because of the changes to the algorithm. - reg_id_scratch is renamed to idset_scratch; - mark_precise_scalar_ids() is modified to propagate error from idset_push(); - test cases adjusted according to feedback from Andrii for V3. - V2 -> V3: - u32_hashset for IDs used for range transfer is removed; - mark_chain_precision() is updated as discussed with Andrii in [2]. - V1 -> v2: - 'rold->precise' and 'rold->id' checks are dropped as unsafe (thanks to discussion with Yonghong); - patches #3,4 adding tracking of ids used for range transfer in order to mitigate performance impact. - RFC -> V1: - Function verifier.c:mark_equal_scalars_as_read() is dropped, as it was an incorrect fix for problem solved by commit [3]. - check_ids() is called only for precise scalar values. - Test case updated to use inline assembly. [V1] https://lore.kernel.org/bpf/20230526184126.3104040-1-eddyz87@gmail.com/ [V2] https://lore.kernel.org/bpf/20230530172739.447290-1-eddyz87@gmail.com/ [V3] https://lore.kernel.org/bpf/20230606222411.1820404-1-eddyz87@gmail.com/ [V4] https://lore.kernel.org/bpf/20230609210143.2625430-1-eddyz87@gmail.com/ [V5] https://lore.kernel.org/bpf/20230612160801.2804666-1-eddyz87@gmail.com/ [RFC] https://lore.kernel.org/bpf/20221128163442.280187-1-eddyz87@gmail.com/ [1] https://gist.github.com/eddyz87/a32ea7e62a27d3c201117c9a39ab4286 [2] https://lore.kernel.org/bpf/20230530172739.447290-1-eddyz87@gmail.com/T/#mc21009dcd8574b195c1860a98014bb037f16f450 [3] https://lore.kernel.org/bpf/20230606222411.1820404-1-eddyz87@gmail.com/T/#m89da8eeb2fa8c9ca1202c5d0b6660e1f72e45e04 [4] https://lore.kernel.org/oe-kbuild-all/202306131550.U3M9AJGm-lkp@intel.com/ ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Jul 19, 2023
Petr Machata says: ==================== mlxsw: Manage RIF across PVID changes The mlxsw driver currently makes the assumption that the user applies configuration in a bottom-up manner. Thus netdevices need to be added to the bridge before IP addresses are configured on that bridge or SVI added on top of it. Enslaving a netdevice to another netdevice that already has uppers is in fact forbidden by mlxsw for this reason. Despite this safety, it is rather easy to get into situations where the offloaded configuration is just plain wrong. As an example, take a front panel port, configure an IP address: it gets a RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the port from the bridge again, but the RIF never comes back. There is a number of similar situations, where changing the configuration there and back utterly breaks the offload. The situation is going to be made better by implementing a range of replays and post-hoc offloads. In this patch set, address the ordering issues related to creation of bridge RIFs. Currently, mlxsw has several shortcomings with regards to RIF handling due to PVID changes: - In order to cause RIF for a bridge device to be created, the user is expected first to set PVID, then to add an IP address. The reverse ordering is disallowed, which is not very user-friendly. - When such bridge gets a VLAN upper whose VID was the same as the existing PVID, and this VLAN netdevice gets an IP address, a RIF is created for this netdevice. The new RIF is then assigned to the 802.1Q FID for the given VID. This results in a working configuration. However, then, when the VLAN netdevice is removed again, the RIF for the bridge itself is never reassociated to the PVID. - PVID cannot be changed once the bridge has uppers. Presumably this is because the driver does not manage RIFs properly in face of PVID changes. However, as the previous point shows, it is still possible to get into invalid configurations. This patch set addresses these issues and relaxes some of the ordering requirements that mlxsw had. The patch set proceeds as follows: - In patch #1, pass extack to mlxsw_sp_br_ban_rif_pvid_change() - To relax ordering between setting PVID and adding an IP address to a bridge, mlxsw must be able to request that a RIF is created with a given VLAN ID, instead of trying to deduce it from the current netdevice settings, which do not reflect the user-requested values yet. This is done in patches #2 and #3. - Similarly, mlxsw_sp_inetaddr_bridge_event() will need to make decisions based on the user-requested value of PVID, not the current value. Thus in patches #4 and #5, add a new argument which carries the requested PVID value. - Finally in patch #6 relax the ban on PVID changes when a bridge has uppers. Instead, add the logic necessary for creation of a RIF as a result of PVID change. - Relevant selftests are presented afterwards. In patch #7 a preparatory helper is added to lib.sh. Patches #8, #9, #10 and #11 include selftests themselves. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Jul 21, 2023
Ido Schimmel says: ==================== Add backup nexthop ID support tl;dr ===== This patchset adds a new bridge port attribute specifying the nexthop object ID to attach to a redirected skb as tunnel metadata. The ID is used by the VXLAN driver to choose the target VTEP for the skb. This is useful for EVPN multi-homing, where we want to redirect local (intra-rack) traffic upon carrier loss through one of the other VTEPs (ES peers) connected to the target host. Background ========== In a typical EVPN multi-homing setup each host is multi-homed using a set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf switches in a rack. These switches act as VTEPs and are not directly connected (as opposed to MLAG), but can communicate with each other (as well as with VTEPs in remote racks) via spine switches over L3. The control plane uses Type 1 routes [1] to create a mapping between an ES and VTEPs where the ES has active links. In addition, the control plane uses Type 2 routes [2] to create a mapping between {MAC, VLAN} and an ES. These tables are then used by the control plane to instruct VTEPs how to reach remote hosts. For example, assuming {MAC X, VLAN Y} is accessible via ES1 and this ES has active links to VTEP1 and VTEP2. The control plane will program the following entries to a remote VTEP: # ip nexthop add id 1 via $VTEP1_IP fdb # ip nexthop add id 2 via $VTEP2_IP fdb # ip nexthop add id 10 group 1/2 fdb # bridge fdb add $MAC_X dev vx0 master extern_learn vlan $VLAN_Y # bridge fdb add $MAC_Y dev vx0 self extern_learn nhid 10 src_vni $VNI_Y Remote traffic towards the host will be load balanced between VTEP1 and VTEP2. If the control plane notices a carrier loss on the ES1 link connected to VTEP1, it will issue a Type 1 route withdraw, prompting remote VTEPs to remove the effected nexthop from the group: # ip nexthop replace id 10 group 2 fdb Motivation ========== While remote traffic can be redirected to a VTEP with an active ES link by withdrawing a Type 1 route, the same is not true for local traffic. A host that is multi-homed to VTEP1 and VTEP2 via another ES (e.g., ES2) will send its traffic to {MAC X, VLAN Y} via one of these two switches, according to its LAG hash algorithm which is not under our control. If the traffic arrives at VTEP1 - which no longer has an active ES1 link - it will be dropped due to the carrier loss. In MLAG setups, the above problem is solved by redirecting the traffic through the peer link upon carrier loss. This is achieved by defining the peer link as the backup port of the host facing bond. For example: # bridge link set dev bond0 backup_port bond_peer Unlike MLAG, there is no peer link between the leaf switches in EVPN. Instead, upon carrier loss, local traffic should be redirected through one of the active ES peers. This can be achieved by defining the VXLAN port as the backup port of the host facing bonds. For example: # bridge link set dev es1_bond backup_port vx0 However, the VXLAN driver is not programmed with FDB entries for locally attached hosts and therefore does not know to which VTEP to redirect the traffic to. This will result in the traffic being replicated to all the VTEPs (potentially hundreds) in the network and each VTEP dropping the traffic, except for the active ES peer. Avoiding the flooding by programming local FDB entries in the VXLAN driver is not a viable solution as it requires to significantly increase the number of programmed FDB entries. Implementation ============== The proposed solution is to create an FDB nexthop group for each ES with the IP addresses of the active ES peers and set this ID as the backup nexthop ID (new bridge port attribute) of the ES link. For example, on VTEP1: # ip nexthop add id 1 via $VTEP2_IP fdb # ip nexthop add id 10 group 1 fdb # bridge link set dev es1_bond backup_nhid 10 # bridge link set dev es1_bond backup_port vx0 When the ES link loses its carrier, traffic will be redirected to the VXLAN port, but instead of only attaching the tunnel ID (i.e., VNI) as tunnel metadata to the skb, the backup nexthop ID will be attached as well. The VXLAN driver will then use this information to forward the skb via the nexthop object associated with the ID, as if the skb hit an FDB entry associated with this ID. Testing ======= A test for both the existing backup port attribute as well as the new backup nexthop ID attribute is added in patch #4. Patchset overview ================= Patch #1 extends the tunnel key structure with the new nexthop ID field. Patch #2 uses the new field in the VXLAN driver to forward packets via the specified nexthop ID. Patch #3 adds the new backup nexthop ID bridge port attribute and adjusts the bridge driver to attach the ID as tunnel metadata upon redirection. Patch #4 adds a selftest. iproute2 patches can be found here [3]. Changelog ========= Since RFC [4]: * Added Nik's tags. [1] https://datatracker.ietf.org/doc/html/rfc7432#section-7.1 [2] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2 [3] https://github.com/idosch/iproute2/tree/submit/backup_nhid_v1 [4] https://lore.kernel.org/netdev/20230713070925.3955850-1-idosch@nvidia.com/ ==================== Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Jul 28, 2023
The err_restore_domain flow was accidently inserted into the success path in commit 1000dcc ("iommu: Allow IOMMU_RESV_DIRECT to work on ARM"). It should only happen if iommu_create_device_direct_mappings() fails. This caused the domains the be wrongly changed and freed whenever the sysfs is used, resulting in an oops: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 3417 Comm: avocado Not tainted 6.4.0-rc4-next-20230602 #3 Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.3.6 07/06/2021 RIP: 0010:__iommu_attach_device+0xc/0xa0 Code: c0 c3 cc cc cc cc 48 89 f0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 48 8b 47 08 <48> 8b 00 48 85 c0 74 74 48 89 f5 e8 64 12 49 00 41 89 c4 85 c0 74 RSP: 0018:ffffabae0220bd48 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff9ac04f70e410 RCX: 0000000000000001 RDX: ffff9ac044db20c0 RSI: ffff9ac044fa50d0 RDI: ffff9ac04f70e410 RBP: ffff9ac044fa50d0 R08: 1000000100209001 R09: 00000000000002dc R10: 0000000000000000 R11: 0000000000000000 R12: ffff9ac043d54700 R13: ffff9ac043d54700 R14: 0000000000000001 R15: 0000000000000001 FS: 00007f02e30ae000(0000) GS:ffff9afeb2440000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000012afca006 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> ? __die+0x24/0x70 ? page_fault_oops+0x82/0x150 ? __iommu_queue_command_sync+0x80/0xc0 ? exc_page_fault+0x69/0x150 ? asm_exc_page_fault+0x26/0x30 ? __iommu_attach_device+0xc/0xa0 ? __iommu_attach_device+0x1c/0xa0 __iommu_device_set_domain+0x42/0x80 __iommu_group_set_domain_internal+0x5d/0x160 iommu_setup_default_domain+0x318/0x400 iommu_group_store_type+0xb1/0x200 kernfs_fop_write_iter+0x12f/0x1c0 vfs_write+0x2a2/0x3b0 ksys_write+0x63/0xe0 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7f02e2f14a6f Reorganize the error flow so that the success branch and error branches are clearer. Fixes: 1000dcc ("iommu: Allow IOMMU_RESV_DIRECT to work on ARM") Reported-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com> Tested-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/0-v1-5bd8cc969d9e+1f1-iommu_set_def_fix_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
borkmann
pushed a commit
that referenced
this pull request
Jul 28, 2023
Victor Nogueira says: ==================== net: sched: Fixes for classifiers Four different classifiers (bpf, u32, matchall, and flower) are calling tcf_bind_filter in their callbacks, but arent't undoing it by calling tcf_unbind_filter if their was an error after binding. This patch set fixes all this by calling tcf_unbind_filter in such cases. This set also undoes a refcount decrement in cls_u32 when an update fails under specific conditions which are described in patch #3. v1 -> v2: * Remove blank line after fixes tag * Fix reverse xmas tree issues pointed out by Simon v2 -> v3: * Inline functions cls_bpf_set_parms and fl_set_parms to avoid adding yet another parameter (and a return value at it) to them. * Remove similar fixes for u32 and matchall, which will be sent soon, once we find a way to do the fixes without adding a return parameter to their set_parms functions. v3 -> v4: * Inline mall_set_parms to avoid adding yet another parameter. * Remove set_flags parameter from u32_set_parms and create a separate function for calling tcf_bind_filter and tcf_unbind_filter in case of failure. * Change cover letter title to also encompass refcnt fix for u32 v4 -> v5: * Change back tag to net ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Jul 28, 2023
Petr Machata says: ==================== mlxsw: Permit enslavement to netdevices with uppers The mlxsw driver currently makes the assumption that the user applies configuration in a bottom-up manner. Thus netdevices need to be added to the bridge before IP addresses are configured on that bridge or SVI added on top of it. Enslaving a netdevice to another netdevice that already has uppers is in fact forbidden by mlxsw for this reason. Despite this safety, it is rather easy to get into situations where the offloaded configuration is just plain wrong. As an example, take a front panel port, configure an IP address: it gets a RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the port from the bridge again, but the RIF never comes back. There is a number of similar situations, where changing the configuration there and back utterly breaks the offload. Similarly, detaching a front panel port from a configured topology means unoffloading of this whole topology -- VLAN uppers, next hops, etc. Attaching the port back is then not permitted at all. If it were, it would not result in a working configuration, because much of mlxsw is written to react to changes in immediate configuration. There is nothing that would go visit netdevices in the attached-to topology and offload existing routes and VLAN memberships, for example. In this patchset, introduce a number of replays to be invoked so that this sort of post-hoc offload is supported. Then remove the vetoes that disallowed enslavement of front panel ports to other netdevices with uppers. The patchset progresses as follows: - In patch #1, fix an issue in the bridge driver. To my knowledge, the issue could not have resulted in a buggy behavior previously, and thus is packaged with this patchset instead of being sent separately to net. - In patch #2, add a new helper to the switchdev code. - In patch #3, drop mlxsw selftests that will not be relevant after this patchset anymore. - Patches #4, #5, #6, #7 and #8 prepare the codebase for smoother introduction of the rest of the code. - Patches #9, #10, #11, #12, torvalds#13 and torvalds#14 replay various aspects of upper configuration when a front panel port is introduced into a topology. Individual patches take care of bridge and LAG RIF memberships, switchdev replay, nexthop and neighbors replay, and MACVLAN offload. - Patches torvalds#15 and torvalds#16 introduce RIFs for newly-relevant netdevices when a front panel port is enslaved (in which case all uppers are newly relevant), or, respectively, deslaved (in which case the newly-relevant netdevice is the one being deslaved). - Up until this point, the introduced scaffolding was not really used, because mlxsw still forbids enslavement of mlxsw netdevices to uppers with uppers. In patch torvalds#17, this condition is finally relaxed. A sizable selftest suite is available to test all this new code. That will be sent in a separate patchset. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Aug 4, 2023
The cited commit holds encap tbl lock unconditionally when setting up dests. But it may cause the following deadlock: PID: 1063722 TASK: ffffa062ca5d0000 CPU: 13 COMMAND: "handler8" #0 [ffffb14de05b7368] __schedule at ffffffffa1d5aa91 #1 [ffffb14de05b7410] schedule at ffffffffa1d5afdb #2 [ffffb14de05b7430] schedule_preempt_disabled at ffffffffa1d5b528 #3 [ffffb14de05b7440] __mutex_lock at ffffffffa1d5d6cb #4 [ffffb14de05b74e8] mutex_lock_nested at ffffffffa1d5ddeb #5 [ffffb14de05b74f8] mlx5e_tc_tun_encap_dests_set at ffffffffc12f2096 [mlx5_core] #6 [ffffb14de05b7568] post_process_attr at ffffffffc12d9fc5 [mlx5_core] #7 [ffffb14de05b75a0] mlx5e_tc_add_fdb_flow at ffffffffc12de877 [mlx5_core] #8 [ffffb14de05b75f0] __mlx5e_add_fdb_flow at ffffffffc12e0eef [mlx5_core] #9 [ffffb14de05b7660] mlx5e_tc_add_flow at ffffffffc12e12f7 [mlx5_core] #10 [ffffb14de05b76b8] mlx5e_configure_flower at ffffffffc12e1686 [mlx5_core] #11 [ffffb14de05b7720] mlx5e_rep_indr_offload at ffffffffc12e3817 [mlx5_core] #12 [ffffb14de05b7730] mlx5e_rep_indr_setup_tc_cb at ffffffffc12e388a [mlx5_core] torvalds#13 [ffffb14de05b7740] tc_setup_cb_add at ffffffffa1ab2ba8 torvalds#14 [ffffb14de05b77a0] fl_hw_replace_filter at ffffffffc0bdec2f [cls_flower] torvalds#15 [ffffb14de05b7868] fl_change at ffffffffc0be6caa [cls_flower] torvalds#16 [ffffb14de05b7908] tc_new_tfilter at ffffffffa1ab71f0 [1031218.028143] wait_for_completion+0x24/0x30 [1031218.028589] mlx5e_update_route_decap_flows+0x9a/0x1e0 [mlx5_core] [1031218.029256] mlx5e_tc_fib_event_work+0x1ad/0x300 [mlx5_core] [1031218.029885] process_one_work+0x24e/0x510 Actually no need to hold encap tbl lock if there is no encap action. Fix it by checking if encap action exists or not before holding encap tbl lock. Fixes: 37c3b9f ("net/mlx5e: Prevent encap offload when neigh update is running") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
borkmann
pushed a commit
that referenced
this pull request
Aug 4, 2023
syzkaller found zero division error [0] in div_s64_rem() called from get_cycle_time_elapsed(), where sched->cycle_time is the divisor. We have tests in parse_taprio_schedule() so that cycle_time will never be 0, and actually cycle_time is not 0 in get_cycle_time_elapsed(). The problem is that the types of divisor are different; cycle_time is s64, but the argument of div_s64_rem() is s32. syzkaller fed this input and 0x100000000 is cast to s32 to be 0. @TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME={0xc, 0x8, 0x100000000} We use s64 for cycle_time to cast it to ktime_t, so let's keep it and set max for cycle_time. While at it, we prevent overflow in setup_txtime() and add another test in parse_taprio_schedule() to check if cycle_time overflows. Also, we add a new tdc test case for this issue. [0]: divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 1 PID: 103 Comm: kworker/1:3 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: ipv6_addrconf addrconf_dad_work RIP: 0010:div_s64_rem include/linux/math64.h:42 [inline] RIP: 0010:get_cycle_time_elapsed net/sched/sch_taprio.c:223 [inline] RIP: 0010:find_entry_to_transmit+0x252/0x7e0 net/sched/sch_taprio.c:344 Code: 3c 02 00 0f 85 5e 05 00 00 48 8b 4c 24 08 4d 8b bd 40 01 00 00 48 8b 7c 24 48 48 89 c8 4c 29 f8 48 63 f7 48 99 48 89 74 24 70 <48> f7 fe 48 29 d1 48 8d 04 0f 49 89 cc 48 89 44 24 20 49 8d 85 10 RSP: 0018:ffffc90000acf260 EFLAGS: 00010206 RAX: 177450e0347560cf RBX: 0000000000000000 RCX: 177450e0347560cf RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000100000000 RBP: 0000000000000056 R08: 0000000000000000 R09: ffffed10020a0934 R10: ffff8880105049a7 R11: ffff88806cf3a520 R12: ffff888010504800 R13: ffff88800c00d800 R14: ffff8880105049a0 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0edf84f0e8 CR3: 000000000d73c002 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> get_packet_txtime net/sched/sch_taprio.c:508 [inline] taprio_enqueue_one+0x900/0xff0 net/sched/sch_taprio.c:577 taprio_enqueue+0x378/0xae0 net/sched/sch_taprio.c:658 dev_qdisc_enqueue+0x46/0x170 net/core/dev.c:3732 __dev_xmit_skb net/core/dev.c:3821 [inline] __dev_queue_xmit+0x1b2f/0x3000 net/core/dev.c:4169 dev_queue_xmit include/linux/netdevice.h:3088 [inline] neigh_resolve_output net/core/neighbour.c:1552 [inline] neigh_resolve_output+0x4a7/0x780 net/core/neighbour.c:1532 neigh_output include/net/neighbour.h:544 [inline] ip6_finish_output2+0x924/0x17d0 net/ipv6/ip6_output.c:135 __ip6_finish_output+0x620/0xaa0 net/ipv6/ip6_output.c:196 ip6_finish_output net/ipv6/ip6_output.c:207 [inline] NF_HOOK_COND include/linux/netfilter.h:292 [inline] ip6_output+0x206/0x410 net/ipv6/ip6_output.c:228 dst_output include/net/dst.h:458 [inline] NF_HOOK.constprop.0+0xea/0x260 include/linux/netfilter.h:303 ndisc_send_skb+0x872/0xe80 net/ipv6/ndisc.c:508 ndisc_send_ns+0xb5/0x130 net/ipv6/ndisc.c:666 addrconf_dad_work+0xc14/0x13f0 net/ipv6/addrconf.c:4175 process_one_work+0x92c/0x13a0 kernel/workqueue.c:2597 worker_thread+0x60f/0x1240 kernel/workqueue.c:2748 kthread+0x2fe/0x3f0 kernel/kthread.c:389 ret_from_fork+0x2c/0x50 arch/x86/entry/entry_64.S:308 </TASK> Modules linked in: Fixes: 4cfd577 ("taprio: Add support for txtime-assist mode") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Co-developed-by: Eric Dumazet <edumazet@google.com> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Aug 14, 2023
We are seeing kernel crash in below test scenario: 1. make DUT connect to an WPA3 encrypted 11ax AP in Ch44 HE80 2. use "wpa_cli -i <inf> disconnect" to disconnect 3. wait for DUT to automatically reconnect Kernel crashes while waiting, below shows the crash stack: [ 755.120868] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 755.120871] #PF: supervisor read access in kernel mode [ 755.120872] #PF: error_code(0x0000) - not-present page [ 755.120873] PGD 0 P4D 0 [ 755.120875] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 755.120876] CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Not tainted 5.19.0-rc1+ #3 [ 755.120878] Hardware name: Intel(R) Client Systems NUC11PHi7/NUC11PHBi7, BIOS PHTGL579.0063.2021.0707.1057 07/07/2021 [ 755.120879] RIP: 0010:ath12k_dp_process_rx_err+0x2b6/0x14a0 [ath12k] [ 755.120890] Code: 01 c0 48 c1 e0 05 48 8b 9c 07 b8 b2 00 00 48 c7 c0 61 ff 0e c1 48 85 db 53 48 0f 44 c6 48 c7 c6 80 9d 0f c1 50 e8 1a 25 00 00 <4c> 8b 3b 4d 8b 76 14 41 59 41 5a 41 8b 87 78 43 01 00 4d 85 f6 89 [ 755.120891] RSP: 0018:ffff9a93402c8d10 EFLAGS: 00010282 [ 755.120892] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000303 [ 755.120893] RDX: 0000000000000000 RSI: ffffffff93b7cbe9 RDI: 00000000ffffffff [ 755.120894] RBP: ffff9a93402c8e50 R08: ffffffff93e65360 R09: ffffffff942e044d [ 755.120894] R10: 0000000000000000 R11: 0000000000000063 R12: ffff8dbec5420000 [ 755.120895] R13: ffff8dbec5420000 R14: ffff8dbdefe9a0a0 R15: ffff8dbec5420000 [ 755.120896] FS: 0000000000000000(0000) GS:ffff8dc2705c0000(0000) knlGS:0000000000000000 [ 755.120897] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 755.120898] CR2: 0000000000000000 CR3: 0000000107be4005 CR4: 0000000000770ee0 [ 755.120898] PKRU: 55555554 [ 755.120899] Call Trace: [ 755.120900] <IRQ> [ 755.120903] ? ath12k_pci_write32+0x2e/0x80 [ath12k] [ 755.120910] ath12k_dp_service_srng+0x214/0x2e0 [ath12k] [ 755.120917] ath12k_pci_ext_grp_napi_poll+0x26/0x80 [ath12k] [ 755.120923] __napi_poll+0x2b/0x1c0 [ 755.120925] net_rx_action+0x2a1/0x2f0 [ 755.120927] __do_softirq+0xfa/0x2e9 [ 755.120929] irq_exit_rcu+0xb9/0xd0 [ 755.120932] common_interrupt+0xc1/0xe0 [ 755.120934] </IRQ> [ 755.120934] <TASK> [ 755.120935] asm_common_interrupt+0x2c/0x40 [ 755.120936] RIP: 0010:cpuidle_enter_state+0xdd/0x3a0 [ 755.120938] Code: 00 31 ff e8 45 e2 74 ff 80 7d d7 00 74 16 9c 58 0f 1f 40 00 f6 c4 02 0f 85 a0 02 00 00 31 ff e8 69 79 7b ff fb 0f 1f 44 00 00 <45> 85 ff 0f 88 6d 01 00 00 49 63 d7 4c 2b 6d c8 48 8d 04 52 48 8d [ 755.120939] RSP: 0018:ffff9a934018be50 EFLAGS: 00000246 [ 755.120940] RAX: ffff8dc2705c0000 RBX: 0000000000000002 RCX: 000000000000001f [ 755.120941] RDX: 000000afd0b532d3 RSI: ffffffff93b7cbe9 RDI: ffffffff93b8b66e [ 755.120942] RBP: ffff9a934018be88 R08: 0000000000000002 R09: 0000000000030500 [ 755.120942] R10: ffff9a934018be18 R11: 0000000000000741 R12: ffffba933fdc0600 [ 755.120943] R13: 000000afd0b532d3 R14: ffffffff93fcbc60 R15: 0000000000000002 [ 755.120945] cpuidle_enter+0x2e/0x40 [ 755.120946] call_cpuidle+0x23/0x40 [ 755.120948] do_idle+0x1ff/0x260 [ 755.120950] cpu_startup_entry+0x1d/0x20 [ 755.120951] start_secondary+0x10d/0x130 [ 755.120953] secondary_startup_64_no_verify+0xd3/0xdb [ 755.120956] </TASK> [ 755.120956] Modules linked in: michael_mic rfcomm cmac algif_hash algif_skcipher af_alg bnep qrtr_mhi intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm_intel qrtr snd_hda_codec_hdmi kvm irqbypass ath12k snd_hda_intel snd_seq_midi crct10dif_pclmul mhi ghash_clmulni_intel snd_intel_dspcfg snd_seq_midi_event aesni_intel qmi_helpers i915 snd_rawmidi crypto_simd snd_hda_codec cryptd cec intel_cstate snd_hda_core mac80211 rc_core nouveau snd_seq snd_hwdep btusb drm_buddy drm_ttm_helper nls_iso8859_1 snd_pcm ttm btrtl snd_seq_device wmi_bmof mxm_wmi input_leds cfg80211 joydev btbcm drm_display_helper snd_timer btintel mei_me libarc4 drm_kms_helper bluetooth i2c_algo_bit snd fb_sys_fops syscopyarea mei sysfillrect ecdh_generic soundcore sysimgblt ecc acpi_pad mac_hid sch_fq_codel ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp ramoops parport reed_solomon drm efi_pstore ip_tables x_tables autofs4 [ 755.120992] hid_generic usbhid hid ax88179_178a usbnet mii nvme nvme_core rtsx_pci_sdmmc crc32_pclmul i2c_i801 intel_lpss_pci i2c_smbus intel_lpss rtsx_pci idma64 virt_dma vmd wmi video [ 755.121002] CR2: 0000000000000000 The crash is because, for WCN7850, only ab->pdev[0] is initialized, while mac_id here is misused to retrieve pdev and it is not zero, leading to a NULL pointer access. Fix this issue by getting pdev_id first and then use it to retrieve pdev. Also fix some other code snippets which have the same issue. Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0-03427-QCAHMTSWPL_V1.0_V2.0_SILICONZ-1.15378.4 Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com> Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com> Link: https://lore.kernel.org/r/20230714080658.3140-1-quic_bqiang@quicinc.com
borkmann
pushed a commit
that referenced
this pull request
Aug 14, 2023
Alexander Lobakin says: ==================== page_pool: a couple of assorted optimizations That initially was a spin-off of the IAVF PP series[0], but has grown (and shrunk) since then a bunch. In fact, it consists of three semi-independent blocks: * #1-2: Compile-time optimization. Split page_pool.h into 2 headers to not overbloat the consumers not needing complex inline helpers and then stop including it in skbuff.h at all. The first patch is also prereq for the whole series. * #3: Improve cacheline locality for users of the Page Pool frag API. * #4-6: Use direct cache recycling more aggressively, when it is safe obviously. In addition, make sure nobody wants to use Page Pool API with disabled interrupts. Patches #1 and #5 are authored by Yunsheng and Jakub respectively, with small modifications from my side as per ML discussions. For the perf numbers for #3-6, please see individual commit messages. Also available on my GH with many more Page Pool goodies[1]. [0] https://lore.kernel.org/netdev/20230530150035.1943669-1-aleksander.lobakin@intel.com [1] https://github.com/alobakin/linux/commits/iavf-pp-frag ==================== Link: https://lore.kernel.org/r/20230804180529.2483231-1-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Aug 14, 2023
Ido Schimmel says: ==================== nexthop: Nexthop dump fixes Patches #1 and #3 fix two problems related to nexthops and nexthop buckets dump, respectively. Patch #2 is a preparation for the third patch. The pattern described in these patches of splitting the NLMSG_DONE to a separate response is prevalent in other rtnetlink dump callbacks. I don't know if it's because I'm missing something or if this was done intentionally to ensure the message is delivered to user space. After commit 0642840 ("af_netlink: ensure that NLMSG_DONE never fails in dumps") this is no longer necessary and I can improve these dump callbacks assuming this analysis is correct. No regressions in existing tests: # ./fib_nexthops.sh [...] Tests passed: 230 Tests failed: 0 ==================== Link: https://lore.kernel.org/r/20230808075233.3337922-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Aug 18, 2023
Petr Machata says: ==================== mlxsw: Support traffic redirection from a locked bridge port Ido Schimmel writes: It is possible to add a filter that redirects traffic from the ingress of a bridge port that is locked (i.e., performs security / SMAC lookup) and has learning enabled. For example: # ip link add name br0 type bridge # ip link set dev swp1 master br0 # bridge link set dev swp1 learning on locked on mab on # tc qdisc add dev swp1 clsact # tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw src_ip 192.0.2.1 action mirred egress redirect dev swp2 In the kernel's Rx path, this filter is evaluated before the Rx handler of the bridge, which means that redirected traffic should not be affected by bridge port configuration such as learning. However, the hardware data path is a bit different and the redirect action (FORWARDING_ACTION in hardware) merely attaches a pointer to the packet, which is later used by the L2 lookup stage to understand how to forward the packet. Between both stages - ingress ACL and L2 lookup - learning and security lookup are performed, which means that redirected traffic is affected by bridge port configuration, unlike in the kernel's data path. The learning discrepancy was handled in commit 577fa14 ("mlxsw: spectrum: Do not process learned records with a dummy FID") by simply ignoring learning notifications generated by the redirected traffic. A similar solution is not possible for the security / SMAC lookup since - unlike learning - the CPU is not involved and packets that failed the lookup are dropped by the device. Instead, solve this by prepending the ignore action to the redirect action and use it to instruct the device to disable both learning and the security / SMAC lookup for redirected traffic. Patch #1 adds the ignore action. Patch #2 prepends the action to the redirect action in flower offload code. Patch #3 removes the workaround in commit 577fa14 ("mlxsw: spectrum: Do not process learned records with a dummy FID") since it is no longer needed. Patch #4 adds a test case. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Aug 24, 2023
qxl_mode_dumb_create() dereferences the qobj returned by qxl_gem_object_create_with_handle(), but the handle is the only one holding a reference to it. A potential attacker could guess the returned handle value and closes it between the return of qxl_gem_object_create_with_handle() and the qobj usage, triggering a use-after-free scenario. Reproducer: int dri_fd =-1; struct drm_mode_create_dumb arg = {0}; void gem_close(int handle); void* trigger(void* ptr) { int ret; arg.width = arg.height = 0x20; arg.bpp = 32; ret = ioctl(dri_fd, DRM_IOCTL_MODE_CREATE_DUMB, &arg); if(ret) { perror("[*] DRM_IOCTL_MODE_CREATE_DUMB Failed"); exit(-1); } gem_close(arg.handle); while(1) { struct drm_mode_create_dumb args = {0}; args.width = args.height = 0x20; args.bpp = 32; ret = ioctl(dri_fd, DRM_IOCTL_MODE_CREATE_DUMB, &args); if (ret) { perror("[*] DRM_IOCTL_MODE_CREATE_DUMB Failed"); exit(-1); } printf("[*] DRM_IOCTL_MODE_CREATE_DUMB created, %d\n", args.handle); gem_close(args.handle); } return NULL; } void gem_close(int handle) { struct drm_gem_close args; args.handle = handle; int ret = ioctl(dri_fd, DRM_IOCTL_GEM_CLOSE, &args); // gem close handle if (!ret) printf("gem close handle %d\n", args.handle); } int main(void) { dri_fd= open("/dev/dri/card0", O_RDWR); printf("fd:%d\n", dri_fd); if(dri_fd == -1) return -1; pthread_t tid1; if(pthread_create(&tid1,NULL,trigger,NULL)){ perror("[*] thread_create tid1\n"); return -1; } while (1) { gem_close(arg.handle); } return 0; } This is a KASAN report: ================================================================== BUG: KASAN: slab-use-after-free in qxl_mode_dumb_create+0x3c2/0x400 linux/drivers/gpu/drm/qxl/qxl_dumb.c:69 Write of size 1 at addr ffff88801136c240 by task poc/515 CPU: 1 PID: 515 Comm: poc Not tainted 6.3.0 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 Call Trace: <TASK> __dump_stack linux/lib/dump_stack.c:88 dump_stack_lvl+0x48/0x70 linux/lib/dump_stack.c:106 print_address_description linux/mm/kasan/report.c:319 print_report+0xd2/0x660 linux/mm/kasan/report.c:430 kasan_report+0xd2/0x110 linux/mm/kasan/report.c:536 __asan_report_store1_noabort+0x17/0x30 linux/mm/kasan/report_generic.c:383 qxl_mode_dumb_create+0x3c2/0x400 linux/drivers/gpu/drm/qxl/qxl_dumb.c:69 drm_mode_create_dumb linux/drivers/gpu/drm/drm_dumb_buffers.c:96 drm_mode_create_dumb_ioctl+0x1f5/0x2d0 linux/drivers/gpu/drm/drm_dumb_buffers.c:102 drm_ioctl_kernel+0x21d/0x430 linux/drivers/gpu/drm/drm_ioctl.c:788 drm_ioctl+0x56f/0xcc0 linux/drivers/gpu/drm/drm_ioctl.c:891 vfs_ioctl linux/fs/ioctl.c:51 __do_sys_ioctl linux/fs/ioctl.c:870 __se_sys_ioctl linux/fs/ioctl.c:856 __x64_sys_ioctl+0x13d/0x1c0 linux/fs/ioctl.c:856 do_syscall_x64 linux/arch/x86/entry/common.c:50 do_syscall_64+0x5b/0x90 linux/arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc linux/arch/x86/entry/entry_64.S:120 RIP: 0033:0x7ff5004ff5f7 Code: 00 00 00 48 8b 05 99 c8 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 c8 0d 00 f7 d8 64 89 01 48 RSP: 002b:00007ff500408ea8 EFLAGS: 00000286 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff5004ff5f7 RDX: 00007ff500408ec0 RSI: 00000000c02064b2 RDI: 0000000000000003 RBP: 00007ff500408ef0 R08: 0000000000000000 R09: 000000000000002a R10: 0000000000000000 R11: 0000000000000286 R12: 00007fff1c6cdafe R13: 00007fff1c6cdaff R14: 00007ff500408fc0 R15: 0000000000802000 </TASK> Allocated by task 515: kasan_save_stack+0x38/0x70 linux/mm/kasan/common.c:45 kasan_set_track+0x25/0x40 linux/mm/kasan/common.c:52 kasan_save_alloc_info+0x1e/0x40 linux/mm/kasan/generic.c:510 ____kasan_kmalloc linux/mm/kasan/common.c:374 __kasan_kmalloc+0xc3/0xd0 linux/mm/kasan/common.c:383 kasan_kmalloc linux/./include/linux/kasan.h:196 kmalloc_trace+0x48/0xc0 linux/mm/slab_common.c:1066 kmalloc linux/./include/linux/slab.h:580 kzalloc linux/./include/linux/slab.h:720 qxl_bo_create+0x11a/0x610 linux/drivers/gpu/drm/qxl/qxl_object.c:124 qxl_gem_object_create+0xd9/0x360 linux/drivers/gpu/drm/qxl/qxl_gem.c:58 qxl_gem_object_create_with_handle+0xa1/0x180 linux/drivers/gpu/drm/qxl/qxl_gem.c:89 qxl_mode_dumb_create+0x1cd/0x400 linux/drivers/gpu/drm/qxl/qxl_dumb.c:63 drm_mode_create_dumb linux/drivers/gpu/drm/drm_dumb_buffers.c:96 drm_mode_create_dumb_ioctl+0x1f5/0x2d0 linux/drivers/gpu/drm/drm_dumb_buffers.c:102 drm_ioctl_kernel+0x21d/0x430 linux/drivers/gpu/drm/drm_ioctl.c:788 drm_ioctl+0x56f/0xcc0 linux/drivers/gpu/drm/drm_ioctl.c:891 vfs_ioctl linux/fs/ioctl.c:51 __do_sys_ioctl linux/fs/ioctl.c:870 __se_sys_ioctl linux/fs/ioctl.c:856 __x64_sys_ioctl+0x13d/0x1c0 linux/fs/ioctl.c:856 do_syscall_x64 linux/arch/x86/entry/common.c:50 do_syscall_64+0x5b/0x90 linux/arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc linux/arch/x86/entry/entry_64.S:120 Freed by task 515: kasan_save_stack+0x38/0x70 linux/mm/kasan/common.c:45 kasan_set_track+0x25/0x40 linux/mm/kasan/common.c:52 kasan_save_free_info+0x2e/0x60 linux/mm/kasan/generic.c:521 ____kasan_slab_free linux/mm/kasan/common.c:236 ____kasan_slab_free+0x180/0x1f0 linux/mm/kasan/common.c:200 __kasan_slab_free+0x12/0x30 linux/mm/kasan/common.c:244 kasan_slab_free linux/./include/linux/kasan.h:162 slab_free_hook linux/mm/slub.c:1781 slab_free_freelist_hook+0xd2/0x1a0 linux/mm/slub.c:1807 slab_free linux/mm/slub.c:3787 __kmem_cache_free+0x196/0x2d0 linux/mm/slub.c:3800 kfree+0x78/0x120 linux/mm/slab_common.c:1019 qxl_ttm_bo_destroy+0x140/0x1a0 linux/drivers/gpu/drm/qxl/qxl_object.c:49 ttm_bo_release+0x678/0xa30 linux/drivers/gpu/drm/ttm/ttm_bo.c:381 kref_put linux/./include/linux/kref.h:65 ttm_bo_put+0x50/0x80 linux/drivers/gpu/drm/ttm/ttm_bo.c:393 qxl_gem_object_free+0x3e/0x60 linux/drivers/gpu/drm/qxl/qxl_gem.c:42 drm_gem_object_free+0x5c/0x90 linux/drivers/gpu/drm/drm_gem.c:974 kref_put linux/./include/linux/kref.h:65 __drm_gem_object_put linux/./include/drm/drm_gem.h:431 drm_gem_object_put linux/./include/drm/drm_gem.h:444 qxl_gem_object_create_with_handle+0x151/0x180 linux/drivers/gpu/drm/qxl/qxl_gem.c:100 qxl_mode_dumb_create+0x1cd/0x400 linux/drivers/gpu/drm/qxl/qxl_dumb.c:63 drm_mode_create_dumb linux/drivers/gpu/drm/drm_dumb_buffers.c:96 drm_mode_create_dumb_ioctl+0x1f5/0x2d0 linux/drivers/gpu/drm/drm_dumb_buffers.c:102 drm_ioctl_kernel+0x21d/0x430 linux/drivers/gpu/drm/drm_ioctl.c:788 drm_ioctl+0x56f/0xcc0 linux/drivers/gpu/drm/drm_ioctl.c:891 vfs_ioctl linux/fs/ioctl.c:51 __do_sys_ioctl linux/fs/ioctl.c:870 __se_sys_ioctl linux/fs/ioctl.c:856 __x64_sys_ioctl+0x13d/0x1c0 linux/fs/ioctl.c:856 do_syscall_x64 linux/arch/x86/entry/common.c:50 do_syscall_64+0x5b/0x90 linux/arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc linux/arch/x86/entry/entry_64.S:120 The buggy address belongs to the object at ffff88801136c000 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 576 bytes inside of freed 1024-byte region [ffff88801136c000, ffff88801136c400) The buggy address belongs to the physical page: page:0000000089fc329b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11368 head:0000000089fc329b order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0xfffffc0010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0010200 ffff888007841dc0 dead000000000122 0000000000000000 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88801136c100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88801136c180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88801136c200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88801136c280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88801136c300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Disabling lock debugging due to kernel taint Instead of returning a weak reference to the qxl_bo object, return the created drm_gem_object and let the caller decrement the reference count when it no longer needs it. As a convenience, if the caller is not interested in the gobj object, it can pass NULL to the parameter and the reference counting is descremented internally. The bug and the reproducer were originally found by the Zero Day Initiative project (ZDI-CAN-20940). Link: https://www.zerodayinitiative.com/ Signed-off-by: Wander Lairson Costa <wander@redhat.com> Cc: stable@vger.kernel.org Reviewed-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230814165119.90847-1-wander@redhat.com
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
Noticed with: make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin Direct leak of 45 byte(s) in 1 object(s) allocated from: #0 0x7f213f87243b in strdup (/lib64/libasan.so.8+0x7243b) #1 0x63d15f in evsel__set_filter util/evsel.c:1371 #2 0x63d15f in evsel__append_filter util/evsel.c:1387 #3 0x63d15f in evsel__append_tp_filter util/evsel.c:1400 #4 0x62cd52 in evlist__append_tp_filter util/evlist.c:1145 #5 0x62cd52 in evlist__append_tp_filter_pids util/evlist.c:1196 #6 0x541e49 in trace__set_filter_loop_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3646 #7 0x541e49 in trace__set_filter_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3670 #8 0x541e49 in trace__run /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3970 #9 0x541e49 in cmd_trace /home/acme/git/perf-tools/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools/tools/perf/perf.c:377 #12 0x4196da in run_argv /home/acme/git/perf-tools/tools/perf/perf.c:421 torvalds#13 0x4196da in main /home/acme/git/perf-tools/tools/perf/perf.c:537 torvalds#14 0x7f213e84a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Free it on evsel__exit(). Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/20230719202951.534582-2-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
To plug these leaks detected with: $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin ================================================================= ==473890==ERROR: LeakSanitizer: detected memory leaks Direct leak of 112 byte(s) in 1 object(s) allocated from: #0 0x7fdf19aba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987836 in zalloc (/home/acme/bin/perf+0x987836) #2 0x5367ae in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1289 #3 0x5367ae in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307 #4 0x5367ae in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468 #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177 #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685 #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712 #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055 #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 torvalds#13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Direct leak of 2048 byte(s) in 1 object(s) allocated from: #0 0x7f788fcba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af) #1 0x5337c0 in trace__sys_enter /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2342 #2 0x52bfb4 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3191 #3 0x52bfb4 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3699 #4 0x542883 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3726 #5 0x542883 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4069 #6 0x542883 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5155 #7 0x5ef232 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #8 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #9 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #10 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 #11 0x7f788ec4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Indirect leak of 48 byte(s) in 1 object(s) allocated from: #0 0x7fdf19aba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af) #1 0x77b335 in intlist__new util/intlist.c:116 #2 0x5367fd in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1293 #3 0x5367fd in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307 #4 0x5367fd in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468 #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177 #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685 #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712 #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055 #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 torvalds#13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/20230719202951.534582-4-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
In 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv") it only was freeing if strcmp(evsel->tp_format->system, "syscalls") returned zero, while the corresponding initialization of evsel->priv was being performed if it was _not_ zero, i.e. if the tp system wasn't 'syscalls'. Just stop looking for that and free it if evsel->priv was set, which should be equivalent. Also use the pre-existing evsel_trace__delete() function. This resolves these leaks, detected with: $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin ================================================================= ==481565==ERROR: LeakSanitizer: detected memory leaks Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966) #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307 #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333 #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458 #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480 #6 0x540e8b in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3212 #7 0x540e8b in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891 #8 0x540e8b in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156 #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966) #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307 #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333 #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458 #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480 #6 0x540dd1 in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3205 #7 0x540dd1 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891 #8 0x540dd1 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156 #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) SUMMARY: AddressSanitizer: 80 byte(s) leaked in 2 allocation(s). [root@quaco ~]# With this we plug all leaks with "perf trace sleep 1". Fixes: 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv") Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Riccardo Mancini <rickyman7@gmail.com> Link: https://lore.kernel.org/lkml/20230719202951.534582-5-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
…failure to add a probe Building perf with EXTRA_CFLAGS="-fsanitize=address" a leak is detect when trying to add a probe to a non-existent function: # perf probe -x ~/bin/perf dso__neW Probe point 'dso__neW' not found. Error: Failed to add events. ================================================================= ==296634==ERROR: LeakSanitizer: detected memory leaks Direct leak of 128 byte(s) in 1 object(s) allocated from: #0 0x7f67642ba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x7f67641a76f1 in allocate_cfi (/lib64/libdw.so.1+0x3f6f1) Direct leak of 65 byte(s) in 1 object(s) allocated from: #0 0x7f67642b95b5 in __interceptor_realloc.part.0 (/lib64/libasan.so.8+0xb95b5) #1 0x6cac75 in strbuf_grow util/strbuf.c:64 #2 0x6ca934 in strbuf_init util/strbuf.c:25 #3 0x9337d2 in synthesize_perf_probe_point util/probe-event.c:2018 #4 0x92be51 in try_to_find_probe_trace_events util/probe-event.c:964 #5 0x93d5c6 in convert_to_probe_trace_events util/probe-event.c:3512 #6 0x93d6d5 in convert_perf_probe_events util/probe-event.c:3529 #7 0x56f37f in perf_add_probe_events /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:354 #8 0x572fbc in __cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:738 #9 0x5730f2 in cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:766 #10 0x635d81 in run_builtin /var/home/acme/git/perf-tools-next/tools/perf/perf.c:323 #11 0x6362c1 in handle_internal_command /var/home/acme/git/perf-tools-next/tools/perf/perf.c:377 #12 0x63667a in run_argv /var/home/acme/git/perf-tools-next/tools/perf/perf.c:421 torvalds#13 0x636b8d in main /var/home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#14 0x7f676302950f in __libc_start_call_main (/lib64/libc.so.6+0x2950f) SUMMARY: AddressSanitizer: 193 byte(s) leaked in 2 allocation(s). # synthesize_perf_probe_point() returns a "detachec" strbuf, i.e. a malloc'ed string that needs to be free'd. An audit will be performed to find other such cases. Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/ZM0l1Oxamr4SVjfY@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
While debugging a segfault on 'perf lock contention' without an available perf.data file I noticed that it was basically calling: perf_session__delete(ERR_PTR(-1)) Resulting in: (gdb) run lock contention Starting program: /root/bin/perf lock contention [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". failed to open perf.data: No such file or directory (try 'perf record' first) Initializing perf session failed Program received signal SIGSEGV, Segmentation fault. 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 2858 if (!session->auxtrace) (gdb) p session $1 = (struct perf_session *) 0xffffffffffffffff (gdb) bt #0 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 #1 0x000000000057bb4d in perf_session__delete (session=0xffffffffffffffff) at util/session.c:300 #2 0x000000000047c421 in __cmd_contention (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2161 #3 0x000000000047dc95 in cmd_lock (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2604 #4 0x0000000000501466 in run_builtin (p=0xe597a8 <commands+552>, argc=2, argv=0x7fffffffe200) at perf.c:322 #5 0x00000000005016d5 in handle_internal_command (argc=2, argv=0x7fffffffe200) at perf.c:375 #6 0x0000000000501824 in run_argv (argcp=0x7fffffffe02c, argv=0x7fffffffe020) at perf.c:419 #7 0x0000000000501b11 in main (argc=2, argv=0x7fffffffe200) at perf.c:535 (gdb) So just set it to NULL after using PTR_ERR(session) to decode the error as perf_session__delete(NULL) is supported. The same problem was found in 'perf top' after an audit of all perf_session__new() failure handling. Fixes: 6ef81c5 ("perf session: Return error code for perf_session__new() function on failure") Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jeremie Galarneau <jeremie.galarneau@efficios.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com> Cc: Mukesh Ojha <mojha@codeaurora.org> Cc: Nageswara R Sastry <rnsastry@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Shawn Landden <shawn@git.icu> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tzvetomir Stoyanov <tstoyanov@vmware.com> Link: https://lore.kernel.org/lkml/ZN4Q2rxxsL08A8rd@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
While debugging a segfault on 'perf lock contention' without an available perf.data file I noticed that it was basically calling: perf_session__delete(ERR_PTR(-1)) Resulting in: (gdb) run lock contention Starting program: /root/bin/perf lock contention [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". failed to open perf.data: No such file or directory (try 'perf record' first) Initializing perf session failed Program received signal SIGSEGV, Segmentation fault. 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 2858 if (!session->auxtrace) (gdb) p session $1 = (struct perf_session *) 0xffffffffffffffff (gdb) bt #0 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 #1 0x000000000057bb4d in perf_session__delete (session=0xffffffffffffffff) at util/session.c:300 #2 0x000000000047c421 in __cmd_contention (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2161 #3 0x000000000047dc95 in cmd_lock (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2604 #4 0x0000000000501466 in run_builtin (p=0xe597a8 <commands+552>, argc=2, argv=0x7fffffffe200) at perf.c:322 #5 0x00000000005016d5 in handle_internal_command (argc=2, argv=0x7fffffffe200) at perf.c:375 #6 0x0000000000501824 in run_argv (argcp=0x7fffffffe02c, argv=0x7fffffffe020) at perf.c:419 #7 0x0000000000501b11 in main (argc=2, argv=0x7fffffffe200) at perf.c:535 (gdb) So just set it to NULL after using PTR_ERR(session) to decode the error as perf_session__delete(NULL) is supported. Fixes: eef4fee ("perf lock: Dynamically allocate lockhash_table") Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Ross Zwisler <zwisler@chromium.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Yang Jihong <yangjihong1@huawei.com> Link: https://lore.kernel.org/lkml/ZN4R1AYfsD2J8lRs@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
Hou Tao says: ==================== Fix the unmatched unit_size of bpf_mem_cache From: Hou Tao <houtao1@huawei.com> Hi, The patchset aims to fix the reported warning [0] when the unit_size of bpf_mem_cache is mismatched with the object size of underly slab-cache. Patch #1 fixes the warning by adjusting size_index according to the value of KMALLOC_MIN_SIZE, so bpf_mem_cache with unit_size which is smaller than KMALLOC_MIN_SIZE or is not aligned with KMALLOC_MIN_SIZE will be redirected to bpf_mem_cache with bigger unit_size. Patch #2 doesn't do prefill for these redirected bpf_mem_cache to save memory. Patch #3 adds further error check in bpf_mem_alloc_init() to ensure the unit_size and object_size are always matched and to prevent potential issues due to the mismatch. Please see individual patches for more details. And comments are always welcome. [0]: https://lore.kernel.org/bpf/87jztjmmy4.fsf@all.your.base.are.belong.to.us ==================== Link: https://lore.kernel.org/r/20230908133923.2675053-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
macb_set_tx_clk() is called under a spinlock but itself calls clk_set_rate() which can sleep. This results in: | BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 | pps pps1: new PPS source ptp1 | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 40, name: kworker/u4:3 | preempt_count: 1, expected: 0 | RCU nest depth: 0, expected: 0 | 4 locks held by kworker/u4:3/40: | #0: ffff000003409148 | macb ff0c0000.ethernet: gem-ptp-timer ptp clock registered. | ((wq_completion)events_power_efficient){+.+.}-{0:0}, at: process_one_work+0x14c/0x51c | #1: ffff8000833cbdd8 ((work_completion)(&pl->resolve)){+.+.}-{0:0}, at: process_one_work+0x14c/0x51c | #2: ffff000004f01578 (&pl->state_mutex){+.+.}-{4:4}, at: phylink_resolve+0x44/0x4e8 | #3: ffff000004f06f50 (&bp->lock){....}-{3:3}, at: macb_mac_link_up+0x40/0x2ac | irq event stamp: 113998 | hardirqs last enabled at (113997): [<ffff800080e8503c>] _raw_spin_unlock_irq+0x30/0x64 | hardirqs last disabled at (113998): [<ffff800080e84478>] _raw_spin_lock_irqsave+0xac/0xc8 | softirqs last enabled at (113608): [<ffff800080010630>] __do_softirq+0x430/0x4e4 | softirqs last disabled at (113597): [<ffff80008001614c>] ____do_softirq+0x10/0x1c | CPU: 0 PID: 40 Comm: kworker/u4:3 Not tainted 6.5.0-11717-g9355ce8b2f50-dirty torvalds#368 | Hardware name: ... ZynqMP ... (DT) | Workqueue: events_power_efficient phylink_resolve | Call trace: | dump_backtrace+0x98/0xf0 | show_stack+0x18/0x24 | dump_stack_lvl+0x60/0xac | dump_stack+0x18/0x24 | __might_resched+0x144/0x24c | __might_sleep+0x48/0x98 | __mutex_lock+0x58/0x7b0 | mutex_lock_nested+0x24/0x30 | clk_prepare_lock+0x4c/0xa8 | clk_set_rate+0x24/0x8c | macb_mac_link_up+0x25c/0x2ac | phylink_resolve+0x178/0x4e8 | process_one_work+0x1ec/0x51c | worker_thread+0x1ec/0x3e4 | kthread+0x120/0x124 | ret_from_fork+0x10/0x20 The obvious fix is to move the call to macb_set_tx_clk() out of the protected area. This seems safe as rx and tx are both disabled anyway at this point. It is however not entirely clear what the spinlock shall protect. It could be the read-modify-write access to the NCFGR register, but this is accessed in macb_set_rx_mode() and macb_set_rxcsum_feature() as well without holding the spinlock. It could also be the register accesses done in mog_init_rings() or macb_init_buffers(), but again these functions are called without holding the spinlock in macb_hresp_error_task(). The locking seems fishy in this driver and it might deserve another look before this patch is applied. Fixes: 633e98a ("net: macb: use resolved link config in mac_link_up()") Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Link: https://lore.kernel.org/r/20230908112913.1701766-1-s.hauer@pengutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
Jiri Pirko says: ==================== expose devlink instances relationships From: Jiri Pirko <jiri@nvidia.com> Currently, the user can instantiate new SF using "devlink port add" command. That creates an E-switch representor devlink port. When user activates this SF, there is an auxiliary device created and probed for it which leads to SF devlink instance creation. There is 1:1 relationship between E-switch representor devlink port and the SF auxiliary device devlink instance. Also, for example in mlx5, one devlink instance is created for PCI device and one is created for an auxiliary device that represents the uplink port. The relation between these is invisible to the user. Patches #1-#3 and #5 are small preparations. Patch #4 adds netnsid attribute for nested devlink if that in a different namespace. Patch #5 is the main one in this set, introduces the relationship tracking infrastructure later on used to track SFs, linecards and devlink instance relationships with nested devlink instances. Expose the relation to the user by introducing new netlink attribute DEVLINK_PORT_FN_ATTR_DEVLINK which contains the devlink instance related to devlink port function. This is done by patch #8. Patch #9 implements this in mlx5 driver. Patch #10 converts the linecard nested devlink handling to the newly introduced rel infrastructure. Patch #11 benefits from the rel infra and introduces possiblitily to have relation between devlink instances. Patch #12 implements this in mlx5 driver. Examples: $ devlink dev pci/0000:08:00.0: nested_devlink auxiliary/mlx5_core.eth.0 pci/0000:08:00.1: nested_devlink auxiliary/mlx5_core.eth.1 auxiliary/mlx5_core.eth.1 auxiliary/mlx5_core.eth.0 $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 106 pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached roce enable $ devlink port function set pci/0000:08:00.0/32768 state active $ devlink port show pci/0000:08:00.0/32768 pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false function: hw_addr 00:00:00:00:00:00 state active opstate attached roce enable nested_devlink auxiliary/mlx5_core.sf.2 $ devlink port show pci/0000:08:00.0/32768 pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false function: hw_addr 00:00:00:00:00:00 state active opstate attached roce enable nested_devlink auxiliary/mlx5_core.sf.2 nested_devlink_netns ns1 ==================== Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
Syzkaller reported a sleep in atomic context bug relating to the HASHCHK handler logic: BUG: sleeping function called from invalid context at arch/powerpc/kernel/traps.c:1518 in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 25040, name: syz-executor preempt_count: 0, expected: 0 RCU nest depth: 0, expected: 0 no locks held by syz-executor/25040. irq event stamp: 34 hardirqs last enabled at (33): [<c000000000048b38>] prep_irq_for_enabled_exit arch/powerpc/kernel/interrupt.c:56 [inline] hardirqs last enabled at (33): [<c000000000048b38>] interrupt_exit_user_prepare_main+0x148/0x600 arch/powerpc/kernel/interrupt.c:230 hardirqs last disabled at (34): [<c00000000003e6a4>] interrupt_enter_prepare+0x144/0x4f0 arch/powerpc/include/asm/interrupt.h:176 softirqs last enabled at (0): [<c000000000281954>] copy_process+0x16e4/0x4750 kernel/fork.c:2436 softirqs last disabled at (0): [<0000000000000000>] 0x0 CPU: 15 PID: 25040 Comm: syz-executor Not tainted 6.5.0-rc5-00001-g3ccdff6bb06d #3 Hardware name: IBM,9105-22A POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1040.00 (NL1040_021) hv:phyp pSeries Call Trace: [c0000000a8247ce0] [c00000000032b0e4] __might_resched+0x3b4/0x400 kernel/sched/core.c:10189 [c0000000a8247d80] [c0000000008c7dc8] __might_fault+0xa8/0x170 mm/memory.c:5853 [c0000000a8247dc0] [c00000000004160c] do_program_check+0x32c/0xb20 arch/powerpc/kernel/traps.c:1518 [c0000000a8247e50] [c000000000009b2c] program_check_common_virt+0x3bc/0x3c0 To determine if a trap was caused by a HASHCHK instruction, we inspect the user instruction that triggered the trap. However this may sleep if the page needs to be faulted in (get_user_instr() reaches __get_user(), which calls might_fault() and triggers the bug message). Move the HASHCHK handler logic to after we allow IRQs, which is fine because we are only interested in HASHCHK if it's a user space trap. Fixes: 5bcba4e ("powerpc/dexcr: Handle hashchk exception") Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230915034604.45393-1-bgray@linux.ibm.com
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
Sebastian Andrzej Siewior says: ==================== net: hsr: Properly parse HSRv1 supervisor frames. this is a follow-up to https://lore.kernel.org/all/20230825153111.228768-1-lukma@denx.de/ replacing https://lore.kernel.org/all/20230914124731.1654059-1-lukma@denx.de/ by grabing/ adding tags and reposting with a commit message plus a missing __packed to a struct (#2) plus extending the testsuite to sover HSRv1 which is what broke here (#3-#5). HSRv0 is (was) not affected. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Oct 6, 2023
Fix an error detected by memory sanitizer: ``` ==4033==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x55fb0fbedfc7 in read_alias_info tools/perf/util/pmu.c:457:6 #1 0x55fb0fbea339 in check_info_data tools/perf/util/pmu.c:1434:2 #2 0x55fb0fbea339 in perf_pmu__check_alias tools/perf/util/pmu.c:1504:9 #3 0x55fb0fbdca85 in parse_events_add_pmu tools/perf/util/parse-events.c:1429:32 #4 0x55fb0f965230 in parse_events_parse tools/perf/util/parse-events.y:299:6 #5 0x55fb0fbdf6b2 in parse_events__scanner tools/perf/util/parse-events.c:1822:8 #6 0x55fb0fbdf8c1 in __parse_events tools/perf/util/parse-events.c:2094:8 #7 0x55fb0fa8ffa9 in parse_events tools/perf/util/parse-events.h:41:9 #8 0x55fb0fa8ffa9 in test_event tools/perf/tests/parse-events.c:2393:8 #9 0x55fb0fa8f458 in test__pmu_events tools/perf/tests/parse-events.c:2551:15 #10 0x55fb0fa6d93f in run_test tools/perf/tests/builtin-test.c:242:9 #11 0x55fb0fa6d93f in test_and_print tools/perf/tests/builtin-test.c:271:8 #12 0x55fb0fa6d082 in __cmd_test tools/perf/tests/builtin-test.c:442:5 torvalds#13 0x55fb0fa6d082 in cmd_test tools/perf/tests/builtin-test.c:564:9 torvalds#14 0x55fb0f942720 in run_builtin tools/perf/perf.c:322:11 torvalds#15 0x55fb0f942486 in handle_internal_command tools/perf/perf.c:375:8 torvalds#16 0x55fb0f941dab in run_argv tools/perf/perf.c:419:2 torvalds#17 0x55fb0f941dab in main tools/perf/perf.c:535:3 ``` Fixes: 7b723db ("perf pmu: Be lazy about loading event info files from sysfs") Signed-off-by: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20230914022425.1489035-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Oct 6, 2023
The following call trace shows a deadlock issue due to recursive locking of mutex "device_mutex". First lock acquire is in target_for_each_device() and second in target_free_device(). PID: 148266 TASK: ffff8be21ffb5d00 CPU: 10 COMMAND: "iscsi_ttx" #0 [ffffa2bfc9ec3b18] __schedule at ffffffffa8060e7f #1 [ffffa2bfc9ec3ba0] schedule at ffffffffa8061224 #2 [ffffa2bfc9ec3bb8] schedule_preempt_disabled at ffffffffa80615ee #3 [ffffa2bfc9ec3bc8] __mutex_lock at ffffffffa8062fd7 #4 [ffffa2bfc9ec3c40] __mutex_lock_slowpath at ffffffffa80631d3 #5 [ffffa2bfc9ec3c50] mutex_lock at ffffffffa806320c #6 [ffffa2bfc9ec3c68] target_free_device at ffffffffc0935998 [target_core_mod] #7 [ffffa2bfc9ec3c90] target_core_dev_release at ffffffffc092f975 [target_core_mod] #8 [ffffa2bfc9ec3ca0] config_item_put at ffffffffa79d250f #9 [ffffa2bfc9ec3cd0] config_item_put at ffffffffa79d2583 #10 [ffffa2bfc9ec3ce0] target_devices_idr_iter at ffffffffc0933f3a [target_core_mod] #11 [ffffa2bfc9ec3d00] idr_for_each at ffffffffa803f6fc #12 [ffffa2bfc9ec3d60] target_for_each_device at ffffffffc0935670 [target_core_mod] torvalds#13 [ffffa2bfc9ec3d98] transport_deregister_session at ffffffffc0946408 [target_core_mod] torvalds#14 [ffffa2bfc9ec3dc8] iscsit_close_session at ffffffffc09a44a6 [iscsi_target_mod] torvalds#15 [ffffa2bfc9ec3df0] iscsit_close_connection at ffffffffc09a4a88 [iscsi_target_mod] torvalds#16 [ffffa2bfc9ec3df8] finish_task_switch at ffffffffa76e5d07 torvalds#17 [ffffa2bfc9ec3e78] iscsit_take_action_for_connection_exit at ffffffffc0991c23 [iscsi_target_mod] torvalds#18 [ffffa2bfc9ec3ea0] iscsi_target_tx_thread at ffffffffc09a403b [iscsi_target_mod] torvalds#19 [ffffa2bfc9ec3f08] kthread at ffffffffa76d8080 torvalds#20 [ffffa2bfc9ec3f50] ret_from_fork at ffffffffa8200364 Fixes: 36d4cb4 ("scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Link: https://lore.kernel.org/r/20230918225848.66463-1-junxiao.bi@oracle.com Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
borkmann
pushed a commit
that referenced
this pull request
Oct 6, 2023
Petr Machata says: ==================== mlxsw: Provide enhancements and new feature Vadim Pasternak writes: Patch #1 - Optimize transaction size for efficient retrieval of module data. Patch #3 - Enable thermal zone binding with new cooling device. Patch #4 - Employ standard macros for dividing buffer into the chunks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Oct 17, 2023
Fix the deadlock by refactoring the MR cache cleanup flow to flush the workqueue without holding the rb_lock. This adds a race between cache cleanup and creation of new entries which we solve by denied creation of new entries after cache cleanup started. Lockdep: WARNING: possible circular locking dependency detected [ 2785.326074 ] 6.2.0-rc6_for_upstream_debug_2023_01_31_14_02 #1 Not tainted [ 2785.339778 ] ------------------------------------------------------ [ 2785.340848 ] devlink/53872 is trying to acquire lock: [ 2785.341701 ] ffff888124f8c0c8 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0xc8/0x900 [ 2785.343403 ] [ 2785.343403 ] but task is already holding lock: [ 2785.344464 ] ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] [ 2785.346273 ] [ 2785.346273 ] which lock already depends on the new lock. [ 2785.346273 ] [ 2785.347720 ] [ 2785.347720 ] the existing dependency chain (in reverse order) is: [ 2785.349003 ] [ 2785.349003 ] -> #1 (&dev->cache.rb_lock){+.+.}-{3:3}: [ 2785.350160 ] __mutex_lock+0x14c/0x15c0 [ 2785.350962 ] delayed_cache_work_func+0x2d1/0x610 [mlx5_ib] [ 2785.352044 ] process_one_work+0x7c2/0x1310 [ 2785.352879 ] worker_thread+0x59d/0xec0 [ 2785.353636 ] kthread+0x28f/0x330 [ 2785.354370 ] ret_from_fork+0x1f/0x30 [ 2785.355135 ] [ 2785.355135 ] -> #0 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}: [ 2785.356515 ] __lock_acquire+0x2d8a/0x5fe0 [ 2785.357349 ] lock_acquire+0x1c1/0x540 [ 2785.358121 ] __flush_work+0xe8/0x900 [ 2785.358852 ] __cancel_work_timer+0x2c7/0x3f0 [ 2785.359711 ] mlx5_mkey_cache_cleanup+0xfb/0x250 [mlx5_ib] [ 2785.360781 ] mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x16/0x30 [mlx5_ib] [ 2785.361969 ] __mlx5_ib_remove+0x68/0x120 [mlx5_ib] [ 2785.362960 ] mlx5r_remove+0x63/0x80 [mlx5_ib] [ 2785.363870 ] auxiliary_bus_remove+0x52/0x70 [ 2785.364715 ] device_release_driver_internal+0x3c1/0x600 [ 2785.365695 ] bus_remove_device+0x2a5/0x560 [ 2785.366525 ] device_del+0x492/0xb80 [ 2785.367276 ] mlx5_detach_device+0x1a9/0x360 [mlx5_core] [ 2785.368615 ] mlx5_unload_one_devl_locked+0x5a/0x110 [mlx5_core] [ 2785.369934 ] mlx5_devlink_reload_down+0x292/0x580 [mlx5_core] [ 2785.371292 ] devlink_reload+0x439/0x590 [ 2785.372075 ] devlink_nl_cmd_reload+0xaef/0xff0 [ 2785.372973 ] genl_family_rcv_msg_doit.isra.0+0x1bd/0x290 [ 2785.374011 ] genl_rcv_msg+0x3ca/0x6c0 [ 2785.374798 ] netlink_rcv_skb+0x12c/0x360 [ 2785.375612 ] genl_rcv+0x24/0x40 [ 2785.376295 ] netlink_unicast+0x438/0x710 [ 2785.377121 ] netlink_sendmsg+0x7a1/0xca0 [ 2785.377926 ] sock_sendmsg+0xc5/0x190 [ 2785.378668 ] __sys_sendto+0x1bc/0x290 [ 2785.379440 ] __x64_sys_sendto+0xdc/0x1b0 [ 2785.380255 ] do_syscall_64+0x3d/0x90 [ 2785.381031 ] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 2785.381967 ] [ 2785.381967 ] other info that might help us debug this: [ 2785.381967 ] [ 2785.383448 ] Possible unsafe locking scenario: [ 2785.383448 ] [ 2785.384544 ] CPU0 CPU1 [ 2785.385383 ] ---- ---- [ 2785.386193 ] lock(&dev->cache.rb_lock); [ 2785.386940 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.388327 ] lock(&dev->cache.rb_lock); [ 2785.389425 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.390414 ] [ 2785.390414 ] *** DEADLOCK *** [ 2785.390414 ] [ 2785.391579 ] 6 locks held by devlink/53872: [ 2785.392341 ] #0: ffffffff84c17a50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 [ 2785.393630 ] #1: ffff888142280218 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0x12d/0x2d0 [ 2785.395324 ] #2: ffff8881422d3c38 (&dev->lock_key){+.+.}-{3:3}, at: mlx5_unload_one_devl_locked+0x4a/0x110 [mlx5_core] [ 2785.397322 ] #3: ffffffffa0e59068 (mlx5_intf_mutex){+.+.}-{3:3}, at: mlx5_detach_device+0x60/0x360 [mlx5_core] [ 2785.399231 ] #4: ffff88810e3cb0e8 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x8d/0x600 [ 2785.400864 ] #5: ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] Fixes: b958451 ("RDMA/mlx5: Change the cache structure to an RB-tree") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
borkmann
pushed a commit
that referenced
this pull request
Oct 17, 2023
Amit Cohen says: ==================== Extend VXLAN driver to support FDB flushing The merge commit 9271686 ("Merge branch 'br-flush-filtering'") added support for FDB flushing in bridge driver. Extend VXLAN driver to support FDB flushing also. Add support for filtering by fields which are relevant for VXLAN FDBs: * Source VNI * Nexthop ID * 'router' flag * Destination VNI * Destination Port * Destination IP Without this set, flush for VXLAN device fails: $ bridge fdb flush dev vx10 RTNETLINK answers: Operation not supported With this set, such flush works with the relevant arguments, for example: $ bridge fdb flush dev vx10 vni 5000 dst 193.2.2.1 < flush all vx10 entries with VNI 5000 and destination IP 193.2.2.1> Some preparations are required, handle them before adding flushing support in VXLAN driver. See more details in commit messages. Patch set overview: Patch #1 prepares flush policy to be used by VXLAN driver Patches #2-#3 are preparations in VXLAN driver Patch #4 adds an initial support for flushing in VXLAN driver Patches #5-#9 add support for filtering by several attributes Patch #10 adds a test for FDB flush with VXLAN Patch #11 extends the test to check FDB flush with bridge ==================== Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
added a commit
that referenced
this pull request
Oct 17, 2023
Andrii Nakryiko says: ==================== This patch set fixes ambiguity in BPF verifier log output of SCALAR register in the parts that emit umin/umax, smin/smax, etc ranges. See patch #4 for details. Also, patch #5 fixes an issue with verifier log missing instruction context (state) output for conditionals that trigger precision marking. See details in the patch. First two patches are just improvements to two selftests that are very flaky locally when run in parallel mode. Patch #3 changes 'align' selftest to be less strict about exact verifier log output (which patch #4 changes, breaking lots of align tests as written). Now test does more of a register substate checks, mostly around expected var_off() values. This 'align' selftests is one of the more brittle ones and requires constant adjustment when verifier log output changes, without really catching any new issues. So hopefully these changes can minimize future support efforts for this specific set of tests. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
borkmann
pushed a commit
that referenced
this pull request
Oct 20, 2023
Chuyi Zhou says: ==================== Add Open-coded task, css_task and css iters This is version 6 of task, css_task and css iters support. --- Changelog --- v5 -> v6: Patch #3: * In bpf_iter_task_next, return pos rather than goto out. (Andrii) Patch #2, #3, #4: * Add the missing __diag_ignore_all to avoid kernel build warning Patch #5, #6, #7: * Add Andrii's ack Patch #8: * In BPF prog iter_css_task_for_each, return -EPERM rather than 0, and ensure stack_mprotect() in iters.c not success. If not, it would cause the subsequent 'test_lsm' fail, since the 'is_stack' check in test_int_hook(lsm.c) would not be guaranteed. (https://github.com/kernel-patches/bpf/actions/runs/6489662214/job/17624665086?pr=5790) v4 -> v5:https://lore.kernel.org/lkml/20231007124522.34834-1-zhouchuyi@bytedance.com/ Patch 3~4: * Relax the BUILD_BUG_ON check in bpf_iter_task_new and bpf_iter_css_new to avoid netdev/build_32bit CI error. (https://netdev.bots.linux.dev/static/nipa/790929/13412333/build_32bit/stderr) Patch 8: * Initialize skel pointer to fix the LLVM-16 build CI error (https://github.com/kernel-patches/bpf/actions/runs/6462875618/job/17545170863) v3 -> v4:https://lore.kernel.org/all/20230925105552.817513-1-zhouchuyi@bytedance.com/ * Address all the comments from Andrii in patch-3 ~ patch-6 * Collect Tejun's ack * Add a extra patch to rename bpf_iter_task.c to bpf_iter_tasks.c * Seperate three BPF program files for selftests (iters_task.c iters_css_task.c iters_css.c) v2 -> v3:https://lore.kernel.org/lkml/20230912070149.969939-1-zhouchuyi@bytedance.com/ Patch 1 (cgroup: Prepare for using css_task_iter_*() in BPF) * Add tj's ack and Alexei's suggest-by. Patch 2 (bpf: Introduce css_task open-coded iterator kfuncs) * Use bpf_mem_alloc/bpf_mem_free rather than kzalloc() * Add KF_TRUSTED_ARGS for bpf_iter_css_task_new (Alexei) * Move bpf_iter_css_task's definition from uapi/linux/bpf.h to kernel/bpf/task_iter.c and we can use it from vmlinux.h * Move bpf_iter_css_task_XXX's declaration from bpf_helpers.h to bpf_experimental.h Patch 3 (Introduce task open coded iterator kfuncs) * Change th API design keep consistent with SEC("iter/task"), support iterating all threads(BPF_TASK_ITERATE_ALL) and threads of a specific task (BPF_TASK_ITERATE_THREAD).(Andrii) * Move bpf_iter_task's definition from uapi/linux/bpf.h to kernel/bpf/task_iter.c and we can use it from vmlinux.h * Move bpf_iter_task_XXX's declaration from bpf_helpers.h to bpf_experimental.h Patch 4 (Introduce css open-coded iterator kfuncs) * Change th API design keep consistent with cgroup_iters, reuse BPF_CGROUP_ITER_DESCENDANTS_PRE/BPF_CGROUP_ITER_DESCENDANTS_POST /BPF_CGROUP_ITER_ANCESTORS_UP(Andrii) * Add KF_TRUSTED_ARGS for bpf_iter_css_new * Move bpf_iter_css's definition from uapi/linux/bpf.h to kernel/bpf/task_iter.c and we can use it from vmlinux.h * Move bpf_iter_css_XXX's declaration from bpf_helpers.h to bpf_experimental.h Patch 5 (teach the verifier to enforce css_iter and task_iter in RCU CS) * Add KF flag KF_RCU_PROTECTED to maintain kfuncs which need RCU CS.(Andrii) * Consider STACK_ITER when using bpf_for_each_spilled_reg. Patch 6 (Let bpf_iter_task_new accept null task ptr) * Add this extra patch to let bpf_iter_task_new accept a 'nullable' * task pointer(Andrii) Patch 7 (selftests/bpf: Add tests for open-coded task and css iter) * Add failure testcase(Alexei) Changes from v1(https://lore.kernel.org/lkml/20230827072057.1591929-1-zhouchuyi@bytedance.com/): - Add a pre-patch to make some preparations before supporting css_task iters.(Alexei) - Add an allowlist for css_task iters(Alexei) - Let bpf progs do explicit bpf_rcu_read_lock() when using process iters and css_descendant iters.(Alexei) --------------------- In some BPF usage scenarios, it will be useful to iterate the process and css directly in the BPF program. One of the expected scenarios is customizable OOM victim selection via BPF[1]. Inspired by Dave's task_vma iter[2], this patchset adds three types of open-coded iterator kfuncs: 1. bpf_task_iters. It can be used to 1) iterate all process in the system, like for_each_forcess() in kernel. 2) iterate all threads in the system. 3) iterate all threads of a specific task 2. bpf_css_iters. It works like css_task_iter_{start, next, end} and would be used to iterating tasks/threads under a css. 3. css_iters. It works like css_next_descendant_{pre, post} to iterating all descendant css. BPF programs can use these kfuncs directly or through bpf_for_each macro. link[1]: https://lore.kernel.org/lkml/20230810081319.65668-1-zhouchuyi@bytedance.com/ link[2]: https://lore.kernel.org/all/20230810183513.684836-1-davemarchevsky@fb.com/ ==================== Link: https://lore.kernel.org/r/20231018061746.111364-1-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Oct 23, 2023
Hou Tao says: ==================== bpf: Fixes for per-cpu kptr From: Hou Tao <houtao1@huawei.com> Hi, The patchset aims to fix the problems found in the review of per-cpu kptr patch-set [0]. Patch #1 moves pcpu_lock after the invocation of pcpu_chunk_addr_search() and it is a micro-optimization for free_percpu(). The reason includes it in the patch is that the same logic is used in newly-added API pcpu_alloc_size(). Patch #2 introduces pcpu_alloc_size() for dynamic per-cpu area. Patch #2 and #3 use pcpu_alloc_size() to check whether or not unit_size matches with the size of underlying per-cpu area and to select a matching bpf_mem_cache. Patch #4 fixes the freeing of per-cpu kptr when these kptrs are freed by map destruction. The last patch adds test cases for these problems. Please see individual patches for details. And comments are always welcome. Change Log: v3: * rebased on bpf-next * patch 2: update API document to note that pcpu_alloc_size() doesn't support statically allocated per-cpu area. (Dennis) * patch 1 & 2: add Acked-by from Dennis v2: https://lore.kernel.org/bpf/20231018113343.2446300-1-houtao@huaweicloud.com/ * add a new patch "don't acquire pcpu_lock for pcpu_chunk_addr_search()" * patch 2: change type of bit_off and end to unsigned long (Andrew) * patch 2: rename the new API as pcpu_alloc_size and follow 80-column convention (Dennis) * patch 5: move the common declaration into bpf.h (Stanislav, Alxei) v1: https://lore.kernel.org/bpf/20231007135106.3031284-1-houtao@huaweicloud.com/ [0]: https://lore.kernel.org/bpf/20230827152729.1995219-1-yonghong.song@linux.dev ==================== Link: https://lore.kernel.org/r/20231020133202.4043247-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Oct 24, 2023
Eduard Zingerman says: ==================== exact states comparison for iterator convergence checks Iterator convergence logic in is_state_visited() uses state_equals() for states with branches counter > 0 to check if iterator based loop converges. This is not fully correct because state_equals() relies on presence of read and precision marks on registers. These marks are not guaranteed to be finalized while state has branches. Commit message for patch #3 describes a program that exhibits such behavior. This patch-set aims to fix iterator convergence logic by adding notion of exact states comparison. Exact comparison does not rely on presence of read or precision marks and thus is more strict. As explained in commit message for patch #3 exact comparisons require addition of speculative register bounds widening. The end result for BPF verifier users could be summarized as follows: (!) After this update verifier would reject programs that conjure an imprecise value on the first loop iteration and use it as precise on the second (for iterator based loops). I urge people to at least skim over the commit message for patch #3. Patches are organized as follows: - patches #1,2: moving/extracting utility functions; - patch #3: introduces exact mode for states comparison and adds widening heuristic; - patch #4: adds test-cases that demonstrate why the series is necessary; - patch #5: extends patch #3 with a notion of state loop entries, these entries have to be tracked to correctly identify that different verifier states belong to the same states loop; - patch #6: adds a test-case that demonstrates a program which requires loop entry tracking for correct verification; - patch #7: just adds a few debug prints. The following actions are planned as a followup for this patch-set: - implementation has to be adapted for callbacks handling logic as a part of a fix for [1]; - it is necessary to explore ways to improve widening heuristic to handle iters_task_vma test w/o need to insert barrier_var() calls; - explored states eviction logic on cache miss has to be extended to either: - allow eviction of checkpoint states -or- - be sped up in case if there are many active checkpoints associated with the same instruction. The patch-set is a followup for mailing list discussion [1]. Changelog: - V2 [3] -> V3: - correct check for stack spills in widen_imprecise_scalars(), added test case progs/iters.c:widen_spill to check the behavior (suggested by Andrii); - allow eviction of checkpoint states in is_state_visited() to avoid pathological verifier performance when iterator based loop does not converge (discussion with Alexei). - V1 [2] -> V2, applied changes suggested by Alexei offlist: - __explored_state() function removed; - same_callsites() function is now used in clean_live_states(); - patches #1,2 are added as preparatory code movement; - in process_iter_next_call() a safeguard is added to verify that cur_st->parent exists and has expected insn index / call sites. [1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/ [2] https://lore.kernel.org/bpf/20231021005939.1041-1-eddyz87@gmail.com/ [3] https://lore.kernel.org/bpf/20231022010812.9201-1-eddyz87@gmail.com/ ==================== Link: https://lore.kernel.org/r/20231024000917.12153-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Oct 27, 2023
Lockdep reports following issue: WARNING: possible circular locking dependency detected ------------------------------------------------------ devlink/8191 is trying to acquire lock: ffff88813f32c250 (&devlink->lock_key#14){+.+.}-{3:3}, at: devlink_rel_devlink_handle_put+0x11e/0x2d0 but task is already holding lock: ffffffff8511eca8 (rtnl_mutex){+.+.}-{3:3}, at: unregister_netdev+0xe/0x20 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (rtnl_mutex){+.+.}-{3:3}: lock_acquire+0x1c3/0x500 __mutex_lock+0x14c/0x1b20 register_netdevice_notifier_net+0x13/0x30 mlx5_lag_add_mdev+0x51c/0xa00 [mlx5_core] mlx5_load+0x222/0xc70 [mlx5_core] mlx5_init_one_devl_locked+0x4a0/0x1310 [mlx5_core] mlx5_init_one+0x3b/0x60 [mlx5_core] probe_one+0x786/0xd00 [mlx5_core] local_pci_probe+0xd7/0x180 pci_device_probe+0x231/0x720 really_probe+0x1e4/0xb60 __driver_probe_device+0x261/0x470 driver_probe_device+0x49/0x130 __driver_attach+0x215/0x4c0 bus_for_each_dev+0xf0/0x170 bus_add_driver+0x21d/0x590 driver_register+0x133/0x460 vdpa_match_remove+0x89/0xc0 [vdpa] do_one_initcall+0xc4/0x360 do_init_module+0x22d/0x760 load_module+0x51d7/0x6750 init_module_from_file+0xd2/0x130 idempotent_init_module+0x326/0x5a0 __x64_sys_finit_module+0xc1/0x130 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 -> #2 (mlx5_intf_mutex){+.+.}-{3:3}: lock_acquire+0x1c3/0x500 __mutex_lock+0x14c/0x1b20 mlx5_register_device+0x3e/0xd0 [mlx5_core] mlx5_init_one_devl_locked+0x8fa/0x1310 [mlx5_core] mlx5_devlink_reload_up+0x147/0x170 [mlx5_core] devlink_reload+0x203/0x380 devlink_nl_cmd_reload+0xb84/0x10e0 genl_family_rcv_msg_doit+0x1cc/0x2a0 genl_rcv_msg+0x3c9/0x670 netlink_rcv_skb+0x12c/0x360 genl_rcv+0x24/0x40 netlink_unicast+0x435/0x6f0 netlink_sendmsg+0x7a0/0xc70 sock_sendmsg+0xc5/0x190 __sys_sendto+0x1c8/0x290 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 -> #1 (&dev->lock_key#8){+.+.}-{3:3}: lock_acquire+0x1c3/0x500 __mutex_lock+0x14c/0x1b20 mlx5_init_one_devl_locked+0x45/0x1310 [mlx5_core] mlx5_devlink_reload_up+0x147/0x170 [mlx5_core] devlink_reload+0x203/0x380 devlink_nl_cmd_reload+0xb84/0x10e0 genl_family_rcv_msg_doit+0x1cc/0x2a0 genl_rcv_msg+0x3c9/0x670 netlink_rcv_skb+0x12c/0x360 genl_rcv+0x24/0x40 netlink_unicast+0x435/0x6f0 netlink_sendmsg+0x7a0/0xc70 sock_sendmsg+0xc5/0x190 __sys_sendto+0x1c8/0x290 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 -> #0 (&devlink->lock_key#14){+.+.}-{3:3}: check_prev_add+0x1af/0x2300 __lock_acquire+0x31d7/0x4eb0 lock_acquire+0x1c3/0x500 __mutex_lock+0x14c/0x1b20 devlink_rel_devlink_handle_put+0x11e/0x2d0 devlink_nl_port_fill+0xddf/0x1b00 devlink_port_notify+0xb5/0x220 __devlink_port_type_set+0x151/0x510 devlink_port_netdevice_event+0x17c/0x220 notifier_call_chain+0x97/0x240 unregister_netdevice_many_notify+0x876/0x1790 unregister_netdevice_queue+0x274/0x350 unregister_netdev+0x18/0x20 mlx5e_vport_rep_unload+0xc5/0x1c0 [mlx5_core] __esw_offloads_unload_rep+0xd8/0x130 [mlx5_core] mlx5_esw_offloads_rep_unload+0x52/0x70 [mlx5_core] mlx5_esw_offloads_unload_rep+0x85/0xc0 [mlx5_core] mlx5_eswitch_unload_sf_vport+0x41/0x90 [mlx5_core] mlx5_devlink_sf_port_del+0x120/0x280 [mlx5_core] genl_family_rcv_msg_doit+0x1cc/0x2a0 genl_rcv_msg+0x3c9/0x670 netlink_rcv_skb+0x12c/0x360 genl_rcv+0x24/0x40 netlink_unicast+0x435/0x6f0 netlink_sendmsg+0x7a0/0xc70 sock_sendmsg+0xc5/0x190 __sys_sendto+0x1c8/0x290 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 other info that might help us debug this: Chain exists of: &devlink->lock_key#14 --> mlx5_intf_mutex --> rtnl_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(rtnl_mutex); lock(mlx5_intf_mutex); lock(rtnl_mutex); lock(&devlink->lock_key#14); Problem is taking the devlink instance lock of nested instance when RTNL is already held. To fix this, don't take the devlink instance lock when putting nested handle. Instead, rely on the preparations done by previous two patches to be able to access device pointer and obtain netns id without devlink instance lock held. Fixes: c137743 ("devlink: introduce object and nested devlink relationship infra") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Oct 27, 2023
Jiri Pirko says: ==================== devlink: fix a deadlock when taking devlink instance lock while holding RTNL lock devlink_port_fill() may be called sometimes with RTNL lock held. When putting the nested port function devlink instance attrs, current code takes nested devlink instance lock. In that case lock ordering is wrong. Patch #1 is a dependency of patch #2. Patch #2 converts the peernet2id_alloc() call to rely in RCU so it could called without devlink instance lock. Patch #3 takes device reference for devlink instance making sure that device does not disappear before devlink_release() is called. Patch #4 benefits from the preparations done in patches #2 and #3 and removes the problematic nested devlink lock aquisition. Patched #5-#7 improve documentation to reflect this issue so it is avoided in the future. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.