forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 1
[CI run] selftests/bpf: Don't assign outer source IP to host #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This commit extends the ip_tunnel_key struct with a new field for the flow flags, to pass them to the route lookups. This new field will be populated and used in subsequent commits. Signed-off-by: Paul Chaignon <paul@isovalent.com>
Use the new ip_tunnel_key field with the flow flags in the route lookups for the encapsulated packet. This will be used by the bpf_skb_set_tunnel_key helper in a subsequent commit. Signed-off-by: Paul Chaignon <paul@isovalent.com>
Use the new ip_tunnel_key field with the flow flags in the route lookups for the encapsulated packet. This will be used by the bpf_skb_set_tunnel_key helper in the subsequent commit. Signed-off-by: Paul Chaignon <paul@isovalent.com>
Commit 26101f5 ("bpf: Add source ip in "struct bpf_tunnel_key"") added support for getting and setting the outer source IP of encapsulated packets via the bpf_skb_{get,set}_tunnel_key BPF helper. This change allows BPF programs to set any IP address as the source, including for example the IP address of a container running on the same host. In that last case, however, the encapsulated packets are dropped when looking up the route because the source IP address isn't assigned to any interface on the host. To avoid this, we need to set the FLOWI_FLAG_ANYSRC flag. Fixes: 26101f5 ("bpf: Add source ip in "struct bpf_tunnel_key"") Signed-off-by: Paul Chaignon <paul@isovalent.com>
The previous commit fixed a bug in the bpf_skb_set_tunnel_key helper to avoid dropping packets whose outer source IP address isn't assigned to a host interface. This commit changes the corresponding selftest to not assign the outer source IP address to an interface. With this change and without the bugfix, the ICMP echo packets sent as part of the test are dropped. Signed-off-by: Paul Chaignon <paul@isovalent.com>
borkmann
pushed a commit
that referenced
this pull request
Apr 30, 2023
This is a partial revert of the blamed commit, with a relevant change: mptcp_subflow_queue_clean() now just change the msk socket status and stop the worker, so that the UaF issue addressed by the blamed commit is not re-introduced. The above prevents the mptcp worker from running concurrently with inet_csk_listen_stop(), as such race would trigger a warning, as reported by Christoph: RSP: 002b:00007f784fe09cd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e WARNING: CPU: 0 PID: 25807 at net/ipv4/inet_connection_sock.c:1387 inet_csk_listen_stop+0x664/0x870 net/ipv4/inet_connection_sock.c:1387 RAX: ffffffffffffffda RBX: 00000000006bc050 RCX: 00007f7850afd6a9 RDX: 0000000000000000 RSI: 0000000020000340 RDI: 0000000000000004 Modules linked in: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006bc05c R13: fffffffffffffea8 R14: 00000000006bc050 R15: 000000000001fe40 </TASK> CPU: 0 PID: 25807 Comm: syz-executor.7 Not tainted 6.2.0-g778e54711659 #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:inet_csk_listen_stop+0x664/0x870 net/ipv4/inet_connection_sock.c:1387 RAX: 0000000000000000 RBX: ffff888100dfbd40 RCX: 0000000000000000 RDX: ffff8881363aab80 RSI: ffffffff81c494f4 RDI: 0000000000000005 RBP: ffff888126dad080 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff888100dfe040 R13: 0000000000000001 R14: 0000000000000000 R15: ffff888100dfbdd8 FS: 00007f7850a2c800(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b32d26000 CR3: 000000012fdd8006 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> __tcp_close+0x5b2/0x620 net/ipv4/tcp.c:2875 __mptcp_close_ssk+0x145/0x3d0 net/mptcp/protocol.c:2427 mptcp_destroy_common+0x8a/0x1c0 net/mptcp/protocol.c:3277 mptcp_destroy+0x41/0x60 net/mptcp/protocol.c:3304 __mptcp_destroy_sock+0x56/0x140 net/mptcp/protocol.c:2965 __mptcp_close+0x38f/0x4a0 net/mptcp/protocol.c:3057 mptcp_close+0x24/0xe0 net/mptcp/protocol.c:3072 inet_release+0x53/0xa0 net/ipv4/af_inet.c:429 __sock_release+0x4e/0xf0 net/socket.c:651 sock_close+0x15/0x20 net/socket.c:1393 __fput+0xff/0x420 fs/file_table.c:321 task_work_run+0x8b/0xe0 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x113/0x120 kernel/entry/common.c:203 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x1d/0x40 kernel/entry/common.c:296 do_syscall_64+0x46/0x90 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f7850af70dc RAX: 0000000000000000 RBX: 0000000000000004 RCX: 00007f7850af70dc RDX: 00007f7850a2c800 RSI: 0000000000000002 RDI: 0000000000000003 RBP: 00000000006bd980 R08: 0000000000000000 R09: 00000000000018a0 R10: 00000000316338a4 R11: 0000000000000293 R12: 0000000000211e31 R13: 00000000006bc05c R14: 00007f785062c000 R15: 0000000000211af0 Fixes: 0a3f4f1 ("mptcp: fix UaF in listener shutdown") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Link: multipath-tcp/mptcp_net-next#371 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Apr 30, 2023
Ido Schimmel says: ==================== bridge: Add per-{Port, VLAN} neighbor suppression Background ========== In order to minimize the flooding of ARP and ND messages in the VXLAN network, EVPN includes provisions [1] that allow participating VTEPs to suppress such messages in case they know the MAC-IP binding and can reply on behalf of the remote host. In Linux, the above is implemented in the bridge driver using a per-port option called "neigh_suppress" that was added in kernel version 4.15 [2]. Motivation ========== Some applications use ARP messages as keepalives between the application nodes in the network. This works perfectly well when two nodes are connected to the same VTEP. When a node goes down it will stop responding to ARP requests and the other node will notice it immediately. However, when the two nodes are connected to different VTEPs and neighbor suppression is enabled, the local VTEP will reply to ARP requests even after the remote node went down, until certain timers expire and the EVPN control plane decides to withdraw the MAC/IP Advertisement route for the address. Therefore, some users would like to be able to disable neighbor suppression on VLANs where such applications reside and keep it enabled on the rest. Implementation ============== The proposed solution is to allow user space to control neighbor suppression on a per-{Port, VLAN} basis, in a similar fashion to other per-port options that gained per-{Port, VLAN} counterparts such as "mcast_router". This allows users to benefit from the operational simplicity and scalability associated with shared VXLAN devices (i.e., external / collect-metadata mode), while still allowing for per-VLAN/VNI neighbor suppression control. The user interface is extended with a new "neigh_vlan_suppress" bridge port option that allows user space to enable per-{Port, VLAN} neighbor suppression on the bridge port. When enabled, the existing "neigh_suppress" option has no effect and neighbor suppression is controlled using a new "neigh_suppress" VLAN option. Example usage: # bridge link set dev vxlan0 neigh_vlan_suppress on # bridge vlan add vid 10 dev vxlan0 # bridge vlan set vid 10 dev vxlan0 neigh_suppress on Testing ======= Tested using existing bridge selftests. Added a dedicated selftest in the last patch. Patchset overview ================= Patches #1-#5 are preparations. Patch #6 adds per-{Port, VLAN} neighbor suppression support to the bridge's data path. Patches #7-#8 add the required netlink attributes to enable the feature. Patch #9 adds a selftest. iproute2 patches can be found here [3]. Changelog ========= Since RFC [4]: No changes. [1] https://www.rfc-editor.org/rfc/rfc7432#section-10 [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a42317785c898c0ed46db45a33b0cc71b671bf29 [3] https://github.com/idosch/iproute2/tree/submit/neigh_suppress_v1 [4] https://lore.kernel.org/netdev/20230413095830.2182382-1-idosch@nvidia.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
lmb
pushed a commit
that referenced
this pull request
May 23, 2023
In the function ieee80211_tx_dequeue() there is a particular locking sequence: begin: spin_lock(&local->queue_stop_reason_lock); q_stopped = local->queue_stop_reasons[q]; spin_unlock(&local->queue_stop_reason_lock); However small the chance (increased by ftracetest), an asynchronous interrupt can occur in between of spin_lock() and spin_unlock(), and the interrupt routine will attempt to lock the same &local->queue_stop_reason_lock again. This will cause a costly reset of the CPU and the wifi device or an altogether hang in the single CPU and single core scenario. The only remaining spin_lock(&local->queue_stop_reason_lock) that did not disable interrupts was patched, which should prevent any deadlocks on the same CPU/core and the same wifi device. This is the probable trace of the deadlock: kernel: ================================ kernel: WARNING: inconsistent lock state kernel: 6.3.0-rc6-mt-20230401-00001-gf86822a1170f #4 Tainted: G W kernel: -------------------------------- kernel: inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. kernel: kworker/5:0/25656 [HC0[0]:SC0[0]:HE1:SE1] takes: kernel: ffff9d6190779478 (&local->queue_stop_reason_lock){+.?.}-{2:2}, at: return_to_handler+0x0/0x40 kernel: {IN-SOFTIRQ-W} state was registered at: kernel: lock_acquire+0xc7/0x2d0 kernel: _raw_spin_lock+0x36/0x50 kernel: ieee80211_tx_dequeue+0xb4/0x1330 [mac80211] kernel: iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm] kernel: iwl_mvm_mac_wake_tx_queue+0x2d/0xd0 [iwlmvm] kernel: ieee80211_queue_skb+0x450/0x730 [mac80211] kernel: __ieee80211_xmit_fast.constprop.66+0x834/0xa50 [mac80211] kernel: __ieee80211_subif_start_xmit+0x217/0x530 [mac80211] kernel: ieee80211_subif_start_xmit+0x60/0x580 [mac80211] kernel: dev_hard_start_xmit+0xb5/0x260 kernel: __dev_queue_xmit+0xdbe/0x1200 kernel: neigh_resolve_output+0x166/0x260 kernel: ip_finish_output2+0x216/0xb80 kernel: __ip_finish_output+0x2a4/0x4d0 kernel: ip_finish_output+0x2d/0xd0 kernel: ip_output+0x82/0x2b0 kernel: ip_local_out+0xec/0x110 kernel: igmpv3_sendpack+0x5c/0x90 kernel: igmp_ifc_timer_expire+0x26e/0x4e0 kernel: call_timer_fn+0xa5/0x230 kernel: run_timer_softirq+0x27f/0x550 kernel: __do_softirq+0xb4/0x3a4 kernel: irq_exit_rcu+0x9b/0xc0 kernel: sysvec_apic_timer_interrupt+0x80/0xa0 kernel: asm_sysvec_apic_timer_interrupt+0x1f/0x30 kernel: _raw_spin_unlock_irqrestore+0x3f/0x70 kernel: free_to_partial_list+0x3d6/0x590 kernel: __slab_free+0x1b7/0x310 kernel: kmem_cache_free+0x52d/0x550 kernel: putname+0x5d/0x70 kernel: do_sys_openat2+0x1d7/0x310 kernel: do_sys_open+0x51/0x80 kernel: __x64_sys_openat+0x24/0x30 kernel: do_syscall_64+0x5c/0x90 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc kernel: irq event stamp: 5120729 kernel: hardirqs last enabled at (5120729): [<ffffffff9d149936>] trace_graph_return+0xd6/0x120 kernel: hardirqs last disabled at (5120728): [<ffffffff9d149950>] trace_graph_return+0xf0/0x120 kernel: softirqs last enabled at (5069900): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40 kernel: softirqs last disabled at (5067555): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40 kernel: other info that might help us debug this: kernel: Possible unsafe locking scenario: kernel: CPU0 kernel: ---- kernel: lock(&local->queue_stop_reason_lock); kernel: <Interrupt> kernel: lock(&local->queue_stop_reason_lock); kernel: *** DEADLOCK *** kernel: 8 locks held by kworker/5:0/25656: kernel: #0: ffff9d618009d138 ((wq_completion)events_freezable){+.+.}-{0:0}, at: process_one_work+0x1ca/0x530 kernel: #1: ffffb1ef4637fe68 ((work_completion)(&local->restart_work)){+.+.}-{0:0}, at: process_one_work+0x1ce/0x530 kernel: #2: ffffffff9f166548 (rtnl_mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #3: ffff9d6190778728 (&rdev->wiphy.mtx){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #4: ffff9d619077b480 (&mvm->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #5: ffff9d61907bacd8 (&trans_pcie->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #6: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_queue_state_change+0x59/0x3a0 [iwlmvm] kernel: #7: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm] kernel: stack backtrace: kernel: CPU: 5 PID: 25656 Comm: kworker/5:0 Tainted: G W 6.3.0-rc6-mt-20230401-00001-gf86822a1170f #4 kernel: Hardware name: LENOVO 82H8/LNVNB161216, BIOS GGCN51WW 11/16/2022 kernel: Workqueue: events_freezable ieee80211_restart_work [mac80211] kernel: Call Trace: kernel: <TASK> kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: dump_stack_lvl+0x5f/0xa0 kernel: dump_stack+0x14/0x20 kernel: print_usage_bug.part.46+0x208/0x2a0 kernel: mark_lock.part.47+0x605/0x630 kernel: ? sched_clock+0xd/0x20 kernel: ? trace_clock_local+0x14/0x30 kernel: ? __rb_reserve_next+0x5f/0x490 kernel: ? _raw_spin_lock+0x1b/0x50 kernel: __lock_acquire+0x464/0x1990 kernel: ? mark_held_locks+0x4e/0x80 kernel: lock_acquire+0xc7/0x2d0 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ? ftrace_return_to_handler+0x8b/0x100 kernel: ? preempt_count_add+0x4/0x70 kernel: _raw_spin_lock+0x36/0x50 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_tx_dequeue+0xb4/0x1330 [mac80211] kernel: ? prepare_ftrace_return+0xc5/0x190 kernel: ? ftrace_graph_func+0x16/0x20 kernel: ? 0xffffffffc02ab0b1 kernel: ? lock_acquire+0xc7/0x2d0 kernel: ? iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm] kernel: ? ieee80211_tx_dequeue+0x9/0x1330 [mac80211] kernel: ? __rcu_read_lock+0x4/0x40 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_queue_state_change+0x311/0x3a0 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_wake_sw_queue+0x17/0x20 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_unmap+0x1c9/0x1f0 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_free+0x55/0x130 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_tx_free+0x63/0x80 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: _iwl_trans_pcie_gen2_stop_device+0x3f3/0x5b0 [iwlwifi] kernel: ? _iwl_trans_pcie_gen2_stop_device+0x9/0x5b0 [iwlwifi] kernel: ? mutex_lock_nested+0x4/0x30 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_trans_pcie_gen2_stop_device+0x5f/0x90 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_stop_device+0x78/0xd0 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: __iwl_mvm_mac_start+0x114/0x210 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_mac_start+0x76/0x150 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: drv_start+0x79/0x180 [mac80211] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_reconfig+0x1523/0x1ce0 [mac80211] kernel: ? synchronize_net+0x4/0x50 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_restart_work+0x108/0x170 [mac80211] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: process_one_work+0x250/0x530 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: worker_thread+0x48/0x3a0 kernel: ? __pfx_worker_thread+0x10/0x10 kernel: kthread+0x10f/0x140 kernel: ? __pfx_kthread+0x10/0x10 kernel: ret_from_fork+0x29/0x50 kernel: </TASK> Fixes: 4444bc2 ("wifi: mac80211: Proper mark iTXQs for resumption") Link: https://lore.kernel.org/all/1f58a0d1-d2b9-d851-73c3-93fcc607501c@alu.unizg.hr/ Reported-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr> Cc: Gregory Greenman <gregory.greenman@intel.com> Cc: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/all/cdc80531-f25f-6f9d-b15f-25e16130b53a@alu.unizg.hr/ Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Alexander Wetzel <alexander@wetzel-home.de> Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: tag, or it goes automatically? Link: https://lore.kernel.org/r/20230425164005.25272-1-mirsad.todorovac@alu.unizg.hr Signed-off-by: Johannes Berg <johannes.berg@intel.com>
lmb
pushed a commit
that referenced
this pull request
May 23, 2023
Magnus Karlsson says: ==================== Prepare the AF_XDP selftests test framework code for the upcoming multi-buffer support in AF_XDP. This so that the multi-buffer patch set does not become way too large. In that upcoming patch set, we are only including the multi-buffer tests together with any framework code that depends on the new options bit introduced in the AF_XDP multi-buffer implementation itself. Currently, the test framework is based on the premise that a packet consists of a single fragment and thus occupies a single buffer and a single descriptor. Multi-buffer breaks this assumption, as that is the whole purpose of it. Now, a packet can consist of multiple buffers and therefore consume multiple descriptors. The patch set starts with some clean-ups and simplifications followed by patches that make sure that the current code works even when a packet occupies multiple buffers. The actual code for sending and receiving multi-buffer packets will be included in the AF_XDP multi-buffer patch set as it depends on a new bit being used in the options field of the descriptor. Patch set anatomy: 1: The XDP program was unnecessarily changed many times. Fixes this. 2: There is no reason to generate a full UDP/IPv4 packet as it is never used. Simplify the code by just generating a valid Ethernet frame. 3: Introduce a more complicated payload pattern that can detect fragments out of bounds in a multi-buffer packet and other errors found in single-fragment packets. 4: As a convenience, dump the content of the faulty packet at error. 5: To simplify the code, make the usage of the packet stream for Tx and Rx more similar. 6: Store the offset of the packet in the buffer in the struct pkt definition instead of the address in the umem itself and introduce a simple buffer allocator. The address only made sense when all packets consumed a single buffer. Now, we do not know beforehand how many buffers a packet will consume, so we instead just allocate a buffer from the allocator and specify the offset within that buffer. 7: Test for huge pages only once instead of before each test that needs it. 8: Populate the fill ring based on how many frags are needed for each packet. 9: Change the data generation code so it can generate data for multi-buffer packets too. 10: Adjust the packet pacing algorithm so that it can cope with multi-buffer packets. The pacing algorithm is present so that Tx does not send too many packets/frames to Rx that it starts to drop packets. That would ruin the tests. v1 -> v2: * Fixed spelling error in patch #6 [Simon] * Fixed compilation error with llvm in patch #7 [Daniel] Thanks: Magnus ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
lmb
pushed a commit
that referenced
this pull request
Jun 20, 2023
The cited commit adds a compeletion to remove dependency on rtnl lock. But it causes a deadlock for multiple encapsulations: crash> bt ffff8aece8a64000 PID: 1514557 TASK: ffff8aece8a64000 CPU: 3 COMMAND: "tc" #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418 #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898 #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8 #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core] #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core] #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core] #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core] #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core] #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core] #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core] #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8 torvalds#13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower] torvalds#14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower] torvalds#15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047 torvalds#16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31 torvalds#17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853 torvalds#18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835 torvalds#19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27 torvalds#20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245 torvalds#21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482 torvalds#22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a torvalds#23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2 torvalds#24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2 torvalds#25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f torvalds#26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8 torvalds#27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c crash> bt 0xffff8aeb07544000 PID: 1110766 TASK: ffff8aeb07544000 CPU: 0 COMMAND: "kworker/u20:9" #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418 #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88 #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core] #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core] #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core] #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012 #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f After the first encap is attached, flow will be added to encap entry's flows list. If neigh update is running at this time, the following encaps of the flow can't hold the encap_tbl_lock and sleep. If neigh update thread is waiting for that flow's init_done, deadlock happens. Fix it by holding lock outside of the for loop. If neigh update is running, prevent encap flows from offloading. Since the lock is held outside of the for loop, concurrent creation of encap entries is not allowed. So remove unnecessary wait_for_completion call for res_ready. Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
lmb
pushed a commit
that referenced
this pull request
Jun 20, 2023
A remote DoS vulnerability of RPL Source Routing is assigned CVE-2023-2156. The Source Routing Header (SRH) has the following format: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Next Header | Hdr Ext Len | Routing Type | Segments Left | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | CmprI | CmprE | Pad | Reserved | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | . . . Addresses[1..n] . . . | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ The originator of an SRH places the first hop's IPv6 address in the IPv6 header's IPv6 Destination Address and the second hop's IPv6 address as the first address in Addresses[1..n]. The CmprI and CmprE fields indicate the number of prefix octets that are shared with the IPv6 Destination Address. When CmprI or CmprE is not 0, Addresses[1..n] are compressed as follows: 1..n-1 : (16 - CmprI) bytes n : (16 - CmprE) bytes Segments Left indicates the number of route segments remaining. When the value is not zero, the SRH is forwarded to the next hop. Its address is extracted from Addresses[n - Segment Left + 1] and swapped with IPv6 Destination Address. When Segment Left is greater than or equal to 2, the size of SRH is not changed because Addresses[1..n-1] are decompressed and recompressed with CmprI. OTOH, when Segment Left changes from 1 to 0, the new SRH could have a different size because Addresses[1..n-1] are decompressed with CmprI and recompressed with CmprE. Let's say CmprI is 15 and CmprE is 0. When we receive SRH with Segment Left >= 2, Addresses[1..n-1] have 1 byte for each, and Addresses[n] has 16 bytes. When Segment Left is 1, Addresses[1..n-1] is decompressed to 16 bytes and not recompressed. Finally, the new SRH will need more room in the header, and the size is (16 - 1) * (n - 1) bytes. Here the max value of n is 255 as Segment Left is u8, so in the worst case, we have to allocate 3825 bytes in the skb headroom. However, now we only allocate a small fixed buffer that is IPV6_RPL_SRH_WORST_SWAP_SIZE (16 + 7 bytes). If the decompressed size overflows the room, skb_push() hits BUG() below [0]. Instead of allocating the fixed buffer for every packet, let's allocate enough headroom only when we receive SRH with Segment Left 1. [0]: skbuff: skb_under_panic: text:ffffffff81c9f6e2 len:576 put:576 head:ffff8880070b5180 data:ffff8880070b4fb0 tail:0x70 end:0x140 dev:lo kernel BUG at net/core/skbuff.c:200! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 154 Comm: python3 Not tainted 6.4.0-rc4-00190-gc308e9ec0047 #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:skb_panic (net/core/skbuff.c:200) Code: 4f 70 50 8b 87 bc 00 00 00 50 8b 87 b8 00 00 00 50 ff b7 c8 00 00 00 4c 8b 8f c0 00 00 00 48 c7 c7 80 6e 77 82 e8 ad 8b 60 ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffc90000003da0 EFLAGS: 00000246 RAX: 0000000000000085 RBX: ffff8880058a6600 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88807dc1c540 RDI: ffff88807dc1c540 RBP: ffffc90000003e48 R08: ffffffff82b392c8 R09: 00000000ffffdfff R10: ffffffff82a592e0 R11: ffffffff82b092e0 R12: ffff888005b1c800 R13: ffff8880070b51b8 R14: ffff888005b1ca18 R15: ffff8880070b5190 FS: 00007f4539f0b740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055670baf3000 CR3: 0000000005b0e000 CR4: 00000000007506f0 PKRU: 55555554 Call Trace: <IRQ> skb_push (net/core/skbuff.c:210) ipv6_rthdr_rcv (./include/linux/skbuff.h:2880 net/ipv6/exthdrs.c:634 net/ipv6/exthdrs.c:718) ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:437 (discriminator 5)) ip6_input_finish (./include/linux/rcupdate.h:805 net/ipv6/ip6_input.c:483) __netif_receive_skb_one_core (net/core/dev.c:5494) process_backlog (./include/linux/rcupdate.h:805 net/core/dev.c:5934) __napi_poll (net/core/dev.c:6496) net_rx_action (net/core/dev.c:6565 net/core/dev.c:6696) __do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:572) do_softirq (kernel/softirq.c:472 kernel/softirq.c:459) </IRQ> <TASK> __local_bh_enable_ip (kernel/softirq.c:396) __dev_queue_xmit (net/core/dev.c:4272) ip6_finish_output2 (./include/net/neighbour.h:544 net/ipv6/ip6_output.c:134) rawv6_sendmsg (./include/net/dst.h:458 ./include/linux/netfilter.h:303 net/ipv6/raw.c:656 net/ipv6/raw.c:914) sock_sendmsg (net/socket.c:724 net/socket.c:747) __sys_sendto (net/socket.c:2144) __x64_sys_sendto (net/socket.c:2156 net/socket.c:2152 net/socket.c:2152) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) RIP: 0033:0x7f453a138aea Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89 RSP: 002b:00007ffcc212a1c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007ffcc212a288 RCX: 00007f453a138aea RDX: 0000000000000060 RSI: 00007f4539084c20 RDI: 0000000000000003 RBP: 00007f4538308e80 R08: 00007ffcc212a300 R09: 000000000000001c R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007f4539712d1b </TASK> Modules linked in: Fixes: 8610c7c ("net: ipv6: add support for rpl sr exthdr") Reported-by: Max VA Closes: https://www.interruptlabs.co.uk/articles/linux-ipv6-route-of-death Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230605180617.67284-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
lmb
pushed a commit
that referenced
this pull request
Jun 20, 2023
Currently, the per cpu upcall counters are allocated after the vport is created and inserted into the system. This could lead to the datapath accessing the counters before they are allocated resulting in a kernel Oops. Here is an example: PID: 59693 TASK: ffff0005f4f51500 CPU: 0 COMMAND: "ovs-vswitchd" #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4 #1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc #2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60 #3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58 #4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388 #5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c #6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68 #7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch] ... PID: 58682 TASK: ffff0005b2f0bf00 CPU: 0 COMMAND: "kworker/0:3" #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758 #1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994 #2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8 #3 [ffff80000a5d3120] die at ffffb70f0628234c #4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8 #5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4 #6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4 #7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710 #8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74 #9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac #10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24 #11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc #12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch] torvalds#13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch] torvalds#14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch] torvalds#15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch] torvalds#16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch] torvalds#17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90 We moved the per cpu upcall counter allocation to the existing vport alloc and free functions to solve this. Fixes: 95637d9 ("net: openvswitch: release vport resources on failure") Fixes: 1933ea3 ("net: openvswitch: Add support to count upcall packets") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Jul 19, 2023
Petr Machata says: ==================== mlxsw: Manage RIF across PVID changes The mlxsw driver currently makes the assumption that the user applies configuration in a bottom-up manner. Thus netdevices need to be added to the bridge before IP addresses are configured on that bridge or SVI added on top of it. Enslaving a netdevice to another netdevice that already has uppers is in fact forbidden by mlxsw for this reason. Despite this safety, it is rather easy to get into situations where the offloaded configuration is just plain wrong. As an example, take a front panel port, configure an IP address: it gets a RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the port from the bridge again, but the RIF never comes back. There is a number of similar situations, where changing the configuration there and back utterly breaks the offload. The situation is going to be made better by implementing a range of replays and post-hoc offloads. In this patch set, address the ordering issues related to creation of bridge RIFs. Currently, mlxsw has several shortcomings with regards to RIF handling due to PVID changes: - In order to cause RIF for a bridge device to be created, the user is expected first to set PVID, then to add an IP address. The reverse ordering is disallowed, which is not very user-friendly. - When such bridge gets a VLAN upper whose VID was the same as the existing PVID, and this VLAN netdevice gets an IP address, a RIF is created for this netdevice. The new RIF is then assigned to the 802.1Q FID for the given VID. This results in a working configuration. However, then, when the VLAN netdevice is removed again, the RIF for the bridge itself is never reassociated to the PVID. - PVID cannot be changed once the bridge has uppers. Presumably this is because the driver does not manage RIFs properly in face of PVID changes. However, as the previous point shows, it is still possible to get into invalid configurations. This patch set addresses these issues and relaxes some of the ordering requirements that mlxsw had. The patch set proceeds as follows: - In patch #1, pass extack to mlxsw_sp_br_ban_rif_pvid_change() - To relax ordering between setting PVID and adding an IP address to a bridge, mlxsw must be able to request that a RIF is created with a given VLAN ID, instead of trying to deduce it from the current netdevice settings, which do not reflect the user-requested values yet. This is done in patches #2 and #3. - Similarly, mlxsw_sp_inetaddr_bridge_event() will need to make decisions based on the user-requested value of PVID, not the current value. Thus in patches #4 and #5, add a new argument which carries the requested PVID value. - Finally in patch #6 relax the ban on PVID changes when a bridge has uppers. Instead, add the logic necessary for creation of a RIF as a result of PVID change. - Relevant selftests are presented afterwards. In patch #7 a preparatory helper is added to lib.sh. Patches #8, #9, #10 and #11 include selftests themselves. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Jul 28, 2023
Petr Machata says: ==================== mlxsw: Permit enslavement to netdevices with uppers The mlxsw driver currently makes the assumption that the user applies configuration in a bottom-up manner. Thus netdevices need to be added to the bridge before IP addresses are configured on that bridge or SVI added on top of it. Enslaving a netdevice to another netdevice that already has uppers is in fact forbidden by mlxsw for this reason. Despite this safety, it is rather easy to get into situations where the offloaded configuration is just plain wrong. As an example, take a front panel port, configure an IP address: it gets a RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the port from the bridge again, but the RIF never comes back. There is a number of similar situations, where changing the configuration there and back utterly breaks the offload. Similarly, detaching a front panel port from a configured topology means unoffloading of this whole topology -- VLAN uppers, next hops, etc. Attaching the port back is then not permitted at all. If it were, it would not result in a working configuration, because much of mlxsw is written to react to changes in immediate configuration. There is nothing that would go visit netdevices in the attached-to topology and offload existing routes and VLAN memberships, for example. In this patchset, introduce a number of replays to be invoked so that this sort of post-hoc offload is supported. Then remove the vetoes that disallowed enslavement of front panel ports to other netdevices with uppers. The patchset progresses as follows: - In patch #1, fix an issue in the bridge driver. To my knowledge, the issue could not have resulted in a buggy behavior previously, and thus is packaged with this patchset instead of being sent separately to net. - In patch #2, add a new helper to the switchdev code. - In patch #3, drop mlxsw selftests that will not be relevant after this patchset anymore. - Patches #4, #5, #6, #7 and #8 prepare the codebase for smoother introduction of the rest of the code. - Patches #9, #10, #11, #12, torvalds#13 and torvalds#14 replay various aspects of upper configuration when a front panel port is introduced into a topology. Individual patches take care of bridge and LAG RIF memberships, switchdev replay, nexthop and neighbors replay, and MACVLAN offload. - Patches torvalds#15 and torvalds#16 introduce RIFs for newly-relevant netdevices when a front panel port is enslaved (in which case all uppers are newly relevant), or, respectively, deslaved (in which case the newly-relevant netdevice is the one being deslaved). - Up until this point, the introduced scaffolding was not really used, because mlxsw still forbids enslavement of mlxsw netdevices to uppers with uppers. In patch torvalds#17, this condition is finally relaxed. A sizable selftest suite is available to test all this new code. That will be sent in a separate patchset. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Aug 4, 2023
The cited commit holds encap tbl lock unconditionally when setting up dests. But it may cause the following deadlock: PID: 1063722 TASK: ffffa062ca5d0000 CPU: 13 COMMAND: "handler8" #0 [ffffb14de05b7368] __schedule at ffffffffa1d5aa91 #1 [ffffb14de05b7410] schedule at ffffffffa1d5afdb #2 [ffffb14de05b7430] schedule_preempt_disabled at ffffffffa1d5b528 #3 [ffffb14de05b7440] __mutex_lock at ffffffffa1d5d6cb #4 [ffffb14de05b74e8] mutex_lock_nested at ffffffffa1d5ddeb #5 [ffffb14de05b74f8] mlx5e_tc_tun_encap_dests_set at ffffffffc12f2096 [mlx5_core] #6 [ffffb14de05b7568] post_process_attr at ffffffffc12d9fc5 [mlx5_core] #7 [ffffb14de05b75a0] mlx5e_tc_add_fdb_flow at ffffffffc12de877 [mlx5_core] #8 [ffffb14de05b75f0] __mlx5e_add_fdb_flow at ffffffffc12e0eef [mlx5_core] #9 [ffffb14de05b7660] mlx5e_tc_add_flow at ffffffffc12e12f7 [mlx5_core] #10 [ffffb14de05b76b8] mlx5e_configure_flower at ffffffffc12e1686 [mlx5_core] #11 [ffffb14de05b7720] mlx5e_rep_indr_offload at ffffffffc12e3817 [mlx5_core] #12 [ffffb14de05b7730] mlx5e_rep_indr_setup_tc_cb at ffffffffc12e388a [mlx5_core] torvalds#13 [ffffb14de05b7740] tc_setup_cb_add at ffffffffa1ab2ba8 torvalds#14 [ffffb14de05b77a0] fl_hw_replace_filter at ffffffffc0bdec2f [cls_flower] torvalds#15 [ffffb14de05b7868] fl_change at ffffffffc0be6caa [cls_flower] torvalds#16 [ffffb14de05b7908] tc_new_tfilter at ffffffffa1ab71f0 [1031218.028143] wait_for_completion+0x24/0x30 [1031218.028589] mlx5e_update_route_decap_flows+0x9a/0x1e0 [mlx5_core] [1031218.029256] mlx5e_tc_fib_event_work+0x1ad/0x300 [mlx5_core] [1031218.029885] process_one_work+0x24e/0x510 Actually no need to hold encap tbl lock if there is no encap action. Fix it by checking if encap action exists or not before holding encap tbl lock. Fixes: 37c3b9f ("net/mlx5e: Prevent encap offload when neigh update is running") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
borkmann
pushed a commit
that referenced
this pull request
Aug 4, 2023
In internal testing of test_maps, we sometimes observed failures like: test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void *): Assertion `bpf_map_update_elem(fd, &key, value, BPF_ANY) == 0' failed. where the errno is ENOMEM. After some troubleshooting and enabling the warnings, we saw: [ 91.304708] percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left [ 91.304716] CPU: 51 PID: 24145 Comm: test_maps Kdump: loaded Tainted: G N 6.1.38-smp-DEV #7 [ 91.304719] Hardware name: Google Astoria/astoria, BIOS 0.20230627.0-0 06/27/2023 [ 91.304721] Call Trace: [ 91.304724] <TASK> [ 91.304730] [<ffffffffa7ef83b9>] dump_stack_lvl+0x59/0x88 [ 91.304737] [<ffffffffa7ef83f8>] dump_stack+0x10/0x18 [ 91.304738] [<ffffffffa75caa0c>] pcpu_alloc+0x6fc/0x870 [ 91.304741] [<ffffffffa75ca302>] __alloc_percpu_gfp+0x12/0x20 [ 91.304743] [<ffffffffa756785e>] alloc_bulk+0xde/0x1e0 [ 91.304746] [<ffffffffa7566c02>] bpf_mem_alloc_init+0xd2/0x2f0 [ 91.304747] [<ffffffffa7547c69>] htab_map_alloc+0x479/0x650 [ 91.304750] [<ffffffffa751d6e0>] map_create+0x140/0x2e0 [ 91.304752] [<ffffffffa751d413>] __sys_bpf+0x5a3/0x6c0 [ 91.304753] [<ffffffffa751c3ec>] __x64_sys_bpf+0x1c/0x30 [ 91.304754] [<ffffffffa7ef847a>] do_syscall_64+0x5a/0x80 [ 91.304756] [<ffffffffa800009b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd This makes sense, because in atomic context, percpu allocation would not create new chunks; it would only create in non-atomic contexts. And if during prefill all precpu chunks are full, -ENOMEM would happen immediately upon next unit_alloc. Prefill phase does not actually run in atomic context, so we can use this fact to allocate non-atomically with GFP_KERNEL instead of GFP_NOWAIT. This avoids the immediate -ENOMEM. GFP_NOWAIT has to be used in unit_alloc when bpf program runs in atomic context. Even if bpf program runs in non-atomic context, in most cases, rcu read lock is enabled for the program so GFP_NOWAIT is still needed. This is often also the case for BPF_MAP_UPDATE_ELEM syscalls. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230728043359.3324347-1-zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Aug 14, 2023
…inux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2023-08-07 1) Few cleanups 2) Dynamic completion EQs The driver creates completion EQs for all vectors directly on driver load, even if those EQs will not be utilized later on. To allow more flexibility in managing completion EQs and to reduce the memory overhead of driver load, this series will adjust completion EQs creation to be dynamic. Meaning, completion EQs will be created only when needed. Patch #1 introduces a counter for tracking the current number of completion EQs. Patches #2-6 refactor the existing infrastructure of managing completion EQs and completion IRQs to be compatible with per-vector allocation/release requests. Patches #7-8 modify the CPU-to-IRQ affinity calculation to be resilient in case the affinity is requested but completion IRQ is not allocated yet. Patch #9 function rename. Patch #10 handles the corner case of SF performing an IRQ request when no SF IRQ pool is found, and no PF IRQ exists for the same vector. Patch #11 modify driver to use dynamically allocate completion EQs. * tag 'mlx5-updates-2023-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: Bridge, Only handle registered netdev bridge events net/mlx5: E-Switch, Remove redundant arg ignore_flow_lvl net/mlx5: Fix typo reminder -> remainder net/mlx5: remove many unnecessary NULL values net/mlx5: Allocate completion EQs dynamically net/mlx5: Handle SF IRQ request in the absence of SF IRQ pool net/mlx5: Rename mlx5_comp_vectors_count() to mlx5_comp_vectors_max() net/mlx5: Add IRQ vector to CPU lookup function net/mlx5: Introduce mlx5_cpumask_default_spread net/mlx5: Implement single completion EQ create/destroy methods net/mlx5: Use xarray to store and manage completion EQs net/mlx5: Refactor completion IRQ request/release handlers in EQ layer net/mlx5: Use xarray to store and manage completion IRQs net/mlx5: Refactor completion IRQ request/release API net/mlx5: Track the current number of completion EQs ==================== Link: https://lore.kernel.org/r/20230807175642.20834-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
Noticed with: make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin Direct leak of 45 byte(s) in 1 object(s) allocated from: #0 0x7f213f87243b in strdup (/lib64/libasan.so.8+0x7243b) #1 0x63d15f in evsel__set_filter util/evsel.c:1371 #2 0x63d15f in evsel__append_filter util/evsel.c:1387 #3 0x63d15f in evsel__append_tp_filter util/evsel.c:1400 #4 0x62cd52 in evlist__append_tp_filter util/evlist.c:1145 #5 0x62cd52 in evlist__append_tp_filter_pids util/evlist.c:1196 #6 0x541e49 in trace__set_filter_loop_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3646 #7 0x541e49 in trace__set_filter_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3670 #8 0x541e49 in trace__run /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3970 #9 0x541e49 in cmd_trace /home/acme/git/perf-tools/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools/tools/perf/perf.c:377 #12 0x4196da in run_argv /home/acme/git/perf-tools/tools/perf/perf.c:421 torvalds#13 0x4196da in main /home/acme/git/perf-tools/tools/perf/perf.c:537 torvalds#14 0x7f213e84a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Free it on evsel__exit(). Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/20230719202951.534582-2-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
To plug these leaks detected with: $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin ================================================================= ==473890==ERROR: LeakSanitizer: detected memory leaks Direct leak of 112 byte(s) in 1 object(s) allocated from: #0 0x7fdf19aba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987836 in zalloc (/home/acme/bin/perf+0x987836) #2 0x5367ae in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1289 #3 0x5367ae in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307 #4 0x5367ae in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468 #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177 #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685 #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712 #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055 #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 torvalds#13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Direct leak of 2048 byte(s) in 1 object(s) allocated from: #0 0x7f788fcba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af) #1 0x5337c0 in trace__sys_enter /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2342 #2 0x52bfb4 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3191 #3 0x52bfb4 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3699 #4 0x542883 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3726 #5 0x542883 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4069 #6 0x542883 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5155 #7 0x5ef232 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #8 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #9 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #10 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 #11 0x7f788ec4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Indirect leak of 48 byte(s) in 1 object(s) allocated from: #0 0x7fdf19aba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af) #1 0x77b335 in intlist__new util/intlist.c:116 #2 0x5367fd in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1293 #3 0x5367fd in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307 #4 0x5367fd in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468 #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177 #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685 #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712 #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055 #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 torvalds#13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/20230719202951.534582-4-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
In 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv") it only was freeing if strcmp(evsel->tp_format->system, "syscalls") returned zero, while the corresponding initialization of evsel->priv was being performed if it was _not_ zero, i.e. if the tp system wasn't 'syscalls'. Just stop looking for that and free it if evsel->priv was set, which should be equivalent. Also use the pre-existing evsel_trace__delete() function. This resolves these leaks, detected with: $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin ================================================================= ==481565==ERROR: LeakSanitizer: detected memory leaks Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966) #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307 #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333 #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458 #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480 #6 0x540e8b in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3212 #7 0x540e8b in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891 #8 0x540e8b in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156 #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966) #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307 #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333 #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458 #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480 #6 0x540dd1 in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3205 #7 0x540dd1 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891 #8 0x540dd1 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156 #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) SUMMARY: AddressSanitizer: 80 byte(s) leaked in 2 allocation(s). [root@quaco ~]# With this we plug all leaks with "perf trace sleep 1". Fixes: 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv") Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Riccardo Mancini <rickyman7@gmail.com> Link: https://lore.kernel.org/lkml/20230719202951.534582-5-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
…failure to add a probe Building perf with EXTRA_CFLAGS="-fsanitize=address" a leak is detect when trying to add a probe to a non-existent function: # perf probe -x ~/bin/perf dso__neW Probe point 'dso__neW' not found. Error: Failed to add events. ================================================================= ==296634==ERROR: LeakSanitizer: detected memory leaks Direct leak of 128 byte(s) in 1 object(s) allocated from: #0 0x7f67642ba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x7f67641a76f1 in allocate_cfi (/lib64/libdw.so.1+0x3f6f1) Direct leak of 65 byte(s) in 1 object(s) allocated from: #0 0x7f67642b95b5 in __interceptor_realloc.part.0 (/lib64/libasan.so.8+0xb95b5) #1 0x6cac75 in strbuf_grow util/strbuf.c:64 #2 0x6ca934 in strbuf_init util/strbuf.c:25 #3 0x9337d2 in synthesize_perf_probe_point util/probe-event.c:2018 #4 0x92be51 in try_to_find_probe_trace_events util/probe-event.c:964 #5 0x93d5c6 in convert_to_probe_trace_events util/probe-event.c:3512 #6 0x93d6d5 in convert_perf_probe_events util/probe-event.c:3529 #7 0x56f37f in perf_add_probe_events /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:354 #8 0x572fbc in __cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:738 #9 0x5730f2 in cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:766 #10 0x635d81 in run_builtin /var/home/acme/git/perf-tools-next/tools/perf/perf.c:323 #11 0x6362c1 in handle_internal_command /var/home/acme/git/perf-tools-next/tools/perf/perf.c:377 #12 0x63667a in run_argv /var/home/acme/git/perf-tools-next/tools/perf/perf.c:421 torvalds#13 0x636b8d in main /var/home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#14 0x7f676302950f in __libc_start_call_main (/lib64/libc.so.6+0x2950f) SUMMARY: AddressSanitizer: 193 byte(s) leaked in 2 allocation(s). # synthesize_perf_probe_point() returns a "detachec" strbuf, i.e. a malloc'ed string that needs to be free'd. An audit will be performed to find other such cases. Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/ZM0l1Oxamr4SVjfY@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
While debugging a segfault on 'perf lock contention' without an available perf.data file I noticed that it was basically calling: perf_session__delete(ERR_PTR(-1)) Resulting in: (gdb) run lock contention Starting program: /root/bin/perf lock contention [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". failed to open perf.data: No such file or directory (try 'perf record' first) Initializing perf session failed Program received signal SIGSEGV, Segmentation fault. 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 2858 if (!session->auxtrace) (gdb) p session $1 = (struct perf_session *) 0xffffffffffffffff (gdb) bt #0 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 #1 0x000000000057bb4d in perf_session__delete (session=0xffffffffffffffff) at util/session.c:300 #2 0x000000000047c421 in __cmd_contention (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2161 #3 0x000000000047dc95 in cmd_lock (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2604 #4 0x0000000000501466 in run_builtin (p=0xe597a8 <commands+552>, argc=2, argv=0x7fffffffe200) at perf.c:322 #5 0x00000000005016d5 in handle_internal_command (argc=2, argv=0x7fffffffe200) at perf.c:375 #6 0x0000000000501824 in run_argv (argcp=0x7fffffffe02c, argv=0x7fffffffe020) at perf.c:419 #7 0x0000000000501b11 in main (argc=2, argv=0x7fffffffe200) at perf.c:535 (gdb) So just set it to NULL after using PTR_ERR(session) to decode the error as perf_session__delete(NULL) is supported. The same problem was found in 'perf top' after an audit of all perf_session__new() failure handling. Fixes: 6ef81c5 ("perf session: Return error code for perf_session__new() function on failure") Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jeremie Galarneau <jeremie.galarneau@efficios.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com> Cc: Mukesh Ojha <mojha@codeaurora.org> Cc: Nageswara R Sastry <rnsastry@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Shawn Landden <shawn@git.icu> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tzvetomir Stoyanov <tstoyanov@vmware.com> Link: https://lore.kernel.org/lkml/ZN4Q2rxxsL08A8rd@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Sep 24, 2023
While debugging a segfault on 'perf lock contention' without an available perf.data file I noticed that it was basically calling: perf_session__delete(ERR_PTR(-1)) Resulting in: (gdb) run lock contention Starting program: /root/bin/perf lock contention [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". failed to open perf.data: No such file or directory (try 'perf record' first) Initializing perf session failed Program received signal SIGSEGV, Segmentation fault. 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 2858 if (!session->auxtrace) (gdb) p session $1 = (struct perf_session *) 0xffffffffffffffff (gdb) bt #0 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 #1 0x000000000057bb4d in perf_session__delete (session=0xffffffffffffffff) at util/session.c:300 #2 0x000000000047c421 in __cmd_contention (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2161 #3 0x000000000047dc95 in cmd_lock (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2604 #4 0x0000000000501466 in run_builtin (p=0xe597a8 <commands+552>, argc=2, argv=0x7fffffffe200) at perf.c:322 #5 0x00000000005016d5 in handle_internal_command (argc=2, argv=0x7fffffffe200) at perf.c:375 #6 0x0000000000501824 in run_argv (argcp=0x7fffffffe02c, argv=0x7fffffffe020) at perf.c:419 #7 0x0000000000501b11 in main (argc=2, argv=0x7fffffffe200) at perf.c:535 (gdb) So just set it to NULL after using PTR_ERR(session) to decode the error as perf_session__delete(NULL) is supported. Fixes: eef4fee ("perf lock: Dynamically allocate lockhash_table") Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Ross Zwisler <zwisler@chromium.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Yang Jihong <yangjihong1@huawei.com> Link: https://lore.kernel.org/lkml/ZN4R1AYfsD2J8lRs@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann
pushed a commit
that referenced
this pull request
Oct 6, 2023
Fix an error detected by memory sanitizer: ``` ==4033==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x55fb0fbedfc7 in read_alias_info tools/perf/util/pmu.c:457:6 #1 0x55fb0fbea339 in check_info_data tools/perf/util/pmu.c:1434:2 #2 0x55fb0fbea339 in perf_pmu__check_alias tools/perf/util/pmu.c:1504:9 #3 0x55fb0fbdca85 in parse_events_add_pmu tools/perf/util/parse-events.c:1429:32 #4 0x55fb0f965230 in parse_events_parse tools/perf/util/parse-events.y:299:6 #5 0x55fb0fbdf6b2 in parse_events__scanner tools/perf/util/parse-events.c:1822:8 #6 0x55fb0fbdf8c1 in __parse_events tools/perf/util/parse-events.c:2094:8 #7 0x55fb0fa8ffa9 in parse_events tools/perf/util/parse-events.h:41:9 #8 0x55fb0fa8ffa9 in test_event tools/perf/tests/parse-events.c:2393:8 #9 0x55fb0fa8f458 in test__pmu_events tools/perf/tests/parse-events.c:2551:15 #10 0x55fb0fa6d93f in run_test tools/perf/tests/builtin-test.c:242:9 #11 0x55fb0fa6d93f in test_and_print tools/perf/tests/builtin-test.c:271:8 #12 0x55fb0fa6d082 in __cmd_test tools/perf/tests/builtin-test.c:442:5 torvalds#13 0x55fb0fa6d082 in cmd_test tools/perf/tests/builtin-test.c:564:9 torvalds#14 0x55fb0f942720 in run_builtin tools/perf/perf.c:322:11 torvalds#15 0x55fb0f942486 in handle_internal_command tools/perf/perf.c:375:8 torvalds#16 0x55fb0f941dab in run_argv tools/perf/perf.c:419:2 torvalds#17 0x55fb0f941dab in main tools/perf/perf.c:535:3 ``` Fixes: 7b723db ("perf pmu: Be lazy about loading event info files from sysfs") Signed-off-by: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20230914022425.1489035-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Oct 6, 2023
The following call trace shows a deadlock issue due to recursive locking of mutex "device_mutex". First lock acquire is in target_for_each_device() and second in target_free_device(). PID: 148266 TASK: ffff8be21ffb5d00 CPU: 10 COMMAND: "iscsi_ttx" #0 [ffffa2bfc9ec3b18] __schedule at ffffffffa8060e7f #1 [ffffa2bfc9ec3ba0] schedule at ffffffffa8061224 #2 [ffffa2bfc9ec3bb8] schedule_preempt_disabled at ffffffffa80615ee #3 [ffffa2bfc9ec3bc8] __mutex_lock at ffffffffa8062fd7 #4 [ffffa2bfc9ec3c40] __mutex_lock_slowpath at ffffffffa80631d3 #5 [ffffa2bfc9ec3c50] mutex_lock at ffffffffa806320c #6 [ffffa2bfc9ec3c68] target_free_device at ffffffffc0935998 [target_core_mod] #7 [ffffa2bfc9ec3c90] target_core_dev_release at ffffffffc092f975 [target_core_mod] #8 [ffffa2bfc9ec3ca0] config_item_put at ffffffffa79d250f #9 [ffffa2bfc9ec3cd0] config_item_put at ffffffffa79d2583 #10 [ffffa2bfc9ec3ce0] target_devices_idr_iter at ffffffffc0933f3a [target_core_mod] #11 [ffffa2bfc9ec3d00] idr_for_each at ffffffffa803f6fc #12 [ffffa2bfc9ec3d60] target_for_each_device at ffffffffc0935670 [target_core_mod] torvalds#13 [ffffa2bfc9ec3d98] transport_deregister_session at ffffffffc0946408 [target_core_mod] torvalds#14 [ffffa2bfc9ec3dc8] iscsit_close_session at ffffffffc09a44a6 [iscsi_target_mod] torvalds#15 [ffffa2bfc9ec3df0] iscsit_close_connection at ffffffffc09a4a88 [iscsi_target_mod] torvalds#16 [ffffa2bfc9ec3df8] finish_task_switch at ffffffffa76e5d07 torvalds#17 [ffffa2bfc9ec3e78] iscsit_take_action_for_connection_exit at ffffffffc0991c23 [iscsi_target_mod] torvalds#18 [ffffa2bfc9ec3ea0] iscsi_target_tx_thread at ffffffffc09a403b [iscsi_target_mod] torvalds#19 [ffffa2bfc9ec3f08] kthread at ffffffffa76d8080 torvalds#20 [ffffa2bfc9ec3f50] ret_from_fork at ffffffffa8200364 Fixes: 36d4cb4 ("scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Link: https://lore.kernel.org/r/20230918225848.66463-1-junxiao.bi@oracle.com Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
borkmann
pushed a commit
that referenced
this pull request
Oct 20, 2023
Chuyi Zhou says: ==================== Add Open-coded task, css_task and css iters This is version 6 of task, css_task and css iters support. --- Changelog --- v5 -> v6: Patch #3: * In bpf_iter_task_next, return pos rather than goto out. (Andrii) Patch #2, #3, #4: * Add the missing __diag_ignore_all to avoid kernel build warning Patch #5, #6, #7: * Add Andrii's ack Patch #8: * In BPF prog iter_css_task_for_each, return -EPERM rather than 0, and ensure stack_mprotect() in iters.c not success. If not, it would cause the subsequent 'test_lsm' fail, since the 'is_stack' check in test_int_hook(lsm.c) would not be guaranteed. (https://github.com/kernel-patches/bpf/actions/runs/6489662214/job/17624665086?pr=5790) v4 -> v5:https://lore.kernel.org/lkml/20231007124522.34834-1-zhouchuyi@bytedance.com/ Patch 3~4: * Relax the BUILD_BUG_ON check in bpf_iter_task_new and bpf_iter_css_new to avoid netdev/build_32bit CI error. (https://netdev.bots.linux.dev/static/nipa/790929/13412333/build_32bit/stderr) Patch 8: * Initialize skel pointer to fix the LLVM-16 build CI error (https://github.com/kernel-patches/bpf/actions/runs/6462875618/job/17545170863) v3 -> v4:https://lore.kernel.org/all/20230925105552.817513-1-zhouchuyi@bytedance.com/ * Address all the comments from Andrii in patch-3 ~ patch-6 * Collect Tejun's ack * Add a extra patch to rename bpf_iter_task.c to bpf_iter_tasks.c * Seperate three BPF program files for selftests (iters_task.c iters_css_task.c iters_css.c) v2 -> v3:https://lore.kernel.org/lkml/20230912070149.969939-1-zhouchuyi@bytedance.com/ Patch 1 (cgroup: Prepare for using css_task_iter_*() in BPF) * Add tj's ack and Alexei's suggest-by. Patch 2 (bpf: Introduce css_task open-coded iterator kfuncs) * Use bpf_mem_alloc/bpf_mem_free rather than kzalloc() * Add KF_TRUSTED_ARGS for bpf_iter_css_task_new (Alexei) * Move bpf_iter_css_task's definition from uapi/linux/bpf.h to kernel/bpf/task_iter.c and we can use it from vmlinux.h * Move bpf_iter_css_task_XXX's declaration from bpf_helpers.h to bpf_experimental.h Patch 3 (Introduce task open coded iterator kfuncs) * Change th API design keep consistent with SEC("iter/task"), support iterating all threads(BPF_TASK_ITERATE_ALL) and threads of a specific task (BPF_TASK_ITERATE_THREAD).(Andrii) * Move bpf_iter_task's definition from uapi/linux/bpf.h to kernel/bpf/task_iter.c and we can use it from vmlinux.h * Move bpf_iter_task_XXX's declaration from bpf_helpers.h to bpf_experimental.h Patch 4 (Introduce css open-coded iterator kfuncs) * Change th API design keep consistent with cgroup_iters, reuse BPF_CGROUP_ITER_DESCENDANTS_PRE/BPF_CGROUP_ITER_DESCENDANTS_POST /BPF_CGROUP_ITER_ANCESTORS_UP(Andrii) * Add KF_TRUSTED_ARGS for bpf_iter_css_new * Move bpf_iter_css's definition from uapi/linux/bpf.h to kernel/bpf/task_iter.c and we can use it from vmlinux.h * Move bpf_iter_css_XXX's declaration from bpf_helpers.h to bpf_experimental.h Patch 5 (teach the verifier to enforce css_iter and task_iter in RCU CS) * Add KF flag KF_RCU_PROTECTED to maintain kfuncs which need RCU CS.(Andrii) * Consider STACK_ITER when using bpf_for_each_spilled_reg. Patch 6 (Let bpf_iter_task_new accept null task ptr) * Add this extra patch to let bpf_iter_task_new accept a 'nullable' * task pointer(Andrii) Patch 7 (selftests/bpf: Add tests for open-coded task and css iter) * Add failure testcase(Alexei) Changes from v1(https://lore.kernel.org/lkml/20230827072057.1591929-1-zhouchuyi@bytedance.com/): - Add a pre-patch to make some preparations before supporting css_task iters.(Alexei) - Add an allowlist for css_task iters(Alexei) - Let bpf progs do explicit bpf_rcu_read_lock() when using process iters and css_descendant iters.(Alexei) --------------------- In some BPF usage scenarios, it will be useful to iterate the process and css directly in the BPF program. One of the expected scenarios is customizable OOM victim selection via BPF[1]. Inspired by Dave's task_vma iter[2], this patchset adds three types of open-coded iterator kfuncs: 1. bpf_task_iters. It can be used to 1) iterate all process in the system, like for_each_forcess() in kernel. 2) iterate all threads in the system. 3) iterate all threads of a specific task 2. bpf_css_iters. It works like css_task_iter_{start, next, end} and would be used to iterating tasks/threads under a css. 3. css_iters. It works like css_next_descendant_{pre, post} to iterating all descendant css. BPF programs can use these kfuncs directly or through bpf_for_each macro. link[1]: https://lore.kernel.org/lkml/20230810081319.65668-1-zhouchuyi@bytedance.com/ link[2]: https://lore.kernel.org/all/20230810183513.684836-1-davemarchevsky@fb.com/ ==================== Link: https://lore.kernel.org/r/20231018061746.111364-1-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Oct 24, 2023
Eduard Zingerman says: ==================== exact states comparison for iterator convergence checks Iterator convergence logic in is_state_visited() uses state_equals() for states with branches counter > 0 to check if iterator based loop converges. This is not fully correct because state_equals() relies on presence of read and precision marks on registers. These marks are not guaranteed to be finalized while state has branches. Commit message for patch #3 describes a program that exhibits such behavior. This patch-set aims to fix iterator convergence logic by adding notion of exact states comparison. Exact comparison does not rely on presence of read or precision marks and thus is more strict. As explained in commit message for patch #3 exact comparisons require addition of speculative register bounds widening. The end result for BPF verifier users could be summarized as follows: (!) After this update verifier would reject programs that conjure an imprecise value on the first loop iteration and use it as precise on the second (for iterator based loops). I urge people to at least skim over the commit message for patch #3. Patches are organized as follows: - patches #1,2: moving/extracting utility functions; - patch #3: introduces exact mode for states comparison and adds widening heuristic; - patch #4: adds test-cases that demonstrate why the series is necessary; - patch #5: extends patch #3 with a notion of state loop entries, these entries have to be tracked to correctly identify that different verifier states belong to the same states loop; - patch #6: adds a test-case that demonstrates a program which requires loop entry tracking for correct verification; - patch #7: just adds a few debug prints. The following actions are planned as a followup for this patch-set: - implementation has to be adapted for callbacks handling logic as a part of a fix for [1]; - it is necessary to explore ways to improve widening heuristic to handle iters_task_vma test w/o need to insert barrier_var() calls; - explored states eviction logic on cache miss has to be extended to either: - allow eviction of checkpoint states -or- - be sped up in case if there are many active checkpoints associated with the same instruction. The patch-set is a followup for mailing list discussion [1]. Changelog: - V2 [3] -> V3: - correct check for stack spills in widen_imprecise_scalars(), added test case progs/iters.c:widen_spill to check the behavior (suggested by Andrii); - allow eviction of checkpoint states in is_state_visited() to avoid pathological verifier performance when iterator based loop does not converge (discussion with Alexei). - V1 [2] -> V2, applied changes suggested by Alexei offlist: - __explored_state() function removed; - same_callsites() function is now used in clean_live_states(); - patches #1,2 are added as preparatory code movement; - in process_iter_next_call() a safeguard is added to verify that cur_st->parent exists and has expected insn index / call sites. [1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/ [2] https://lore.kernel.org/bpf/20231021005939.1041-1-eddyz87@gmail.com/ [3] https://lore.kernel.org/bpf/20231022010812.9201-1-eddyz87@gmail.com/ ==================== Link: https://lore.kernel.org/r/20231024000917.12153-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Oct 27, 2023
Jiri Pirko says: ==================== devlink: fix a deadlock when taking devlink instance lock while holding RTNL lock devlink_port_fill() may be called sometimes with RTNL lock held. When putting the nested port function devlink instance attrs, current code takes nested devlink instance lock. In that case lock ordering is wrong. Patch #1 is a dependency of patch #2. Patch #2 converts the peernet2id_alloc() call to rely in RCU so it could called without devlink instance lock. Patch #3 takes device reference for devlink instance making sure that device does not disappear before devlink_release() is called. Patch #4 benefits from the preparations done in patches #2 and #3 and removes the problematic nested devlink lock aquisition. Patched #5-#7 improve documentation to reflect this issue so it is avoided in the future. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Oct 27, 2023
Petr Machata says: ==================== mlxsw: Move allocation of LAG table to the driver PGT is an in-HW table that maps addresses to sets of ports. Then when some HW process needs a set of ports as an argument, instead of embedding the actual set in the dynamic configuration, what gets configured is the address referencing the set. The HW then works with the appropriate PGT entry. Within the PGT is placed a LAG table. That is a contiguous block of PGT memory where each entry describes which ports are members of the corresponding LAG port. The PGT is split to two parts: one managed by the FW, and one managed by the driver. Historically, the FW part included also the LAG table, referred to as FW LAG mode. Giving the responsibility for placement of the LAG table to the driver, referred to as SW LAG mode, makes the whole system more flexible. The FW currently supports both FW and SW LAG modes. To shed complexity, the FW should in the future only support SW LAG mode. Hence this patchset, where support for placement of LAG is added to mlxsw. There are FW versions out there that do not support SW LAG mode, and on Spectrum-1 in particular, there is no plan to support it at all. mlxsw will therefore have to support both modes of operation. Another aspect is that at least on Spectrum-1, there are FW versions out there that claim to support driver-placed LAG table, but then reject or ignore configurations enabling the same. The driver thus has to have a say in whether an attempt to configure SW LAG mode should even be done. The feature is therefore expressed in terms of "does the driver prefer SW LAG mode?", and "what LAG mode the PCI module managed to configure the FW with". This is unlike current flood mode configuration, where the driver can give a strict value, and that's what gets configured. But it gives a chance to the driver to determine whether LAG mode should be enabled at all. The "does the driver prefer SW LAG mode?" bit is expressed as a boolean lag_mode_prefer_sw. The reason for this is largely another feature that will be introduced in a follow-up patchset: support for CFF flood mode. The driver currently requires that the FW be configured with what is called controlled flood mode. But on capable systems, CFF would be preferred. So there are two values in flight: the preferred flood mode, and the fallback. This could be expressed with an array of flood modes ordered by preference, but that looks like an overkill in comparison. This flag/value model is then reused for LAG mode as well, except the fallback value is absent and implied to be FW, because there are no other values to choose from. The patchset progresses as follows: - Patches #1 to #5 adjust reg.h and cmd.h with new register fields, constants and remarks. - Patches #6 and #7 add the ability to request SW LAG mode and to query the LAG mode that was actually negotiated. This is where the abovementioned lag_mode_prefer_sw flag is added. - Patches #7 to #9 generalize PGT allocations to make it possible to allocate the LAG table, which is done in patch #10. - In patch #11, toggle lag_mode_prefer_sw on Spectrum-2 and above, which makes the newly-added code live. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann
pushed a commit
that referenced
this pull request
Oct 27, 2023
Jiri Pirko says: ==================== devlink: finish conversion to generated split_ops This patchset converts the remaining genetlink commands to generated split_ops and removes the existing small_ops arrays entirely alongside with shared netlink attribute policy. Patches #1-#6 are just small preparations and small fixes on multiple places. Note that couple of patches contain the "Fixes" tag but no need to put them into -net tree. Patch #7 is a simple rename preparation Patch #8 is the main one in this set and adds actual definitions of cmds in to yaml file. Patches #9-#10 finalize the change removing bits that are no longer in use. ==================== Link: https://lore.kernel.org/r/20231021112711.660606-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
borkmann
pushed a commit
that referenced
this pull request
Oct 27, 2023
The ath11k active pdevs are protected by RCU but the temperature event handling code calling ath11k_mac_get_ar_by_pdev_id() was not marked as a read-side critical section as reported by RCU lockdep: ============================= WARNING: suspicious RCU usage 6.6.0-rc6 #7 Not tainted ----------------------------- drivers/net/wireless/ath/ath11k/mac.c:638 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by swapper/0/0. ... Call trace: ... lockdep_rcu_suspicious+0x16c/0x22c ath11k_mac_get_ar_by_pdev_id+0x194/0x1b0 [ath11k] ath11k_wmi_tlv_op_rx+0xa84/0x2c1c [ath11k] ath11k_htc_rx_completion_handler+0x388/0x510 [ath11k] Mark the code in question as an RCU read-side critical section to avoid any potential use-after-free issues. Tested-on: WCN6855 hw2.1 PCI WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.23 Fixes: a41d103 ("ath11k: add thermal sensor device support") Cc: stable@vger.kernel.org # 5.7 Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Acked-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com> Link: https://lore.kernel.org/r/20231019153115.26401-2-johan+linaro@kernel.org
borkmann
pushed a commit
that referenced
this pull request
Oct 31, 2023
Ido Schimmel says: ==================== Add MDB get support This patchset adds MDB get support, allowing user space to request a single MDB entry to be retrieved instead of dumping the entire MDB. Support is added in both the bridge and VXLAN drivers. Patches #1-#6 are small preparations in both drivers. Patches #7-#8 add the required uAPI attributes for the new functionality and the MDB get net device operation (NDO), respectively. Patches #9-#10 implement the MDB get NDO in both drivers. Patch #11 registers a handler for RTM_GETMDB messages in rtnetlink core. The handler derives the net device from the ifindex specified in the ancillary header and invokes its MDB get NDO. Patches #12-torvalds#13 add selftests by converting tests that use MDB dump with grep to the new MDB get functionality. iproute2 changes can be found here [1]. v2: * Patch #7: Add a comment to describe attributes structure. * Patch #9: Add a comment above spin_lock_bh(). [1] https://github.com/idosch/iproute2/tree/submit/mdb_get_v1 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.