Convert the ice driver to Page Pool [POC] #8

michalQb · 2024-12-09T14:09:15Z

No description provided.

The commit 5384467 ("iavf: kill "legacy-rx" for good") removes the legacy Rx in iavf driver. Now it is time for ice. Converting the page management to page_pool is a good opportunity to remove an obsolete code per the arguments listed in the original iavf commit. Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>

As in case of iavf driver, the commit 920d86f ("iavf: drop page splitting and recycling") introduces an intermediate step in page_pool conversion, perform the same step for ice. Remove all page splitting/recycling code. Just always allocate a new page and don't touch its refcount, so that it gets freed by the core stack later. Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>

The patch is the ice version of the IAVF commit 5fa4caf ("iavf: switch to Page Pool"), enhanced with XDP stuff conversion. Generic Rx path changes can be described in exactly the same way as the original IAVF commit. Now that the ice driver simply uses dev_alloc_page() + free_page() with no custom recycling logics, it can easily be switched to using Page Pool / libeth API instead. This allows to removing the whole dancing around headroom, HW buffer size, and page order. All DMA-for-device is now done in the PP core, for-CPU -- in the libeth helper. Use skb_mark_for_recycle() to bring back the recycling and restore the performance. In terms of XDP, use basic XDP helpers provided by the libeth_xdp module. For XDP_TX skip DMA mapping for each xmitted frame. Instead, map all buffers as bi-directional and rely on the page_pool page management mechanism. Thanks to this, performance of XDP_TX can be improved, especially in scenarios when IOMMU is enabled. Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>

Since the ice driver uses the same Rx path for handling regular NAPI Rx queues as well as control VSI flow director queues, the page_pool could not be applied in a generic way. Skip that design issue temporarily, just to avoid the POC crashes, by disabling the flow director completely. The issue has to be addressed after we decide to upstream that page_pool solution. Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>

KMSAN reported a use-after-free issue in eth_skb_pkt_type()[1]. The cause of the issue was that eth_skb_pkt_type() accessed skb's data that didn't contain an Ethernet header. This occurs when bpf_prog_test_run_xdp() passes an invalid value as the user_data argument to bpf_test_init(). Fix this by returning an error when user_data is less than ETH_HLEN in bpf_test_init(). Additionally, remove the check for "if (user_size > size)" as it is unnecessary. [1] BUG: KMSAN: use-after-free in eth_skb_pkt_type include/linux/etherdevice.h:627 [inline] BUG: KMSAN: use-after-free in eth_type_trans+0x4ee/0x980 net/ethernet/eth.c:165 eth_skb_pkt_type include/linux/etherdevice.h:627 [inline] eth_type_trans+0x4ee/0x980 net/ethernet/eth.c:165 __xdp_build_skb_from_frame+0x5a8/0xa50 net/core/xdp.c:635 xdp_recv_frames net/bpf/test_run.c:272 [inline] xdp_test_run_batch net/bpf/test_run.c:361 [inline] bpf_test_run_xdp_live+0x2954/0x3330 net/bpf/test_run.c:390 bpf_prog_test_run_xdp+0x148e/0x1b10 net/bpf/test_run.c:1318 bpf_prog_test_run+0x5b7/0xa30 kernel/bpf/syscall.c:4371 __sys_bpf+0x6a6/0xe20 kernel/bpf/syscall.c:5777 __do_sys_bpf kernel/bpf/syscall.c:5866 [inline] __se_sys_bpf kernel/bpf/syscall.c:5864 [inline] __x64_sys_bpf+0xa4/0xf0 kernel/bpf/syscall.c:5864 x64_sys_call+0x2ea0/0x3d90 arch/x86/include/generated/asm/syscalls_64.h:322 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xd9/0x1d0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Uninit was created at: free_pages_prepare mm/page_alloc.c:1056 [inline] free_unref_page+0x156/0x1320 mm/page_alloc.c:2657 __free_pages+0xa3/0x1b0 mm/page_alloc.c:4838 bpf_ringbuf_free kernel/bpf/ringbuf.c:226 [inline] ringbuf_map_free+0xff/0x1e0 kernel/bpf/ringbuf.c:235 bpf_map_free kernel/bpf/syscall.c:838 [inline] bpf_map_free_deferred+0x17c/0x310 kernel/bpf/syscall.c:862 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xa2b/0x1b60 kernel/workqueue.c:3310 worker_thread+0xedf/0x1550 kernel/workqueue.c:3391 kthread+0x535/0x6b0 kernel/kthread.c:389 ret_from_fork+0x6e/0x90 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 CPU: 1 UID: 0 PID: 17276 Comm: syz.1.16450 Not tainted 6.12.0-05490-g9bb88c659673 #8 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Fixes: be3d72a ("bpf: move user_size out of bpf_test_init") Reported-by: syzkaller <syzkaller@googlegroups.com> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20250121150643.671650-1-syoshida@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>

napi_schedule() is expected to be called either: * From an interrupt, where raised softirqs are handled on IRQ exit * From a softirq disabled section, where raised softirqs are handled on the next call to local_bh_enable(). * From a softirq handler, where raised softirqs are handled on the next round in do_softirq(), or further deferred to a dedicated kthread. Other bare tasks context may end up ignoring the raised NET_RX vector until the next random softirq handling opportunity, which may not happen before a while if the CPU goes idle afterwards with the tick stopped. Such "misuses" have been detected on several places thanks to messages of the kind: "NOHZ tick-stop error: local softirq work is pending, handler #8!!!" For example: __raise_softirq_irqoff __napi_schedule rtl8152_runtime_resume.isra.0 rtl8152_resume usb_resume_interface.isra.0 usb_resume_both __rpm_callback rpm_callback rpm_resume __pm_runtime_resume usb_autoresume_device usb_remote_wakeup hub_event process_one_work worker_thread kthread ret_from_fork ret_from_fork_asm And also: * drivers/net/usb/r8152.c::rtl_work_func_t * drivers/net/netdevsim/netdev.c::nsim_start_xmit There is a long history of issues of this kind: 019edd0 ("ath10k: sdio: Add missing BH locking around napi_schdule()") 3300685 ("idpf: disable local BH when scheduling napi for marker packets") e3d5d70 ("net: lan78xx: fix "softirq work is pending" error") e55c27e ("mt76: mt7615: add missing bh-disable around rx napi schedule") c0182aa ("mt76: mt7915: add missing bh-disable around tx napi enable/schedule") 970be1d ("mt76: disable BH around napi_schedule() calls") 019edd0 ("ath10k: sdio: Add missing BH locking around napi_schdule()") 30bfec4 ("can: rx-offload: can_rx_offload_threaded_irq_finish(): add new function to be called from threaded interrupt") e63052a ("mlx5e: add add missing BH locking around napi_schdule()") 83a0c6e ("i40e: Invoke softirqs after napi_reschedule") bd4ce94 ("mlx4: Invoke softirqs after napi_reschedule") 8cf699e ("mlx4: do not call napi_schedule() without care") ec13ee8 ("virtio_net: invoke softirqs after __napi_schedule") This shows that relying on the caller to arrange a proper context for the softirqs to be handled while calling napi_schedule() is very fragile and error prone. Also fixing them can also prove challenging if the caller may be called from different kinds of contexts. Therefore fix this from napi_schedule() itself with waking up ksoftirqd when softirqs are raised from task contexts. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Reported-by: Jakub Kicinski <kuba@kernel.org> Reported-by: Francois Romieu <romieu@fr.zoreil.com> Closes: https://lore.kernel.org/lkml/354a2690-9bbf-4ccb-8769-fa94707a9340@molgen.mpg.de/ Cc: Breno Leitao <leitao@debian.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250223221708.27130-1-frederic@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Chia-Yu Chang says: ==================== AccECN protocol preparation patch series Please find the v7 v7 (03-Mar-2025) - Move 2 new patches added in v6 to the next AccECN patch series v6 (27-Dec-2024) - Avoid removing removing the potential CA_ACK_WIN_UPDATE in ack_ev_flags of patch #1 (Eric Dumazet <edumazet@google.com>) - Add reviewed-by tag in patches #2, #3, #4, #5, #6, #7, #8, alobakin#12, alobakin#14 - Foloiwng 2 new pathces are added after patch alobakin#9 (Patch that adds SKB_GSO_TCP_ACCECN) * New patch alobakin#10 to replace exisiting SKB_GSO_TCP_ECN with SKB_GSO_TCP_ACCECN in the driver to avoid CWR flag corruption * New patch alobakin#11 adds AccECN for virtio by adding new negotiation flag (VIRTIO_NET_F_HOST/GUEST_ACCECN) in feature handshake and translating Accurate ECN GSO flag between virtio_net_hdr (VIRTIO_NET_HDR_GSO_ACCECN) and skb header (SKB_GSO_TCP_ACCECN) - Add detailed changelog and comments in alobakin#13 (Eric Dumazet <edumazet@google.com>) - Move patch alobakin#14 to the next AccECN patch series (Eric Dumazet <edumazet@google.com>) v5 (5-Nov-2024) - Add helper function "tcp_flags_ntohs" to preserve last 2 bytes of TCP flags of patch #4 (Paolo Abeni <pabeni@redhat.com>) - Fix reverse X-max tree order of patches #4, alobakin#11 (Paolo Abeni <pabeni@redhat.com>) - Rename variable "delta" as "timestamp_delta" of patch #2 fo clariety - Remove patch alobakin#14 in this series (Paolo Abeni <pabeni@redhat.com>, Joel Granados <joel.granados@kernel.org>) v4 (21-Oct-2024) - Fix line length warning of patches #2, #4, #8, alobakin#10, alobakin#11, alobakin#14 - Fix spaces preferred around '|' (ctx:VxV) warning of patch #7 - Add missing CC'ed of patches #4, alobakin#12, alobakin#14 v3 (19-Oct-2024) - Fix build error in v2 v2 (18-Oct-2024) - Fix warning caused by NETIF_F_GSO_ACCECN_BIT in patch alobakin#9 (Jakub Kicinski <kuba@kernel.org>) The full patch series can be found in https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/ The Accurate ECN draft can be found in https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

michalQb added 4 commits December 9, 2024 14:15

michalQb requested a review from alobakin December 9, 2024 14:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Convert the ice driver to Page Pool [POC] #8

Convert the ice driver to Page Pool [POC] #8

michalQb commented Dec 9, 2024

Convert the ice driver to Page Pool [POC] #8

Are you sure you want to change the base?

Convert the ice driver to Page Pool [POC] #8

Conversation

michalQb commented Dec 9, 2024