netdev CI testing #6666

Since commit 71ae2cb ("net: plip: Fix fall-through warnings for Clang") plip was not able to send any packets, this patch replaces one unintended break; with fallthrough; which was originally missed by commit 9525d69 ("net: plip: mark expected switch fall-throughs"). I have verified with a real hardware PLIP connection that everything works once again after applying this patch. Fixes: 71ae2cb ("net: plip: Fix fall-through warnings for Clang") Signed-off-by: Jakub Boehm <boehm.jakub@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Message-ID: <20241015-net-plip-tx-fix-v1-1-32d8be1c7e0b@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Simon has been diligently and consistently reviewing networking changes for at least as long as our development statistics go back. Often if not usually topping the list of reviewers. Make his role official. Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Message-ID: <20241015153005.2854018-1-kuba@kernel.org> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

This change fixes a rare issue where the PHY fails to detect a link due to incorrect reset behavior. The SW_RESET definition was incorrectly assigned to bit 14, which is the Digital Restart bit according to the datasheet. This commit corrects SW_RESET to bit 15 and assigns DIG_RESTART to bit 14 as per the datasheet specifications. The SW_RESET define is only used in the phy_reset function, which fully re-initializes the PHY after the reset is performed. The change in the bit definitions should not have any negative impact on the functionality of the PHY. v2: - added Fixes tag - improved commit message Cc: stable@vger.kernel.org Fixes: 5dc39fd ("net: phy: DP83822: Add ability to advertise Fiber connection") Signed-off-by: Alex Michel <alex.michel@wiedemann-group.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Message-ID: <AS1P250MB0608A798661549BF83C4B43EA9462@AS1P250MB0608.EURP250.PROD.OUTLOOK.COM> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

In netpoll configuration the completion processing can happen in hard irq context which will break with spin_lock_bh() for fullfilling RX timestamp in case of all packets timestamping. Replace it with spin_lock_irqsave() variant. Fixes: 7f5515d ("bnxt_en: Get the RX packet timestamp") Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Message-ID: <20241016195234.2622004-1-vadfed@meta.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

The common code with some packet and index manipulations is extracted and moved to newly implemented helper to make the code more readable and avoid duplication. This is a preparation for skb allocation failure handling. Found by Linux Verification Center (linuxtesting.org) with SVACE. Suggested-by: Simon Horman <horms@kernel.org> Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

…_rx() build_skb() returns NULL in case of a memory allocation failure so handle it inside __octep_oq_process_rx() to avoid NULL pointer dereference. __octep_oq_process_rx() is called during NAPI polling by the driver. If skb allocation fails, keep on pulling packets out of the Rx DMA queue: we shouldn't break the polling immediately and thus falsely indicate to the octep_napi_poll() that the Rx pressure is going down. As there is no associated skb in this case, don't process the packets and don't push them up the network stack - they are skipped. Helper function is implemented to unmmap/flush all the fragment buffers used by the dropped packet. 'alloc_failures' counter is incremented to mark the skb allocation error in driver statistics. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 37d79d0 ("octeon_ep: add Tx/Rx processing and interrupt support") Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

mv88e6393x_port_set_policy doesn't correctly shift the ptr value when converting the policy format between the old and new styles, so the target register ends up with the ptr being written over the data bits. Shift the pointer to align with the format expected by mv88e6393x_port_policy_write(). Fixes: 6584b26 ("net: dsa: mv88e6xxx: implement .port_set_policy for Amethyst") Signed-off-by: Peter Rashleigh <peter@rashleigh.ca> Reviewed-by: Simon Horman <horms@kernel.org> Message-ID: <20241016040822.3917-1-peter@rashleigh.ca> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Mapping all my previously used emails to my kernel.org email. Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Message-ID: <172909057364.2452383.8019986488234344607.stgit@firesoul> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

samples/pktgen is missing in the MAINTAINERS file. Suggested-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Message-ID: <20241018005301.10052-1-liuhangbin@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

fbnic fails to link as built-in when PTP support is in a loadable module: aarch64-linux-ld: drivers/net/ethernet/meta/fbnic/fbnic_ethtool.o: in function `fbnic_get_ts_info': fbnic_ethtool.c:(.text+0x428): undefined reference to `ptp_clock_index' aarch64-linux-ld: drivers/net/ethernet/meta/fbnic/fbnic_time.o: in function `fbnic_time_start': fbnic_time.c:(.text+0x820): undefined reference to `ptp_schedule_worker' aarch64-linux-ld: drivers/net/ethernet/meta/fbnic/fbnic_time.o: in function `fbnic_ptp_setup': fbnic_time.c:(.text+0xa68): undefined reference to `ptp_clock_register' Add the appropriate dependency to enforce this. Fixes: 6a2b3ed ("eth: fbnic: add RX packets timestamping support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Message-ID: <20241016062303.2551686-1-arnd@kernel.org> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

There's not really a benefit here in taking the RTNL lock. The task handler does exception handling only, so we're in trouble anyway when we come here, and there's no need to protect against e.g. a parallel ethtool call. A benefit of removing the RTNL lock here is that we now can synchronously cancel the workqueue from a context holding the RTNL mutex. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

So far we use a custom flag to define when a task can be scheduled and when not. Let's use the standard mechanism with disable_work() et al instead. Note that in rtl8169_close() we can remove the call to cancel_work() because we now call disable_work_sync() in rtl8169_down() already. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

… to warn level In case of a problem with firmware loading we inform at the driver level, in addition the firmware load code itself issues warnings. Therefore switch to firmware_request_nowarn() to avoid duplicated error messages. In addition switch to warn level because the firmware is optional and typically just fixes compatibility issues. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Message-ID: <d9c5094c-89a6-40e2-b5fe-8df7df4624ef@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Remove rtl_dash_loop_wait_high/low to simplify the code. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Message-ID: <fb8c490c-2d92-48f5-8bbf-1fc1f2ee1649@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

This patch fixes spelling errors, re-arrange vars with reverse Xmas tree and remove unnecessary parens in mediatek-ge-soc.c. Signed-off-by: SkyLake.Huang <skylake.huang@mediatek.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

This patch shrinks line wrapping to 80 chars. Also, in tx_amp_fill_result(), use FIELD_PREP() to prettify code. Signed-off-by: SkyLake.Huang <skylake.huang@mediatek.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

This patch propagates error code correctly in cal_cycle() and improve with FIELD_GET(). Signed-off-by: SkyLake.Huang <skylake.huang@mediatek.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Run airoha_qdma_cleanup_tx_queue() in ndo_stop callback in order to unmap pending skbs. Moreover, reset BQL txq state stopping the netdevice, Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Hariprasad Kelam <hkelam@marvell.com> Message-ID: <20241017-airoha-en7581-reset-bql-v1-1-08c0c9888de5@kernel.org> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

The first boards show up with Realtek's RTL8125D. This MAC/PHY chip comes with an integrated 2.5Gbps PHY with ID 0x001cc841. It's not clear yet whether there's an external version of this PHY and how Realtek calls it, therefore use the numeric id for now. Link: https://lore.kernel.org/netdev/2ada65e1-5dfa-456c-9334-2bc51272e9da@gmail.com/T/ Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Message-ID: <7d2924de-053b-44d2-a479-870dc3878170@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Register a6d/12 is shadowing register MDIO_AN_EEE_ADV2. So this line disables advertisement of EEE at 2.5G. Latest vendor driver r8125 doesn't do this (any longer?), so this mode seems to be safe. EEE saves quite some energy, therefore enable this mode per default. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Message-ID: <95dd5a0c-09ea-4847-94d9-b7aa3063e8ff@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

There are some spelling mistakes of 'accelaration', 'exprienced' and 'rewritting' in comments which should be 'acceleration', 'experienced' and 'rewriting'. Suggested-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/all/20241017162846.GA51712@kernel.org/ Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Message-ID: <90D42CB167CA0842+20241018021910.31359-1-wangyuli@uniontech.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

In NC-SI specification, NC-SI is using RMII, not MII. Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Message-ID: <20241018053331.1900100-1-jacky_chou@aspeedtech.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

# Conflicts: # drivers/net/ethernet/mellanox/mlx5/core/eswitch.c

Seems Alcatel Lucent G-010S-P also have the same problem that it uses TX_FAULT pin for SOC uart. So apply sfp_fixup_ignore_tx_fault to it. Signed-off-by: Shengyu Qu <wiagn233@outlook.com> Signed-off-by: NipaLocal <nipa@local>

In mac_probe() there are calls to of_find_device_by_node() which takes references to of_dev->dev. These references are not saved and not released later on error path in mac_probe() and in mac_remove(). Add new fields into mac_device structure to save references taken for future use in mac_probe() and mac_remove(). This is a preparation for further reference leaks fix. Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Signed-off-by: NipaLocal <nipa@local>

In mac_probe() there are multiple calls to of_find_device_by_node(), fman_bind() and fman_port_bind() which takes references to of_dev->dev. Not all references taken by these calls are released later on error path in mac_probe() and in mac_remove() which lead to reference leaks. Add references release. Fixes: 3933961 ("fsl/fman: Add FMan MAC driver") Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Signed-off-by: NipaLocal <nipa@local>

Add pci table supported in this module, and implement pci_driver function to initialize this driver. hibmcge is a passthrough network device. Its software runs on the host side, and the MAC hardware runs on the BMC side to reduce the host CPU area. The software interacts with the MAC hardware through the PCIe. ┌─────────────────────────┐ │ HOST CPU network device │ │ ┌──────────────┐ │ │ │hibmcge driver│ │ │ └─────┬─┬──────┘ │ │ │ │ │ │HOST ┌───┴─┴───┐ │ │ │ PCIE RC │ │ └──────┴───┬─┬───┴────────┘ │ │ PCIE │ │ ┌──────┬───┴─┴───┬────────┐ │ │ PCIE EP │ │ │BMC └───┬─┬───┘ │ │ │ │ │ │ ┌────────┴─┴──────────┐ │ │ │ GE │ │ │ │ ┌─────┐ ┌─────┐ │ │ │ │ │ MAC │ │ MAC │ │ │ └─┴─┼─────┼────┼─────┼──┴─┘ │ PHY │ │ PHY │ └─────┘ └─────┘ Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Add support for to read and write registers through the pic bar space. Some driver parameters, such as mac_id, are determined by the board form. Therefore, these parameters are initialized from the register as device specifications. the device specifications register are initialized and written by bmc. driver will read these registers when loading. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

…odule Implements the C22 read and write PHY registers interfaces. Some hardware interfaces related to the PHY are also implemented in this patch. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

The driver supports four interrupts: TX interrupt, RX interrupt, mdio interrupt, and error interrupt. Actually, the driver does not use the mdio interrupt. Therefore, the driver does not request the mdio interrupt. The error interrupt distinguishes different error information by using different masks. To distinguish different errors, the statistics count is added for each error. To ensure the consistency of the code process, masks are added for the TX interrupt and RX interrupt. This patch implements interrupt request, and provides a unified entry for the interrupt handler function. However, the specific interrupt handler function of each interrupt is not implemented currently. Because of pcim_enable_device(), the interrupt vector is already device managed and does not need to be free actively. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Implement the .ndo_open() .ndo_stop() .ndo_set_mac_address() and .ndo_change_mtu functions(). And .ndo_validate_addr calls the eth_validate_addr function directly Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: NipaLocal <nipa@local>

Implement .ndo_start_xmit function to fill the information of the packet to be transmitted into the tx descriptor, and then the hardware will transmit the packet using the information in the tx descriptor. In addition, we also implemented the tx_handler function to enable the tx descriptor to be reused, and .ndo_tx_timeout function to print some information when the hardware is busy. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Implement rx_poll function to read the rx descriptor after receiving the rx interrupt. Adjust the skb based on the descriptor to complete the reception of the packet. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Implement the .get_drvinfo .get_link .get_link_ksettings to get the basic information and working status of the driver. Implement the .set_link_ksettings to modify the rate, duplex, and auto-negotiation status. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: NipaLocal <nipa@local>

Add a Makefile and update Kconfig to build hibmcge driver. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Add myself as the maintainer for the hibmcge ethernet driver. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

The variable wwan_rtnl_link_ops assign a *bigger* maxtype which leads to a global out-of-bounds read when parsing the netlink attributes. Exactly same bug cause as the oob fixed in commit b33fb5b ("net: qualcomm: rmnet: fix global oob in rmnet_policy"). ================================================================== BUG: KASAN: global-out-of-bounds in validate_nla lib/nlattr.c:388 [inline] BUG: KASAN: global-out-of-bounds in __nla_validate_parse+0x19d7/0x29a0 lib/nlattr.c:603 Read of size 1 at addr ffffffff8b09cb60 by task syz.1.66276/323862 CPU: 0 PID: 323862 Comm: syz.1.66276 Not tainted 6.1.70 kernel-patches#1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x177/0x231 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:284 [inline] print_report+0x14f/0x750 mm/kasan/report.c:395 kasan_report+0x139/0x170 mm/kasan/report.c:495 validate_nla lib/nlattr.c:388 [inline] __nla_validate_parse+0x19d7/0x29a0 lib/nlattr.c:603 __nla_parse+0x3c/0x50 lib/nlattr.c:700 nla_parse_nested_deprecated include/net/netlink.h:1269 [inline] __rtnl_newlink net/core/rtnetlink.c:3514 [inline] rtnl_newlink+0x7bc/0x1fd0 net/core/rtnetlink.c:3623 rtnetlink_rcv_msg+0x794/0xef0 net/core/rtnetlink.c:6122 netlink_rcv_skb+0x1de/0x420 net/netlink/af_netlink.c:2508 netlink_unicast_kernel net/netlink/af_netlink.c:1326 [inline] netlink_unicast+0x74b/0x8c0 net/netlink/af_netlink.c:1352 netlink_sendmsg+0x882/0xb90 net/netlink/af_netlink.c:1874 sock_sendmsg_nosec net/socket.c:716 [inline] __sock_sendmsg net/socket.c:728 [inline] ____sys_sendmsg+0x5cc/0x8f0 net/socket.c:2499 ___sys_sendmsg+0x21c/0x290 net/socket.c:2553 __sys_sendmsg net/socket.c:2582 [inline] __do_sys_sendmsg net/socket.c:2591 [inline] __se_sys_sendmsg+0x19e/0x270 net/socket.c:2589 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x45/0x90 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f67b19a24ad RSP: 002b:00007f67b17febb8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f67b1b45f80 RCX: 00007f67b19a24ad RDX: 0000000000000000 RSI: 0000000020005e40 RDI: 0000000000000004 RBP: 00007f67b1a1e01d R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd2513764f R14: 00007ffd251376e0 R15: 00007f67b17fed40 </TASK> The buggy address belongs to the variable: wwan_rtnl_policy+0x20/0x40 The buggy address belongs to the physical page: page:ffffea00002c2700 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xb09c flags: 0xfff00000001000(reserved|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000001000 ffffea00002c2708 ffffea00002c2708 0000000000000000 raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner info is not present (never set?) Memory state around the buggy address: ffffffff8b09ca00: 05 f9 f9 f9 05 f9 f9 f9 00 01 f9 f9 00 01 f9 f9 ffffffff8b09ca80: 00 00 00 05 f9 f9 f9 f9 00 00 03 f9 f9 f9 f9 f9 >ffffffff8b09cb00: 00 00 00 00 05 f9 f9 f9 00 00 00 00 f9 f9 f9 f9 ^ ffffffff8b09cb80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== According to the comment of `nla_parse_nested_deprecated`, use correct size `IFLA_WWAN_MAX` here to fix this issue. Fixes: 88b7105 ("wwan: add interface creation support") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Loic Poulain <loic.poulain@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

The only caller of __fib_validate_source() is fib_validate_source(), so we can combine fib_validate_source() into __fib_validate_source(), and make fib_validate_source() an inline call to __fib_validate_source(). This will make it easier to make fib_validate_source() return drop reasons in the next patch. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: NipaLocal <nipa@local>

In this commit, we make __fib_validate_source return -reason instead of errno on error. The return value of __fib_validate_source can be -errno, 0, and 1. It's hard to make __fib_validate_source() return drop reasons directly. The __fib_validate_source() will return 1 if the scope of the source(revert) route is HOST. And the __mkroute_input() will mark the skb with IPSKB_DOREDIRECT in this case (combine with some other conditions). And then, a REDIRECT ICMP will be sent in ip_forward() if this flag exists. We can't pass this information to __mkroute_input if we make __fib_validate_source() return drop reasons. However, we can make fib_validate_source() return drop reasons, and call __fib_validate_source() directly in __mkroute_input(). In the origin logic, LINUX_MIB_IPRPFILTER will be counted if __fib_validate_source() return -EXDEV. And now, we need to adjust it by checking "reason == SKB_DROP_REASON_IP_RPFILTER". However, this will take effect only after the patch "net: ip: make ip_route_input_noref() return drop reasons", as we can't pass the drop reasons from fib_validate_source() to ip_rcv_finish_core() in this patch. Following new drop reasons are added in this patch: SKB_DROP_REASON_IP_LOCAL_SOURCE SKB_DROP_REASON_IP_INVALID_SOURCE Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: NipaLocal <nipa@local>

Make ip_route_input_mc() return drop reason, and adjust the call of it in ip_route_input_rcu(). Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: NipaLocal <nipa@local>

Make ip_mc_validate_source() return drop reason, and adjust the call of it in ip_route_input_mc(). Another caller of it is ip_rcv_finish_core->udp_v4_early_demux, and the errno is not checked in detail, so we don't do more adjustment for it. The drop reason "SKB_DROP_REASON_IP_LOCALNET" is added in this commit. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: NipaLocal <nipa@local>

In this commit, we make ip_route_input_slow() return skb drop reasons, and following new skb drop reasons are added: SKB_DROP_REASON_IP_INVALID_DEST The only caller of ip_route_input_slow() is ip_route_input_rcu(), and we adjust it by making it return -EINVAL on error. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: NipaLocal <nipa@local>

In this commit, we make ip_route_input_rcu() return drop reasons, which come from ip_route_input_mc() and ip_route_input_slow(). The only caller of ip_route_input_rcu() is ip_route_input_noref(). We adjust it by making it return -EINVAL on error and ignore the reasons that ip_route_input_rcu() returns. In the following patch, we will make ip_route_input_noref() returns the drop reasons. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: NipaLocal <nipa@local>

In this commit, we make ip_route_input_noref() return drop reasons, which come from ip_route_input_rcu(). We need adjust the callers of ip_route_input_noref() to make sure the return value of ip_route_input_noref() is used properly. The errno that ip_route_input_noref() returns comes from ip_route_input and bpf_lwt_input_reroute in the origin logic, and we make them return -EINVAL on error instead. In the following patch, we will make ip_route_input() returns drop reasons too. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: NipaLocal <nipa@local>

In this commit, we make ip_route_input() return skb drop reasons that come from ip_route_input_noref(). Meanwhile, adjust all the call to it. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: NipaLocal <nipa@local>

In this commit, we make ip_mkroute_input() and __mkroute_input() return drop reasons. The drop reason "SKB_DROP_REASON_ARP_PVLAN_DISABLE" is introduced for the case: the packet which is not IP is forwarded to the in_dev, and the proxy_arp_pvlan is not enabled. This name is ugly, and I have not figure out a suitable name for this case yet :/ Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: NipaLocal <nipa@local>

In this commit, we make ip_route_use_hint() return drop reasons. The drop reasons that we return are similar to what we do in ip_route_input_slow(), and no drop reasons are added in this commit. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: NipaLocal <nipa@local>

Some workloads hit the infamous dev_watchdog() message: "NETDEV WATCHDOG: eth0 (xxxx): transmit queue XX timed out" It seems possible to hit this even for perfectly normal BQL enabled drivers: 1) Assume a TX queue was idle for more than dev->watchdog_timeo (5 seconds unless changed by the driver) 2) Assume a big packet is sent, exceeding current BQL limit. 3) Driver ndo_start_xmit() puts the packet in TX ring, and netdev_tx_sent_queue() is called. 4) QUEUE_STATE_STACK_XOFF could be set from netdev_tx_sent_queue() before txq->trans_start has been written. 5) txq->trans_start is written later, from netdev_start_xmit() if (rc == NETDEV_TX_OK) txq_trans_update(txq) dev_watchdog() running on another cpu could read the old txq->trans_start, and then see QUEUE_STATE_STACK_XOFF, because 5) did not happen yet. To solve the issue, write txq->trans_start right before one XOFF bit is set : - _QUEUE_STATE_DRV_XOFF from netif_tx_stop_queue() - __QUEUE_STATE_STACK_XOFF from netdev_tx_sent_queue() From dev_watchdog(), we have to read txq->state before txq->trans_start. Add memory barriers to enforce correct ordering. In the future, we could avoid writing over txq->trans_start for normal operations, and rename this field to txq->xoff_start_time. Fixes: bec251b ("net: no longer stop all TX queues in dev_watchdog()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: NipaLocal <nipa@local>

Bonding only supports native XDP for specific modes, which can lead to confusion for users regarding why XDP loads successfully at times and fails at others. This patch enhances error handling by returning detailed error messages, providing users with clearer insights into the specific reasons for the failure when loading native XDP. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: NipaLocal <nipa@local>

When a slave already has an XDP program loaded, the correct return value should be -EEXIST instead of -EOPNOTSUPP. Fixes: 9e2ee5c ("net, bonding: Add XDP support to the bonding driver") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: NipaLocal <nipa@local>

Add document about which modes have native XDP support. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: NipaLocal <nipa@local>

RTC fields are located in bits [1:0]. Correct the _MASK and _SHIFT macros to use the appropriate mask and shift. Fixes: 35f74c0 ("stmmac: add GMAC4 DMA/CORE Header File") Signed-off-by: Ley Foon Tan <leyfoon.tan@starfivetech.com> Signed-off-by: NipaLocal <nipa@local>

In order to mask off the bits, we need to use the '~' operator to invert all the bits of _MASK and clear them. Fixes: 48863ce ("stmmac: add DMA support for GMAC 4.xx") Signed-off-by: Ley Foon Tan <leyfoon.tan@starfivetech.com> Signed-off-by: NipaLocal <nipa@local>

…rrupt summary The Receive Watchdog Timeout (RWT, bit[9]) is not part of Abnormal Interrupt Summary (AIS). Move the RWT handling out of the AIS condition statement. From databook, the AIS is the logical OR of the following interrupt bits: - Bit 1: Transmit Process Stopped - Bit 7: Receive Buffer Unavailable - Bit 8: Receive Process Stopped - Bit 10: Early Transmit Interrupt - Bit 12: Fatal Bus Error - Bit 13: Context Descriptor Error Fixes: 48863ce ("stmmac: add DMA support for GMAC 4.xx") Signed-off-by: Ley Foon Tan <leyfoon.tan@starfivetech.com> Signed-off-by: NipaLocal <nipa@local>

… from register values The high address will display as 0 if the driver does not set the reg_space[]. To fix this, read the high address registers and update the reg_space[] accordingly. Fixes: fbf6822 ("net: stmmac: unify registers dumps methods") Signed-off-by: Ley Foon Tan <leyfoon.tan@starfivetech.com> Signed-off-by: NipaLocal <nipa@local>

The function pcim_iounmap_regions() is problematic because it uses a bitmask mechanism to release / iounmap multiple BARs at once. It, thus, prevents getting rid of the problematic iomap table mechanism which was deprecated in commit e354bb8 ("PCI: Deprecate pcim_iomap_table(), pcim_iomap_regions_request_all()"). pcim_iounmap_region() does not have that problem. Make it public as the successor of pcim_iounmap_regions(). Signed-off-by: Philipp Stanner <pstanner@redhat.com> Signed-off-by: NipaLocal <nipa@local>

pcim_ionumap_region() has recently been made a public function and does not have the disadvantage of having to deal with the legacy iomap table, as pcim_iounmap_regions() does. Deprecate pcim_iounmap_regions(). Signed-off-by: Philipp Stanner <pstanner@redhat.com> Signed-off-by: NipaLocal <nipa@local>

pcim_iomap_regions() and pcim_iomap_table() have been deprecated by the PCI subsystem in commit e354bb8 ("PCI: Deprecate pcim_iomap_table(), pcim_iomap_regions_request_all()"). Port dfl-pci.c to the successor, pcim_iomap_region(). Consistently, replace pcim_iounmap_regions() with pcim_iounmap_region(). Signed-off-by: Philipp Stanner <pstanner@redhat.com> Reviewed-by: Andy Shevchenko <andy@kernel.org> Acked-by: Xu Yilun <yilun.xu@intel.com> Signed-off-by: NipaLocal <nipa@local>

pcim_iomap_regions() and pcim_iomap_table() have been deprecated by the PCI subsystem in commit e354bb8 ("PCI: Deprecate pcim_iomap_table(), pcim_iomap_regions_request_all()"). In mtip32xx, these functions can easily be replaced by their respective successors, pcim_request_region() and pcim_iomap(). Moreover, the driver's calls to pcim_iounmap_regions() in probe()'s error path and in remove() are not necessary. Cleanup can be performed by PCI devres automatically. Replace pcim_iomap_regions() and pcim_iomap_table(). Remove the calls to pcim_iounmap_regions(). Signed-off-by: Philipp Stanner <pstanner@redhat.com> Reviewed-by: Andy Shevchenko <andy@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: NipaLocal <nipa@local>

pcim_iomap_regions() and pcim_iomap_table() have been deprecated by the PCI subsystem in commit e354bb8 ("PCI: Deprecate pcim_iomap_table(), pcim_iomap_regions_request_all()"). Replace those functions with calls to pcim_iomap_region(). Signed-off-by: Philipp Stanner <pstanner@redhat.com> Reviewed-by: Andy Shevchenko <andy@kernel.org> Acked-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Signed-off-by: NipaLocal <nipa@local>

pcim_iomap_regions() and pcim_iomap_table() have been deprecated by the PCI subsystem in commit e354bb8 ("PCI: Deprecate pcim_iomap_table(), pcim_iomap_regions_request_all()"). Replace the deprecated PCI functions with their successors. Signed-off-by: Philipp Stanner <pstanner@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: NipaLocal <nipa@local>

Static analysis on linux-next has detected the following issue in function virtnet_stats_ctx_init, in drivers/net/virtio_net.c : if (vi->device_stats_cap & VIRTIO_NET_STATS_TYPE_CVQ) { queue_type = VIRTNET_Q_TYPE_CQ; ctx->bitmap[queue_type] |= VIRTIO_NET_STATS_TYPE_CVQ; ctx->desc_num[queue_type] += ARRAY_SIZE(virtnet_stats_cvq_desc); ctx->size[queue_type] += sizeof(struct virtio_net_stats_cvq); } ctx->bitmap is declared as a u32 however it is being bit-wise or'd with VIRTIO_NET_STATS_TYPE_CVQ and this is defined as 1 << 32: include/uapi/linux/virtio_net.h:#define VIRTIO_NET_STATS_TYPE_CVQ (1ULL << 32) ..and hence the bit-wise or operation won't set any bits in ctx->bitmap because 1ULL < 32 is too wide for a u32. In fact, the field is read into a u64: u64 offset, bitmap; .... bitmap = ctx->bitmap[queue_type]; so to fix, it is enough to make bitmap an array of u64. Fixes: 941168f ("virtio_net: support device stats") Reported-by: "Colin King (gmail)" <colin.i.king@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: NipaLocal <nipa@local>

Introduce `esw_qos_create_group_sched_elem` to handle the creation of group scheduling elements for E-Switch QoS, Transmit Scheduling Arbiter (TSAR). This reduces duplication and simplifies code for TSAR setup. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Signed-off-by: NipaLocal <nipa@local>

Introduce the `sched_node_type` enum to represent both the group and its members as scheduling nodes in the rate hierarchy. Add the `type` field to the rate group structure to specify the type of the node membership in the rate hierarchy. Generalize comments to reflect this flexibility within the rate group structure. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Signed-off-by: NipaLocal <nipa@local>

Introduce a `parent` field in the `mlx5_esw_rate_group` structure to support hierarchical group relationships. The `parent` can reference another group or be set to `NULL`, indicating the group is connected to the root TSAR. This change enables the ability to manage groups in a hierarchical structure for future enhancements. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Signed-off-by: NipaLocal <nipa@local>

Update the logic for adding rate groups to the E-Switch domain list, ensuring only groups with the root Transmit Scheduling Arbiter as their parent are included. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Signed-off-by: NipaLocal <nipa@local>

Rename the `group` field in the `mlx5_vport` structure to `parent` to clarify the vport's role as a member of a parent group and distinguish it from the concept of a general group. Additionally, rename `group_entry` to `parent_entry` to reflect this update. This distinction will be important for handling more complex group structures and scheduling elements. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Signed-off-by: NipaLocal <nipa@local>

Introduce the `mlx5_esw_sched_node` struct, consolidating all rate hierarchy related details, including membership and scheduling parameters. Since the group concept aligns with the `mlx5_esw_sched_node`, replace the `mlx5_esw_rate_group` struct with it and rename the "group" terminology to "node" throughout the rate hierarchy. All relevant code paths and structures have been updated to use the "node" terminology accordingly, laying the groundwork for future patches that will unify the handling of different types of members within the rate hierarchy. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Signed-off-by: NipaLocal <nipa@local>

Modify the vport scheduling element creation function to get the parent node directly, aligning it with the group creation function. This ensures a consistent flow for scheduling elements creation, as the parent nodes already contain the device and parent element index. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Refactor the vport QoS structure by moving group membership and scheduling details into the `mlx5_esw_sched_node` structure. This change consolidates the vport into the rate hierarchy by unifying the handling of different types of scheduling element nodes. In addition, add a direct reference to the mlx5_vport within the mlx5_esw_sched_node structure, to ensure that the vport is easily accessible when a scheduling node is associated with a vport. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Remove the `enabled` flag from the `vport->qos` struct, as QoS now relies solely on the `sched_node` pointer to determine whether QoS features are in use. Currently, the vport `qos` struct consists only of the `sched_node`, introducing an unnecessary two-level reference. However, the qos struct is retained as it will be extended in future patches to support new QoS features. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Simplify the configuration of QoS scheduling elements by removing the separate functions `esw_qos_node_config` and `esw_qos_vport_config`. Instead, directly use the existing `esw_qos_sched_elem_config` function for both nodes and vports. This unification helps in generalizing operations on scheduling elements nodes. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Refactor QoS normalization and rate calculation functions to operate on mlx5_esw_sched_node, allowing for generalized handling of both vports and nodes. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

On sync reset flow, firmware may request a PF, which already acknowledged the unload event, to move to drop mode. Drop mode means that this PF will reduce polling frequency, as this PF is not going to have another active part in the reset, but only reload back after the reset. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Currently, when VFs are created, two flow tables are added for the eswitch: the "fdb" table, which contains rules for each VF and the "vepa_fdb" table. In the default VEB mode, the vepa_fdb table is empty. When switching to VEPA mode, flow steering rules are added to vepa_fdb. Even though the vepa_fdb table is empty in VEB mode, its presence adds some cost to packet processing. In some workloads, this leads to drops which are reported by the rx_discards_phy ethtool counter. In order to improve performance, only create vepa_fdb when in VEPA mode. Tests were done on a ConnectX-6 Lx adapter forwarding 64B packets between both ports using dpdk-testpmd. Numbers are Rx-pps for each port, as reported by testpmd. Without changes: traffic to unknown mac testpmd on PF numvfs=0,0 35257998,35264499 numvfs=1,1 24590124,24590888 testpmd on VF with numvfs=1,1 20434338,20434887 traffic to VF mac testpmd on VF with numvfs=1,1 30341014,30340749 With changes: traffic to unknown mac testpmd on PF numvfs=0,0 35404361,35383378 numvfs=1,1 29801247,29790757 testpmd on VF with numvfs=1,1 24310435,24309084 traffic to VF mac testpmd on VF with numvfs=1,1 34811436,34781706 Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

As preparation for HW Steering support, rename packet reformat struct member action to fs_dr_action, to distinguish from fs_hws_action which will be added. Add a pointer where needed to keep code line shorter and more readable. Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

As preparation for HW Steering support, rename modify header struct member action to fs_dr_action, to distinguish from fs_hws_action which will be added. Add a pointer where needed to keep code line shorter and more readable. Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

That should make it possible to do expected payload validation on the caller side. Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

So we can plug the other ones in the future if needed. Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

There is a bunch of places where error() calls look out of place. Use the same error(1, errno, ...) pattern everywhere. Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

Support 3-tuple filtering by making client_ip optional. When -c is not passed, don't specify src-ip/src-port in the filter. Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

To make it clear what's required and what's not. Also, some of the values don't seem like a good defaults; for example eth1. Move the invocation comment to the top, add missing -s to the client and cleanup the client invocation a bit to make more readable. Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

Use dualstack socket to support both v4 and v6. v4-mapped-v6 address can be used to do v4. Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

ntuple off/on might be not enough to do it on all NICs. Add a bunch of shell crap to explicitly remove the rules. Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

In the next patch the hard-coded queue numbers are gonna be removed. So introduce some initial support for ethtool YNL and use it to enable header split. Also, tcp-data-split requires latest ethtool which is unlikely to be present in the distros right now. (ideally, we should not shell out to ethtool at all). Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

Use single last queue of the device and probe it dynamically. Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

…provided This will be used as a 'probe' mode in the selftest to check whether the device supports the devmem or not. Use hard-coded queue layout (two last queues) and prevent user from passing custom -q and/or -t. Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

This is where all the tests that depend on the HW functionality live in and this is where the automated test is gonna be added in the next patch. Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

Only RX side for now and small message to test the setup. In the future, we can extend it to TX side and to testing both sides with a couple of megs of data. make \ -C tools/testing/selftests \ TARGETS="drivers/hw/net" \ install INSTALL_PATH=~/tmp/ksft scp ~/tmp/ksft ${HOST}: scp ~/tmp/ksft ${PEER}: cfg+="NETIF=${DEV}\n" cfg+="LOCAL_V6=${HOST_IP}\n" cfg+="REMOTE_V6=${PEER_IP}\n" cfg+="REMOTE_TYPE=ssh\n" cfg+="REMOTE_ARGS=root@${PEER}\n" echo -e "$cfg" | ssh root@${HOST} "cat > ksft/drivers/net/net.config" ssh root@${HOST} "cd ksft && ./run_kselftest.sh -t drivers/net:devmem.py" Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: NipaLocal <nipa@local>

…kernel/git/bluetooth/bluetooth bluetooth pull request for net: - ISO: Fix multiple init when debugfs is disabled - Call iso_exit() on module unload - Remove debugfs directory on module init failure - btusb: Fix not being able to reconnect after suspend - btusb: Fix regression with fake CSR controllers 0a12:0001 - bnep: fix wild-memory-access in proto_unregister Signed-off-by: NipaLocal <nipa@local> # -----BEGIN PGP SIGNATURE----- # # iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmcQJGUZHGx1aXoudm9u # LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKeWiD/9WIrQRp7K10hc7Q9vjyex+ # DA+yv3EaFbqk//BIXOAg3En+HjhrDeaIkrt0L3m5D+/Y5K/oGQeO0ffOu4aF3So2 # D4Q8ZDQHXJ7pxG2hVMHbBBUikiKbt8HATMdQDutJnDtIpxxtB8ugTMm/pRG57Nq8 # QjWdy5h2aYm5NwPuu/ErY26UpCljMoOrKAMiWQ539AkxQJGVN4hp4l6mVrZdnjc5 # MmL1+S1r/S3PBxP5Mw3VX58c7J1Ki28d80BF+1C4qDGjKXdLoqYpCy2CRyiqlFSo # 9yU37Uh1AeobbWGC8/7dIKN3PoXe1UQRhF56EdWAeTbz+yr0c8sRYsOs9SGEyWlj # +PWFkpXF8Dpxs96BHf0iLy4XxtByJsdmGaUSSmdktUK1WWEOa8I51xtR0xVswOqr # BXZ9ATNDZjed5DFItwyj3YYoh0L7Bid594ZuAbxHRUCOnaFNR3DZbNVe2+uZfo5l # Lc2L/Vfa4mwuQtH0psMvXAvvywuGNGv9cjK5PwVyRwqsppcco5oV9LckmGMMnWoO # xsU+/IMljHORL0HPvL+RaSBDprj7GgA7nDEuv7zyYMC/uWHv5olUvLNEdN/LtCaB # S9mVlBpkB2yxUT8FvZg71UEK0FVix2IaFcSF8FZcJqHw+ureRH0mhwe7BUytMvH1 # 7IbfeYSu7h6fr6wnGrW8dQ== # =k8sF # -----END PGP SIGNATURE----- # gpg: Signature made Wed 16 Oct 2024 01:39:01 PM PDT # gpg: using RSA key EC4EA8457A7CC34E68BD8AFFF49080E310320B29 # gpg: issuer "luiz.von.dentz@intel.com" # gpg: Can't check signature: No public key

Currently reset state configuration of split header works fine for non-tagged packets and we see no corruption in payload of any size We need additional programming sequence with reset configuration to handle VLAN tagged packets to avoid corruption in payload for packets of size greater than 256 bytes. Without this change ping application complains about corruption in payload when the size of the VLAN packet exceeds 256 bytes. With this change tagged and non-tagged packets of any size works fine and there is no corruption seen. Current configuration which has the issue for VLAN packet ---------------------------------------------------------- Split happens at the position at Layer 3 header |MAC-DA|MAC-SA|Vlan Tag|Ether type|IP header|IP data|Rest of the payload| 2 bytes ^ | With the fix we are making sure that the split happens now at Layer 2 which is end of ethernet header and start of IP payload Ip traffic split ----------------- Bits which take care of this are SPLM and SPLOFST SPLM = Split mode is set to Layer 2 SPLOFST = These bits indicate the value of offset from the beginning of Length/Type field at which header split should take place when the appropriate SPLM is selected. Reset value is 2bytes. Un-tagged data (without VLAN) |MAC-DA|MAC-SA|Ether type|IP header|IP data|Rest of the payload| 2bytes ^ | Tagged data (with VLAN) |MAC-DA|MAC-SA|VLAN Tag|Ether type|IP header|IP data|Rest of the payload| 2bytes ^ | Non-IP traffic split such AV packet ------------------------------------ Bits which take care of this are SAVE = Split AV Enable SAVO = Split AV Offset, similar to SPLOFST but this is for AVTP packets. |Preamble|MAC-DA|MAC-SA|VLAN tag|Ether type|IEEE 1722 payload|CRC| 2bytes ^ | Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Report MDI-X resolved state after link up. Tested on Linkstreet 88E6193X internal PHYs. Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: NipaLocal <nipa@local>

It is meant to use xa_err() to extract the error encoded in the return value of xa_store(). Fixes: 44c2fbe ("mlxsw: spectrum_router: Share nexthop counters in resilient groups") Signed-off-by: Yuan Can <yuancan@huawei.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Petr Machata <petrm@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Add selftest file for the link layer tests of a NIC driver. Test for auto-negotiation is added. Add LinkConfig class for changing link layer configs. Selftest makes use of ksft modules and ethtool. Include selftest file in the Makefile. Signed-off-by: Mohan Prasad J <mohan.prasad@microchip.com> Signed-off-by: NipaLocal <nipa@local>

Add selftest case for testing the speed and duplex state of local NIC driver and the partner based on the supported link modes obtained from the ethtool. Speed and duplex states are varied and verified using ethtool. Signed-off-by: Mohan Prasad J <mohan.prasad@microchip.com> Signed-off-by: NipaLocal <nipa@local>

Add selftest case to check the send and receive throughput. Supported link modes between local NIC driver and partner are varied. Then send and receive throughput is captured and verified. Test uses iperf3 tool. Signed-off-by: Mohan Prasad J <mohan.prasad@microchip.com> Signed-off-by: NipaLocal <nipa@local>

The fix for MAC addresses broke detection of the naming convention because it gave network devices no random MAC before bind() was called. This means that the check for the local assignment bit was always negative as the address was zeroed from allocation, instead of from overwriting the MAC with a unique hardware address. The correct check for whether bind() has altered the MAC is done with is_zero_ether_addr Signed-off-by: Oliver Neukum <oneukum@suse.com> Reported-by: Greg Thelen <gthelen@google.com> Diagnosed-by: John Sperbeck <jsperbeck@google.com> Fixes: bab8eb0 ("usbnet: modern method to get random MAC") Signed-off-by: NipaLocal <nipa@local>

In commit 8510801 ("selftests: drv-net: add ability to schedule cleanup with defer()"), a defer helper was added to Python selftests. The idea is to keep cleanup commands close to their dirtying counterparts, thereby making it more transparent what is cleaning up what, making it harder to miss a cleanup, and make the whole cleanup business exception safe. All these benefits are applicable to bash as well, exception safety can be interpreted in terms of safety vs. a SIGINT. This patch therefore introduces a framework of several helpers that serve to schedule cleanups in bash selftests: - defer_scope_push(), defer_scope_pop(): Deferred statements can be batched together in scopes. When a scope is popped, the deferred commands scheduled in that scope are executed in the order opposite to order of their scheduling. - defer(): Schedules a defer to the most recently pushed scope (or the default scope if none was pushed.) - defer_prio(): Schedules a defer on the priority track. The priority defer queue is run before the default defer queue when scope is popped. The issue that this is addressing is specifically the one of restoring devlink shared buffer threshold type. When setting up static thresholds, one has to first change the threshold type to static, then override the individual thresholds. When cleaning up, it would be natural to reset the threshold values first, then change the threshold type. But the values that are valid for dynamic thresholds are generally invalid for static thresholds and vice versa. Attempts to restore the values first would be bounced. Thus one has to first reset the threshold type, then adjust the thresholds. (You could argue that the shared buffer threshold type API is broken and you would be right, but here we are.) This cannot be solved by pure defers easily. I considered making it possible to disable an existing defer, so that one could then schedule a new defer and disable the original. But this forward-shifting of the defer job would have to take place after every threshold-adjusting command, which would make it very awkward to schedule these jobs. - defer_scopes_cleanup(): Pops any unpopped scopes, including the default one. The selftests that use defer should run this in their exit trap. This is important to get cleanups of interrupted scripts. - in_defer_scope(): Sometimes a function would like to introduce a new defer scope, then run whatever it is that it wants to run, and then pop the scope to run the deferred cleanups. The helper in_defer_scope() can be used to run another command within such environment, such that any scheduled defers run after the command finishes. The framework is added as a separate file lib/sh/defer.sh so that it can be used by all bash selftests, including those that do not currently use lib.sh. lib.sh however includes the file by default, because ideally all tests would use these helpers instead of hand-rolling their cleanups. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Consistent use of defers obviates the need for a separate test-specific cleanup function -- everything is just taken care of in defers. So in this patch, introduce a cleanup() helper in the forwarding lib.sh, which calls just pre_cleanup() and defer_scopes_cleanup(). Selftests are obviously still free to override the function. Since pre_cleanup() is too entangled with forwarding-specific minutia, the function cannot currently be in net/lib.sh. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Now that it is possible to schedule a deferral of stop_traffic() right after the traffic is started, we do not have to rely on the %% magic to kill the background process that was started last. Instead we can just give the PID explicitly. This makes it possible to start other background processes after the traffic is started without confusing the cleanup. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Instead of having a suite of dedicated cleanup functions, use the defer framework to schedule cleanups right as their setup functions are run. The sleep after stop_traffic() in mlxsw selftests is necessary, but scheduling it as "defer sleep; defer stop_traffic" is silly. Instead, add a local helper to stop traffic and sleep afterwards. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Use the defer framework to schedule cleanups as soon as the command is executed. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Use the defer framework to schedule cleanups as soon as the command is executed. Note that the start_traffic commands in __burst_test() are each sending a fixed number of packets (note the -c flag) and then ending. They therefore do not need a matching stop_traffic. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

Change ynl-gen-c.py to use NLA_BE16 and NLA_BE32 types to represent big-endian u16 and u32 ynl types. Doing this enables those attributes to have range checks applied, as the validator will then convert to host endianness prior to validation. The autogenerated kernel/uapi code have been regenerated by running: ./tools/net/ynl/ynl-regen.sh -f This changes the policy types of the following attributes: FOU_ATTR_PORT (NLA_U16 -> NLA_BE16) FOU_ATTR_PEER_PORT (NLA_U16 -> NLA_BE16) These two are used with nla_get_be16/nla_put_be16(). MPTCP_PM_ADDR_ATTR_ADDR4 (NLA_U32 -> NLA_BE32) This one is used with nla_get_in_addr/nla_put_in_addr(), which uses nla_get_be32/nla_put_be32(). IOWs the generated changes are AFAICT aligned with their implementations. The generated userspace code remains identical, and have been verified by comparing the output generated by the following command: make -C tools/net/ynl/generated Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Variable msg_ready is useless, since it does not represent anything. Get rid of it, using buf directly instead. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

The send_ext_msg_udp() function has become quite large, currently spanning 102 lines. Its complexity, along with extensive pointer and offset manipulation, makes it difficult to read and error-prone. The function has evolved over time, and it’s now due for a refactor. To improve readability and maintainability, isolate the case where no message fragmentation occurs into a separate function, into a new send_msg_no_fragmentation() function. This scenario covers about 95% of the messages. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Following the previous change, where the non-fragmented case was moved to its own function, this update introduces a new function called send_msg_fragmented to specifically manage scenarios where message fragmentation is required. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

With the introduction of the userdata concept, the term body has become ambiguous and less intuitive. To improve clarity, body is renamed to msg_body, making it clear that the body is not the only content following the header. In an upcoming patch, the term body_len will also be revised for further clarity. The current packet structure is as follows: release, header, body, [msg_body + userdata] Here, [msg_body + userdata] collectively forms what is currently referred to as "body." This renaming helps to distinguish and better understand each component of the packet. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

This new variable tracks the total length of the data to be sent, encompassing both the message body (msgbody) and userdata, which is collectively called body. By explicitly defining body_len, the code becomes clearer and easier to reason about, simplifying offset calculations and improving overall readability of the function. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

The current check to determine if the message body was fully sent is difficult to follow. To improve clarity, introduce a variable that explicitly tracks whether the message body (msgbody) has been completely sent, indicating when it's time to begin sending userdata. Additionally, add comments to make the code more understandable for others who may work with it. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Refactor the code by extracting the logic for appending the release into the buffer into a separate function. The goal is to reduce the size of send_msg_fragmented() and improve code readability. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Do not pass userdata to send_msg_fragmented, since we can get it later. This will be more useful in the next patch, where send_msg_fragmented() will be split even more, and userdata is only necessary in the last function. Suggested-by: Simon Horman <horms@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Refactor the send_msg_fragmented() function by extracting the logic for sending the message body into a new function called send_fragmented_body(). Now, send_msg_fragmented() handles appending the release and header, and then delegates the task of breaking up the body and sending the fragments to send_fragmented_body(). This is the final flow now: When send_ext_msg_udp() is called to send a message, it will: - call send_msg_no_fragmentation() if no fragmentation is needed or - call send_msg_fragmented() if fragmentation is needed * send_msg_fragmented() appends the header to the buffer, which is be persisted until the function returns * call send_fragmented_body() to iterate and populate the body of the message. It will not touch the header, and it will only replace the body, writing the msgbody and/or userdata. Also add some comment to make the code easier to review. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Looks like there are userspace applications which rely on the ability to migrate xfrm-interface policies while not providing the interface id. This worked because old code contains a match workaround: "if_id == 0 || pol->if_id == if_id". The packetpath lookup code uses the id_id as a key into rhashtable; after switch of migrate lookup over to those functions policy lookup fails. Add a workaround: if no policy to migrate is found and userspace provided no if_id (normal for non-interface policies!) do a full search of all policies and try to find one that matches the selector. This is super-slow, so print a warning when we find a policy, so hopefully such userspace requests are fixed up over time to not rely on magic-0-match. In case of setups where userspace frequently tries to migrate non-existent policies with if_id 0 such migrate requests will take much longer to complete. Other options: 1. also add xfrm interface usage counter so we know in advance if the slowpath could potentially find the 'right' policy or not. 2. Completely revert policy insertion patches and go back to linear insertion complexity. 1) seems premature, I'd expect userspace to try migration of policies it has inserted in the past, so either normal fastpath or slowpath should find a match anyway. 2) seems like a worse option to me. Cc: Nathan Harold <nharold@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Fixes: 563d5ca ("xfrm: switch migrate to xfrm_policy_lookup_bytype") Reported-by: Yan Yan <evitayan@google.com> Tested-by: Yan Yan <evitayan@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: NipaLocal <nipa@local>

As a general policy, we refer our generic register definitions over vendor specific definitions. In XPCS, it appears that the register layout follows a BMCR, BMSR and ADVERTISE register definition. We already refer to this BMCR register using several different macros which is confusing. Convert the following register definitions to generic versions: DW_VR_MII_MMD_CTRL => MII_BMCR MDIO_CTRL1 => MII_BMCR AN_CL37_EN => BMCR_ANENABLE SGMII_SPEED_SS6 => BMCR_SPEED1000 SGMII_SPEED_SS13 => BMCR_SPEED100 MDIO_CTRL1_RESET => BMCR_RESET DW_VR_MII_MMD_STS => MII_BMSR DW_VR_MII_MMD_STS_LINK_STS => BMSR_LSTATUS DW_FULL_DUPLEX => ADVERTISE_1000XFULL iDW_HALF_DUPLEX => ADVERTISE_1000XHALF Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: NipaLocal <nipa@local>

Remove an unnecessary switch() statement in xpcs_link_up_1000basex(). The only value this switch statement is interested in is SPEED_1000, all other values lead to an error. Replace this with a simple if() statement. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: NipaLocal <nipa@local>

Rearrange xpcs_link_up_1000basex() to make it more obvious what will happen in the following commit. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: NipaLocal <nipa@local>

We can now see that we have an open-coded version of mii_bmcr_encode_fixed() when this is called with SPEED_1000: val = BMCR_SPEED1000; if (duplex == DUPLEX_FULL) val |= BMCR_FULLDPLX; Replace this with a call to mii_bmcr_encode_fixed(). Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: NipaLocal <nipa@local>

xpcs_link_up_sgmii() and xpcs_link_up_1000basex() are almost identical with the exception of checking the speed and duplex for 1000BASE-X. Combine the two functions. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: NipaLocal <nipa@local>

xpcs_config_usxgmii() is only called from the xpcs_link_up() method, so let's name it similarly to the SGMII and 1000BASEX functions. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: NipaLocal <nipa@local>

While using "return" when calling a void returning function inside a function that returns void doesn't cause a compiler warning, it looks weird. Convert the bunch of if() statements to a switch() and remove these return statements. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: NipaLocal <nipa@local>

BUG: KASAN: slab-use-after-free in __nf_unregister_net_hook+0x640/0x6b0 Read of size 8 at addr ffff8880106fe400 by task repro/72= bpf_nf_link_release+0xda/0x1e0 bpf_link_free+0x139/0x2d0 bpf_link_release+0x68/0x80 __fput+0x414/0xb60 Eric says: It seems that bpf was able to defer the __nf_unregister_net_hook() after exit()/close() time. Perhaps a netns reference is missing, because the netns has been dismantled/freed already. bpf_nf_link_attach() does : link->net = net; But I do not see a reference being taken on net. Add such a reference and release it after hook unreg. Note that I was unable to get syzbot reproducer to work, so I do not know if this resolves this splat. Fixes: 84601d6 ("bpf: add bpf_link support for BPF_NETFILTER programs") Diagnosed-by: Eric Dumazet <edumazet@google.com> Reported-by: Lai, Yi <yi1.lai@linux.intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NipaLocal <nipa@local>

PCI_VDEVICE_SUB generates the pci_device_id struct layout for the specific PCI device/subdevice. The subvendor field is set to PCI_ANY_ID. Private data may follow the output. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com> Signed-off-by: NipaLocal <nipa@local>

Add support for Intel(R) E610 Series of network devices. The E610 is based on X550 but adds firmware managed link, enhanced security capabilities and support for updated server manageability Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com> Signed-off-by: NipaLocal <nipa@local>

Commit 02b34d0 ("netdevsim: add dummy macsec offload") pads u64 number to 8 characters using "%08llx" format specifier. Changing format specifier to "%016llx" ensures that no matter the value the representation of number in log is always the same length. Before this patch, entry in log for value '1' would say: removing SecY with SCI 00000001 at index 2 After this patch is applied, entry in log will say: removing SecY with SCI 0000000000000001 at index 2 Signed-off-by: Ales Nezbeda <anezbeda@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: NipaLocal <nipa@local>

sock_{,re}set_flag() are contained in sock_valbool_flag(), it would be cleaner to just use sock_valbool_flag(). Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: NipaLocal <nipa@local>

Before adding a new line at the end of the temporary buffer in dump_cpumask, a length check is performed to ensure there is space for it. len = min(sizeof(kbuf) - 1, *lenp); len = scnprintf(kbuf, len, ...); if (len < *lenp) kbuf[len++] = '\n'; Note that the check is currently logically wrong, the written length is compared against the output buffer, not the temporary one. However this has no consequence as this is always true, even if fixed: scnprintf includes a null char at the end of the buffer but the returned length do not include it and there is always space for overriding it with a newline. Remove the condition. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

…uffer When computing the length we'll be able to use out of the buffers, one char is removed from the temporary one to make room for a newline. It should be removed from the output buffer length too, but in reality this is not needed as the later call to scnprintf makes sure a null char is written at the end of the buffer which we override with the newline. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: NipaLocal <nipa@local>

This fixes the output of rps_default_mask and flow_limit_cpu_bitmap when the CPU count is > 448, as it was truncated. The underlying values are actually stored correctly when writing to these sysctl but displaying them uses a fixed length temporary buffer in dump_cpumask. This buffer can be too small if the CPU count is > 448. Fix this by dynamically allocating the buffer in dump_cpumask, using a guesstimate of what we need. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

… created by classifiers tcf_action_init() has logic for checking mismatches between action and filter offload flags (skip_sw/skip_hw). AFAIU, this is intended to run on the transition between the new tc_act_bind(flags) returning true (aka now gets bound to classifier) and tc_act_bind(act->tcfa_flags) returning false (aka action was not bound to classifier before). Otherwise, the check is skipped. For the case where an action is not standalone, but rather it was created by a classifier and is bound to it, tcf_action_init() skips the check entirely, and this means it allows mismatched flags to occur. Taking the matchall classifier code path as an example (with mirred as an action), the reason is the following: 1 | mall_change() 2 | -> mall_replace_hw_filter() 3 | -> tcf_exts_validate_ex() 4 | -> flags |= TCA_ACT_FLAGS_BIND; 5 | -> tcf_action_init() 6 | -> tcf_action_init_1() 7 | -> a_o->init() 8 | -> tcf_mirred_init() 9 | -> tcf_idr_create_from_flags() 10 | -> tcf_idr_create() 11 | -> p->tcfa_flags = flags; 12 | -> tc_act_bind(flags)) 13 | -> tc_act_bind(act->tcfa_flags) When invoked from tcf_exts_validate_ex() like matchall does (but other classifiers validate their extensions as well), tcf_action_init() runs in a call path where "flags" always contains TCA_ACT_FLAGS_BIND (set by line 4). So line 12 is always true, and line 13 is always true as well. No transition ever takes place, and the check is skipped. The code was added in this form in commit c86e020 ("flow_offload: validate flags of filter and actions"), but I'm attributing the blame even earlier in that series, to when TCA_ACT_FLAGS_SKIP_HW and TCA_ACT_FLAGS_SKIP_SW were added to the UAPI. Following the development process of this change, the check did not always exist in this form. A change took place between v3 [1] and v4 [2], AFAIU due to review feedback that it doesn't make sense for action flags to be different than classifier flags. I think I agree with that feedback, but it was translated into code that omits enforcing this for "classic" actions created at the same time with the filters themselves. There are 3 more important cases to discuss. First there is this command: $ tc qdisc add dev eth0 clasct $ tc filter add dev eth0 ingress matchall skip_sw \ action mirred ingress mirror dev eth1 which should be allowed, because prior to the concept of dedicated action flags, it used to work and it used to mean the action inherited the skip_sw/skip_hw flags from the classifier. It's not a mismatch. Then we have this command: $ tc qdisc add dev eth0 clasct $ tc filter add dev eth0 ingress matchall skip_sw \ action mirred ingress mirror dev eth1 skip_hw where there is a mismatch and it should be rejected. Finally, we have: $ tc qdisc add dev eth0 clasct $ tc filter add dev eth0 ingress matchall skip_sw \ action mirred ingress mirror dev eth1 skip_sw where the offload flags coincide, and this should be treated the same as the first command based on inheritance, and accepted. [1]: https://lore.kernel.org/netdev/20211028110646.13791-9-simon.horman@corigine.com/ [2]: https://lore.kernel.org/netdev/20211118130805.23897-10-simon.horman@corigine.com/ Fixes: 7adc576 ("flow_offload: add skip_hw and skip_sw to control if offload the action") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: NipaLocal <nipa@local>

This isn't used outside act_api.c, but is called by tcf_dump_walker() prior to its definition. So move it upwards and make it static. Simultaneously, reorder the variable declarations so that they follow the networking "reverse Christmas tree" coding style. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: NipaLocal <nipa@local>

Background: switchdev ports offload the Linux bridge, and most of the packets they handle will never see the CPU. The ports between which there exists no hardware data path are considered 'foreign' to switchdev. These can either be normal physical NICs without switchdev offload, or incompatible switchdev ports, or virtual interfaces like veth/dummy/etc. In some cases, an offloaded filter can only do half the work, and the rest must be handled by software. Redirecting/mirroring from the ingress of a switchdev port towards a foreign interface is one example of combined hardware/software data path. The most that the switchdev port can do is to extract the matching packets from its offloaded data path and send them to the CPU. From there on, the software filter runs (a second time, after the first run in hardware) on the packet and performs the mirred action. It makes sense for switchdev drivers which allow this kind of "half offloading" to sense the "skip_sw" flag of the filter/action pair, and deny attempts from the user to install a filter that does not run in software, because that simply won't work. In fact, a mirred action on a switchdev port towards a dummy interface appears to be a valid way of (selectively) monitoring offloaded traffic that flows through it. IFF_PROMISC was also discussed years ago, but (despite initial disagreement) there seems to be consensus that this flag should not affect the destination taken by packets, but merely whether or not the NIC discards packets with unknown MAC DA for local processing. Only the flower and matchall classifiers are of interest to me for purely pragmatic reasons: these are offloaded by DSA currently. [1] https://lore.kernel.org/netdev/20190830092637.7f83d162@ceranb/ [2] https://lore.kernel.org/netdev/20191002233750.13566-1-olteanv@gmail.com/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: NipaLocal <nipa@local>

The body is a bit hard to read, hard to extend, and has duplicated conditions. Clean up the "if (many conditions) else if (many conditions, some of them repeated)" pattern by: - Moving the repeated conditions out - Replacing the repeated tests for the same variable with a switch/case - Moving the protocol check inside the dsa_user_add_cls_matchall_mirred() function call. This is pure refactoring, no logic has been changed, though some tests were reordered. The order does not matter - they are independent things to be tested for. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: NipaLocal <nipa@local>

We already have an "extack" stack variable in dsa_user_add_cls_matchall_police() and dsa_user_add_cls_matchall_mirred(), there is no need to retrieve it again from cls->common.extack. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: NipaLocal <nipa@local>

Do not leave -EOPNOTSUPP errors without an explanation. It is confusing for the user to figure out what is wrong otherwise. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: NipaLocal <nipa@local>

If the CPU bandwidth capacity permits, it may be useful to mirror the entire ingress of a user port to software. This is in fact possible to express even if there is no net_device representation for the CPU port. In fact, that approach was already exhausted and that representation wouldn't have even helped [1]. The idea behind implementing this is that currently, we refuse to offload any mirroring towards a non-DSA target net_device. But if we acknowledge the fact that to reach any foreign net_device, the switch must send the packet to the CPU anyway, then we can simply offload just that part, and let the software do the rest. There is only one condition we need to uphold: the filter needs to be present in the software data path as well (no skip_sw). There are 2 actions to consider: FLOW_ACTION_MIRRED (redirect to egress of target interface) and FLOW_ACTION_MIRRED_INGRESS (redirect to ingress of target interface). We don't have the ability/API to offload FLOW_ACTION_MIRRED_INGRESS when the target port is also a DSA user port, but we could also permit that through mirred to the CPU + software. Example: $ ip link add dummy0 type dummy; ip link set dummy0 up $ tc qdisc add dev swp0 clsact $ tc filter add dev swp0 ingress matchall action mirred ingress mirror dev dummy0 Any DSA driver with a ds->ops->port_mirror_add() implementation can now make use of this with no additional change. [1] https://lore.kernel.org/netdev/20191002233750.13566-1-olteanv@gmail.com/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: NipaLocal <nipa@local>

…rfaces Debugging certain flows in the offloaded switch data path can be done by installing two tc-mirred filters for mirroring: one in the hardware data path, which copies the frames to the CPU, and one which takes the frame from there and mirrors it to a virtual interface like a dummy device, where it can be seen with tcpdump. The effect of having 2 filters run on the same packet can be obtained by default using tc, by not specifying either the 'skip_sw' or 'skip_hw' keywords. Instead of refusing to offload mirroring/redirecting packets towards interfaces that aren't switch ports, just treat every other destination for what it is: something that is handled in software, behind the CPU port. Usage: $ ip link add dummy0 type dummy; ip link set dummy0 up $ tc qdisc add dev swp0 clsact $ tc filter add dev swp0 ingress protocol ip flower action mirred ingress mirror dev dummy0 Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: NipaLocal <nipa@local>

kernel test robot reported a section mismatch in ip6_mr_cleanup(). WARNING: modpost: vmlinux: section mismatch in reference: ip6_mr_cleanup+0x0 (section: .text) -> 0xffffffff (section: .init.rodata) WARNING: modpost: vmlinux: section mismatch in reference: ip6_mr_cleanup+0x14 (section: .text) -> ip6mr_rtnl_msg_handlers (section: .init.rodata) ip6_mr_cleanup() uses ip6mr_rtnl_msg_handlers[] that has __initconst_or_module qualifier. ip6_mr_cleanup() is only called from inet6_init() but does not have __init qualifier. Let's add __init to ip6_mr_cleanup(). Fixes: 3ac84e3 ("ipmr: Use rtnl_register_many().") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202410180139.B3HeemsC-lkp@intel.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: NipaLocal <nipa@local>

We will convert addr_doit() and getaddr_dumpit() to RCU, both of which call fill_addr(). The former will call phonet_address_notify() outside of RCU due to GFP_KERNEL, so dev will not be available in fill_addr(). Let's pass ifindex directly to fill_addr(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: NipaLocal <nipa@local>

Currently, phonet_address_notify() fetches netns and ifindex from dev. Once addr_doit() is converted to RCU, phonet_address_notify() will be called outside of RCU due to GFP_KERNEL, and dev will be unavailable there. Let's pass net and ifindex to phonet_address_notify(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: NipaLocal <nipa@local>

addr_doit() calls phonet_address_add() or phonet_address_del() for RTM_NEWADDR or RTM_DELADDR, respectively. Both functions only touch phonet_device_list(dev_net(dev)), which is currently protected by RTNL and its dedicated mutex, phonet_device_list.lock. We will convert addr_doit() to RCU and cannot use mutex inside RCU. Let's convert the mutex to spinlock_t. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: NipaLocal <nipa@local>

Now only __dev_get_by_index() depends on RTNL in addr_doit(). Let's use dev_get_by_index_rcu() and register addr_doit() with RTNL_FLAG_DOIT_UNLOCKED. While at it, I changed phonet_rtnl_msg_handlers[]'s init to C99 style like other core networking code. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: NipaLocal <nipa@local>

getaddr_dumpit() already relies on RCU and does not need RTNL. Let's use READ_ONCE() for ifindex and register getaddr_dumpit() with RTNL_FLAG_DUMP_UNLOCKED. While at it, the retval of getaddr_dumpit() is changed to combine NLMSG_DONE and save recvmsg() as done in 58a4ff5 ("phonet: no longer hold RTNL in route_dumpit()"). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: NipaLocal <nipa@local>

We will convert route_doit() to RCU. route_doit() will call rtm_phonet_notify() outside of RCU due to GFP_KERNEL, so dev will not be available in fill_route(). Let's pass ifindex directly to fill_route(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: NipaLocal <nipa@local>

Currently, rtm_phonet_notify() fetches netns and ifindex from dev. Once route_doit() is converted to RCU, rtm_phonet_notify() will be called outside of RCU due to GFP_KERNEL, and dev will be unavailable there. Let's pass net and ifindex to rtm_phonet_notify(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: NipaLocal <nipa@local>

route_doit() calls phonet_route_add() or phonet_route_del() for RTM_NEWROUTE or RTM_DELROUTE, respectively. Both functions only touch phonet_pernet(dev_net(dev))->routes, which is currently protected by RTNL and its dedicated mutex, phonet_routes.lock. We will convert route_doit() to RCU and cannot use mutex inside RCU. Let's convert the mutex to spinlock_t. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: NipaLocal <nipa@local>

Now only __dev_get_by_index() depends on RTNL in route_doit(). Let's use dev_get_by_index_rcu() and register route_doit() with RTNL_FLAG_DOIT_UNLOCKED. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: NipaLocal <nipa@local>

Add error pointer check after calling otx2_mbox_get_rsp(). Fixes: ab58a41 ("octeontx2-pf: cn10k: Get max mtu supported from admin function") Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Add error pointer check after calling otx2_mbox_get_rsp(). Fixes: 75f3627 ("octeontx2-pf: Support to enable/disable pause frames via ethtool") Fixes: d0cf950 ("octeontx2-pf: ethtool fec mode support") Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Adding error pointer check after calling otx2_mbox_get_rsp(). Fixes: 9917060 ("octeontx2-pf: Cleanup flow rule management") Fixes: f0a1913 ("octeontx2-pf: Add support for ethtool ntuple filters") Fixes: 674b3e1 ("octeontx2-pf: Add additional checks while configuring ucast/bcast/mcast rules") Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Add error pointer check after calling otx2_mbox_get_rsp(). Fixes: 2ca89a2 ("octeontx2-pf: TC_MATCHALL ingress ratelimiting offload") Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Add error pointer checks after calling otx2_mbox_get_rsp(). Fixes: 79d2be3 ("octeontx2-pf: offload DMAC filters to CGX/RPM block") Fixes: fa5e0cc ("octeontx2-pf: Add support for exact match table.") Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Add error pointer check after calling otx2_mbox_get_rsp(). Fixes: 8e67558 ("octeontx2-pf: PFC config support with DCBx") Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Recently, commit 4a0ec2a ("ipv6: switch inet6_addr_hash() to less predictable hash") and commit 4daf4dc ("ipv6: switch inet6_acaddr_hash() to less predictable hash") hardened IPv6 address hash functions. inet_addr_hash() is also highly predictable, and a malicious use could abuse a specific bucket. Let's follow the change on IPv4 by using jhash_1word(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: NipaLocal <nipa@local>

igc_xdp_run_prog() converts customed xdp action to a negative error code with the sk_buff pointer type which be checked with IS_ERR in igc_clean_rx_irq(). Remove this error pointer handing instead use plain int return value to fix this smatch warnings: drivers/net/ethernet/intel/igc/igc_main.c:2533 igc_xdp_run_prog() warn: passing zero to 'ERR_PTR' Fixes: 2657510 ("igc: Add initial XDP support") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: NipaLocal <nipa@local>

igb_run_xdp() converts customed xdp action to a negative error code with the sk_buff pointer type which be checked with IS_ERR in igb_clean_rx_irq(). Remove this error pointer handing instead use plain int return value. Fixes: 9cbc948 ("igb: add XDP support") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: NipaLocal <nipa@local>

ixgbe_run_xdp() converts customed xdp action to a negative error code with the sk_buff pointer type which be checked with IS_ERR in ixgbe_clean_rx_irq(). Remove this error pointer handing instead use plain int return value. Fixes: 9247080 ("ixgbe: add XDP support for pass and drop actions") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: NipaLocal <nipa@local>

ixgbevf_run_xdp() converts customed xdp action to a negative error code with the sk_buff pointer type which be checked with IS_ERR in ixgbevf_clean_rx_irq(). Remove this error pointer handing instead use plain int return value. Fixes: c7aec59 ("ixgbevf: Add XDP support for pass and drop actions") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: NipaLocal <nipa@local>

In 'taprio_change()', 'admin' pointer may become dangling due to sched switch / removal caused by 'advance_sched()', and critical section protected by 'q->current_entry_lock' is too small to prevent from such a scenario (which causes use-after-free detected by KASAN). Fix this by prefer 'rcu_replace_pointer()' over 'rcu_assign_pointer()' to update 'admin' immediately before an attempt to schedule freeing. Fixes: a3d43c0 ("taprio: Add support adding an admin schedule") Reported-by: syzbot+b65e0af58423fc8a73aa@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Signed-off-by: NipaLocal <nipa@local>

Fix possible use-after-free in 'taprio_dump()' by adding RCU read-side critical section there. Never seen on x86 but found on a KASAN-enabled arm64 system when investigating https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa: [T15862] BUG: KASAN: slab-use-after-free in taprio_dump+0xa0c/0xbb0 [T15862] Read of size 4 at addr ffff0000d4bb88f8 by task repro/15862 [T15862] [T15862] CPU: 0 UID: 0 PID: 15862 Comm: repro Not tainted 6.11.0-rc1-00293-gdefaf1a2113a-dirty kernel-patches#2 [T15862] Hardware name: QEMU QEMU Virtual Machine, BIOS edk2-20240524-5.fc40 05/24/2024 [T15862] Call trace: [T15862] dump_backtrace+0x20c/0x220 [T15862] show_stack+0x2c/0x40 [T15862] dump_stack_lvl+0xf8/0x174 [T15862] print_report+0x170/0x4d8 [T15862] kasan_report+0xb8/0x1d4 [T15862] __asan_report_load4_noabort+0x20/0x2c [T15862] taprio_dump+0xa0c/0xbb0 [T15862] tc_fill_qdisc+0x540/0x1020 [T15862] qdisc_notify.isra.0+0x330/0x3a0 [T15862] tc_modify_qdisc+0x7b8/0x1838 [T15862] rtnetlink_rcv_msg+0x3c8/0xc20 [T15862] netlink_rcv_skb+0x1f8/0x3d4 [T15862] rtnetlink_rcv+0x28/0x40 [T15862] netlink_unicast+0x51c/0x790 [T15862] netlink_sendmsg+0x79c/0xc20 [T15862] __sock_sendmsg+0xe0/0x1a0 [T15862] ____sys_sendmsg+0x6c0/0x840 [T15862] ___sys_sendmsg+0x1ac/0x1f0 [T15862] __sys_sendmsg+0x110/0x1d0 [T15862] __arm64_sys_sendmsg+0x74/0xb0 [T15862] invoke_syscall+0x88/0x2e0 [T15862] el0_svc_common.constprop.0+0xe4/0x2a0 [T15862] do_el0_svc+0x44/0x60 [T15862] el0_svc+0x50/0x184 [T15862] el0t_64_sync_handler+0x120/0x12c [T15862] el0t_64_sync+0x190/0x194 [T15862] [T15862] Allocated by task 15857: [T15862] kasan_save_stack+0x3c/0x70 [T15862] kasan_save_track+0x20/0x3c [T15862] kasan_save_alloc_info+0x40/0x60 [T15862] __kasan_kmalloc+0xd4/0xe0 [T15862] __kmalloc_cache_noprof+0x194/0x334 [T15862] taprio_change+0x45c/0x2fe0 [T15862] tc_modify_qdisc+0x6a8/0x1838 [T15862] rtnetlink_rcv_msg+0x3c8/0xc20 [T15862] netlink_rcv_skb+0x1f8/0x3d4 [T15862] rtnetlink_rcv+0x28/0x40 [T15862] netlink_unicast+0x51c/0x790 [T15862] netlink_sendmsg+0x79c/0xc20 [T15862] __sock_sendmsg+0xe0/0x1a0 [T15862] ____sys_sendmsg+0x6c0/0x840 [T15862] ___sys_sendmsg+0x1ac/0x1f0 [T15862] __sys_sendmsg+0x110/0x1d0 [T15862] __arm64_sys_sendmsg+0x74/0xb0 [T15862] invoke_syscall+0x88/0x2e0 [T15862] el0_svc_common.constprop.0+0xe4/0x2a0 [T15862] do_el0_svc+0x44/0x60 [T15862] el0_svc+0x50/0x184 [T15862] el0t_64_sync_handler+0x120/0x12c [T15862] el0t_64_sync+0x190/0x194 [T15862] [T15862] Freed by task 6192: [T15862] kasan_save_stack+0x3c/0x70 [T15862] kasan_save_track+0x20/0x3c [T15862] kasan_save_free_info+0x4c/0x80 [T15862] poison_slab_object+0x110/0x160 [T15862] __kasan_slab_free+0x3c/0x74 [T15862] kfree+0x134/0x3c0 [T15862] taprio_free_sched_cb+0x18c/0x220 [T15862] rcu_core+0x920/0x1b7c [T15862] rcu_core_si+0x10/0x1c [T15862] handle_softirqs+0x2e8/0xd64 [T15862] __do_softirq+0x14/0x20 Fixes: 18cdd2f ("net/sched: taprio: taprio_dump and taprio_change are protected by rtnl_mutex") Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Signed-off-by: NipaLocal <nipa@local>

npinfo is not used in any of the ndo_netpoll_setup() methods. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Change smp_mb() imediately following a set_bit() with smp_mb__after_atomic(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: NipaLocal <nipa@local>

There are a couple of attributes missing from the 'bitset' attribute-set in the ethtool netlink spec. Add them to the spec. Reported-by: Kory Maincent <kory.maincent@bootlin.com> Closes: https://lore.kernel.org/netdev/20241017180551.1259bf5c@kmaincent-XPS-13-7390/ Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Tested-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: NipaLocal <nipa@local>

It was reported that after resume from suspend a PCI error is logged and connectivity is broken. Error message is: PCI error (cmd = 0x0407, status_errs = 0x0000) The message seems to be a red herring as none of the error bits is set, and the PCI command register value also is normal. Exception handling for a PCI error includes a chip reset what apparently brakes connectivity here. The interrupt status bit triggering the PCI error handling isn't actually used on PCIe chip versions, so it's not clear why this bit is set by the chip. Fix this by ignoring this bit on PCIe chip versions. Fixes: 0e48515 ("r8169: merge with version 8.001.00 of Realtek's r8168 driver") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219388 Tested-by: Atlas Yu <atlas.yu@canonical.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

The notification handling in ynl is currently very simple, using sleep() to wait a period of time and then handling all the buffered messages in a single batch. This patch changes the notification handling so that messages are processed as they are received. This makes it possible to use ynl as a library that supplies notifications in a timely manner. - Change check_ntf() to be a generator that yields 1 notification at a time and blocks until a notification is available. - Use the --sleep parameter to set an alarm and exit when it fires. This means that the CLI has the same interface, but notifications get printed as they are received: ./tools/net/ynl/cli.py --spec <SPEC> --subscribe <TOPIC> [ --sleep <SECS> ] Here is an example python snippet that shows how to use ynl as a library for receiving notifications: ynl = YnlFamily(f"{dir}/rt_route.yaml") ynl.ntf_subscribe('rtnlgrp-ipv4-route') for event in ynl.check_ntf(): handle(event) Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Tested-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: NipaLocal <nipa@local>

…ule() Both gcc-14 and clang-18 report that passing a non-string literal as the format argument of request_module() is potentially insecure. E.g. clang-18 says: .../nf_bpf_link.c:46:24: warning: format string is not a string literal (potentially insecure) [-Wformat-security] 46 | err = request_module(mod); | ^~~ .../kmod.h:25:55: note: expanded from macro 'request_module' 25 | #define request_module(mod...) __request_module(true, mod) | ^~~ .../nf_bpf_link.c:46:24: note: treat the string as an argument to avoid this 46 | err = request_module(mod); | ^ | "%s", .../kmod.h:25:55: note: expanded from macro 'request_module' 25 | #define request_module(mod...) __request_module(true, mod) | ^ It is always the case where the contents of mod is safe to pass as the format argument. That is, in my understanding, it never contains any format escape sequences. But, it seems better to be safe than sorry. And, as a bonus, compiler output becomes less verbose by addressing this issue as suggested by clang-18. No functional change intended. Compile tested only. Signed-off-by: Simon Horman <horms@kernel.org> Reviewed-by: Florian Westphal <fw@strlen.de> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: NipaLocal <nipa@local>

The rplnlh parameter is not used in many functions, so delete it. Signed-off-by: Liu Jing <liujing@cmss.chinamobile.com> Signed-off-by: NipaLocal <nipa@local>

The SMMU engine on HIP09 chip has a hardware issue. SMMU pagetable prefetch features may prefetch and use a invalid PTE even the PTE is valid at that time. This will cause the device trigger fake pagefaults. The solution is to avoid prefetching by adding a SYNC command when smmu mapping a iova. But the performance of nic has a sharp drop. Then we do this workaround, always enable tx bounce buffer, avoid mapping/unmapping on TX path. This issue only affects HNS3, so we always enable tx bounce buffer when smmu enabled to improve performance. Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com> Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

To avoid errors in pgtable prefectch, add a sync command to sync io-pagtable. In the case of large traffic, the TX bounce buffer may be used up. At this point, we go to mapping/unmapping on TX path again. So we added the sync command in driver to avoid hardware issue. Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

When a reset type that is not supported by the driver is input, a reset pending flag bit of the HNAE3_NONE_RESET type is generated in reset_pending. The driver does not have a mechanism to clear this type of error. As a result, the driver considers that the reset is not complete. This patch provides a mechanism to clear the HNAE3_NONE_RESET flag and the parameter of hnae3_ae_ops.set_default_reset_request is verified. The error message: hns3 0000:39:01.0: cmd failed -16 hns3 0000:39:01.0: hclge device re-init failed, VF is disabled! hns3 0000:39:01.0: failed to reset VF stack hns3 0000:39:01.0: failed to reset VF(4) hns3 0000:39:01.0: prepare reset(2) wait done hns3 0000:39:01.0 eth4: already uninitialized Use the crash tool to view struct hclgevf_dev: struct hclgevf_dev { ... default_reset_request = 0x20, reset_level = HNAE3_NONE_RESET, reset_pending = 0x100, reset_type = HNAE3_NONE_RESET, ... }; Fixes: 720bd58 ("net: hns3: add set_default_reset_request in the hnae3_ae_ops") Signed-off-by: Hao Lan <lanhao@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

…o early Currently, the netdev->features is configured in hns3_nic_set_features. As a result, __netdev_update_features considers that there is no feature difference, and the procedures of the real features are missing. Fixes: 2a7556b ("net: hns3: implement ndo_features_check ops for hns3 driver") Signed-off-by: Hao Lan <lanhao@huawei.com> Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

…istent. This patch modifies the implementation of debugfs: When the user process stops unexpectedly, not all data of the file system is read. In this case, the save_buf pointer is not released. When the user process is called next time, save_buf is used to copy the cached data to the user space. As a result, the queried data is inconsistent. To solve this problem, determine whether the function is invoked for the first time based on the value of *ppos. If *ppos is 0, obtain the actual data. Fixes: 5e69ea7 ("net: hns3: refactor the debugfs process") Signed-off-by: Hao Lan <lanhao@huawei.com> Signed-off-by: Guangwei Zhang <zhangwangwei6@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Currently, there is a time window between misc irq enabled and service task inited. If an interrupte is reported at this time, it will cause warning like below: [ 16.324639] Call trace: [ 16.324641] __queue_delayed_work+0xb8/0xe0 [ 16.324643] mod_delayed_work_on+0x78/0xd0 [ 16.324655] hclge_errhand_task_schedule+0x58/0x90 [hclge] [ 16.324662] hclge_misc_irq_handle+0x168/0x240 [hclge] [ 16.324666] __handle_irq_event_percpu+0x64/0x1e0 [ 16.324667] handle_irq_event+0x80/0x170 [ 16.324670] handle_fasteoi_edge_irq+0x110/0x2bc [ 16.324671] __handle_domain_irq+0x84/0xfc [ 16.324673] gic_handle_irq+0x88/0x2c0 [ 16.324674] el1_irq+0xb8/0x140 [ 16.324677] arch_cpu_idle+0x18/0x40 [ 16.324679] default_idle_call+0x5c/0x1bc [ 16.324682] cpuidle_idle_call+0x18c/0x1c4 [ 16.324684] do_idle+0x174/0x17c [ 16.324685] cpu_startup_entry+0x30/0x6c [ 16.324687] secondary_start_kernel+0x1a4/0x280 [ 16.324688] ---[ end trace 6aa0bff672a964aa ]--- So don't auto enable misc vector when request irq.. Fixes: 7be1b9f ("net: hns3: make hclge_service use delayed workqueue") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Currently the misc irq is initialized before reset_timer setup. But it will access the reset_timer in the irq handler. So initialize the reset_timer earlier. Fixes: ff20009 ("net: hns3: remove unnecessary work in hclgevf_main") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

…issue The TQP BAR space is divided into two segments. TQPs 0-1023 and TQPs 1024-1279 are in different BAR space addresses. However, hclge_fetch_pf_reg does not distinguish the tqp space information when reading the tqp space information. When the number of TQPs is greater than 1024, access bar space overwriting occurs. The problem of different segments has been considered during the initialization of tqp.io_base. Therefore, tqp.io_base is directly used when the queue is read in hclge_fetch_pf_reg. The error message: Unable to handle kernel paging request at virtual address ffff800037200000 pc : hclge_fetch_pf_reg+0x138/0x250 [hclge] lr : hclge_get_regs+0x84/0x1d0 [hclge] Call trace: hclge_fetch_pf_reg+0x138/0x250 [hclge] hclge_get_regs+0x84/0x1d0 [hclge] hns3_get_regs+0x2c/0x50 [hns3] ethtool_get_regs+0xf4/0x270 dev_ethtool+0x674/0x8a0 dev_ioctl+0x270/0x36c sock_do_ioctl+0x110/0x2a0 sock_ioctl+0x2ac/0x530 __arm64_sys_ioctl+0xa8/0x100 invoke_syscall+0x4c/0x124 el0_svc_common.constprop.0+0x140/0x15c do_el0_svc+0x30/0xd0 el0_svc+0x1c/0x2c el0_sync_handler+0xb0/0xb4 el0_sync+0x168/0x180 Fixes: 939ccd1 ("net: hns3: move dump regs function to a separate file") Signed-off-by: Hao Lan <lanhao@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Currently, HIP08 devices does not register the ptp devices, so the hdev->ptp is NULL. But the tx process would still try to set hardware time stamp info with SKBTX_HW_TSTAMP flag and cause a kernel crash. [ 128.087798] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018 ... [ 128.280251] pc : hclge_ptp_set_tx_info+0x2c/0x140 [hclge] [ 128.286600] lr : hclge_ptp_set_tx_info+0x20/0x140 [hclge] [ 128.292938] sp : ffff800059b93140 [ 128.297200] x29: ffff800059b93140 x28: 0000000000003280 [ 128.303455] x27: ffff800020d48280 x26: ffff0cb9dc814080 [ 128.309715] x25: ffff0cb9cde93fa0 x24: 0000000000000001 [ 128.315969] x23: 0000000000000000 x22: 0000000000000194 [ 128.322219] x21: ffff0cd94f986000 x20: 0000000000000000 [ 128.328462] x19: ffff0cb9d2a166c0 x18: 0000000000000000 [ 128.334698] x17: 0000000000000000 x16: ffffcf1fc523ed24 [ 128.340934] x15: 0000ffffd530a518 x14: 0000000000000000 [ 128.347162] x13: ffff0cd6bdb31310 x12: 0000000000000368 [ 128.353388] x11: ffff0cb9cfbc7070 x10: ffff2cf55dd11e02 [ 128.359606] x9 : ffffcf1f85a212b4 x8 : ffff0cd7cf27dab0 [ 128.365831] x7 : 0000000000000a20 x6 : ffff0cd7cf27d000 [ 128.372040] x5 : 0000000000000000 x4 : 000000000000ffff [ 128.378243] x3 : 0000000000000400 x2 : ffffcf1f85a21294 [ 128.384437] x1 : ffff0cb9db520080 x0 : ffff0cb9db500080 [ 128.390626] Call trace: [ 128.393964] hclge_ptp_set_tx_info+0x2c/0x140 [hclge] [ 128.399893] hns3_nic_net_xmit+0x39c/0x4c4 [hns3] [ 128.405468] xmit_one.constprop.0+0xc4/0x200 [ 128.410600] dev_hard_start_xmit+0x54/0xf0 [ 128.415556] sch_direct_xmit+0xe8/0x634 [ 128.420246] __dev_queue_xmit+0x224/0xc70 [ 128.425101] dev_queue_xmit+0x1c/0x40 [ 128.429608] ovs_vport_send+0xac/0x1a0 [openvswitch] [ 128.435409] do_output+0x60/0x17c [openvswitch] [ 128.440770] do_execute_actions+0x898/0x8c4 [openvswitch] [ 128.446993] ovs_execute_actions+0x64/0xf0 [openvswitch] [ 128.453129] ovs_dp_process_packet+0xa0/0x224 [openvswitch] [ 128.459530] ovs_vport_receive+0x7c/0xfc [openvswitch] [ 128.465497] internal_dev_xmit+0x34/0xb0 [openvswitch] [ 128.471460] xmit_one.constprop.0+0xc4/0x200 [ 128.476561] dev_hard_start_xmit+0x54/0xf0 [ 128.481489] __dev_queue_xmit+0x968/0xc70 [ 128.486330] dev_queue_xmit+0x1c/0x40 [ 128.490856] ip_finish_output2+0x250/0x570 [ 128.495810] __ip_finish_output+0x170/0x1e0 [ 128.500832] ip_finish_output+0x3c/0xf0 [ 128.505504] ip_output+0xbc/0x160 [ 128.509654] ip_send_skb+0x58/0xd4 [ 128.513892] udp_send_skb+0x12c/0x354 [ 128.518387] udp_sendmsg+0x7a8/0x9c0 [ 128.522793] inet_sendmsg+0x4c/0x8c [ 128.527116] __sock_sendmsg+0x48/0x80 [ 128.531609] __sys_sendto+0x124/0x164 [ 128.536099] __arm64_sys_sendto+0x30/0x5c [ 128.540935] invoke_syscall+0x50/0x130 [ 128.545508] el0_svc_common.constprop.0+0x10c/0x124 [ 128.551205] do_el0_svc+0x34/0xdc [ 128.555347] el0_svc+0x20/0x30 [ 128.559227] el0_sync_handler+0xb8/0xc0 [ 128.563883] el0_sync+0x160/0x180 Fixes: 0bf5eb7 ("net: hns3: add support for PTP") Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Add devlink_nl_put_u64() that abstracts padding for u64 values. All u64 values are passed with the very same padding option. Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: NipaLocal <nipa@local>

Use devlink_nl_put_u64() shortcut added by prev commit on all devlink/. Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: NipaLocal <nipa@local>

Differentiate error codes of devl_resource_register(). Replace one of -EINVAL exit paths by -EEXIST. This should aid developers introducing new resources and registering them in the wrong order. Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: NipaLocal <nipa@local>

Consolidate error codes for too big message size. Current code is written to return -EINVAL when tailroom in the skb msg would be exhausted precisely when it's time to nest, and return -EMSGSIZE in all other "not enough space" conditions. Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: NipaLocal <nipa@local>

Replace devlink_resource_register(), devlink_resource_occ_get_register(), and devlink_resource_occ_get_unregister() calls by respective devl_* variants. Mentioned functions have no direct users in any drivers, and are going to be removed in subsequent patches. Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: NipaLocal <nipa@local>

…ister() Remove not used devlink_resource_occ_get_register() and devlink_resource_occ_get_unregister() functions; current devlink resource users are fine with devl_ variants of the two. Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: NipaLocal <nipa@local>

Remove unused devlink_resource_register(); all the drivers use devl_resource_register() variant instead. Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: NipaLocal <nipa@local>

Set the initial rec_seq to 0xffffffffffffffff so that it wraps immediately. The send() call should fail with EBADMSG. A bug in this code was fixed in commit cfaa80c ("net/tls: do not free tls_rec on async operation in bpf_exec_tx_verdict()"). Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

The testing is done by ensuring that the fragment allocated from a frag_frag_cache instance is pushed into a ptr_ring instance in a kthread binded to a specified cpu, and a kthread binded to a specified cpu will pop the fragment from the ptr_ring and free the fragment. CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: NipaLocal <nipa@local>

Inspired by [1], move the page fragment allocator from page_alloc into its own c file and header file, as we are about to make more change for it to replace another page_frag implementation in sock.c As this patchset is going to replace 'struct page_frag' with 'struct page_frag_cache' in sched.h, including page_frag_cache.h in sched.h has a compiler error caused by interdependence between mm_types.h and mm.h for asm-offsets.c, see [2]. So avoid the compiler error by moving 'struct page_frag_cache' to mm_types_task.h as suggested by Alexander, see [3]. 1. https://lore.kernel.org/all/20230411160902.4134381-3-dhowells@redhat.com/ 2. https://lore.kernel.org/all/15623dac-9358-4597-b3ee-3694a5956920@gmail.com/ 3. https://lore.kernel.org/all/CAKgT0UdH1yD=LSCXFJ=YM_aiA4OomD-2wXykO42bizaWMt_HOA@mail.gmail.com/ CC: David Howells <dhowells@redhat.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: NipaLocal <nipa@local>

We are about to use page_frag_alloc_*() API to not just allocate memory for skb->data, but also use them to do the memory allocation for skb frag too. Currently the implementation of page_frag in mm subsystem is running the offset as a countdown rather than count-up value, there may have several advantages to that as mentioned in [1], but it may have some disadvantages, for example, it may disable skb frag coalescing and more correct cache prefetching We have a trade-off to make in order to have a unified implementation and API for page_frag, so use a initial zero offset in this patch, and the following patch will try to make some optimization to avoid the disadvantages as much as possible. 1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c.camel@gmail.com/ CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: NipaLocal <nipa@local>

Use appropriate frag_page API instead of caller accessing 'page_frag_cache' directly. CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: NipaLocal <nipa@local>

As the get_order() implemented by xtensa supporting 'nsau' instruction seems be the same as the generic implementation in include/asm-generic/getorder.h when size is not a constant value as the generic implementation calling the fls*() is also utilizing the 'nsau' instruction for xtensa. So remove the get_order() implemented by xtensa, as using the generic implementation may enable the compiler to do the computing when size is a constant value instead of runtime computing and enable the using of get_order() in BUILD_BUG_ON() macro in next patch. CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: NipaLocal <nipa@local>

Currently there is one 'struct page_frag' for every 'struct sock' and 'struct task_struct', we are about to replace the 'struct page_frag' with 'struct page_frag_cache' for them. Before begin the replacing, we need to ensure the size of 'struct page_frag_cache' is not bigger than the size of 'struct page_frag', as there may be tens of thousands of 'struct sock' and 'struct task_struct' instances in the system. By or'ing the page order & pfmemalloc with lower bits of 'va' instead of using 'u16' or 'u32' for page size and 'u8' for pfmemalloc, we are able to avoid 3 or 5 bytes space waste. And page address & pfmemalloc & order is unchanged for the same page in the same 'page_frag_cache' instance, it makes sense to fit them together. After this patch, the size of 'struct page_frag_cache' should be the same as the size of 'struct page_frag'. CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: NipaLocal <nipa@local>

Refactor common codes from __page_frag_alloc_va_align() to __page_frag_cache_prepare() and __page_frag_cache_commit(), so that the new API can make use of them. CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: NipaLocal <nipa@local>

It seems there is about 24Bytes binary size increase for __page_frag_cache_refill() after refactoring in arm64 system with 64K PAGE_SIZE. By doing the gdb disassembling, It seems we can have more than 100Bytes decrease for the binary size by using __alloc_pages() to replace alloc_pages_node(), as there seems to be some unnecessary checking for nid being NUMA_NO_NODE, especially when page_frag is part of the mm system. CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: NipaLocal <nipa@local>

Rename skb_copy_to_page_nocache() to skb_copy_to_frag_nocache() to avoid calling virt_to_page() as we are about to pass virtual address directly. CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: NipaLocal <nipa@local>

There are many use cases that need minimum memory in order for forward progress, but more performant if more memory is available or need to probe the cache info to use any memory available for frag caoleasing reason. Currently skb_page_frag_refill() API is used to solve the above use cases, but caller needs to know about the internal detail and access the data field of 'struct page_frag' to meet the requirement of the above use cases and its implementation is similar to the one in mm subsystem. To unify those two page_frag implementations, introduce a prepare API to ensure minimum memory is satisfied and return how much the actual memory is available to the caller and a probe API to report the current available memory to caller without doing cache refilling. The caller needs to either call the commit API to report how much memory it actually uses, or not do so if deciding to not use any memory. CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Add testing for the newly added prepare API, for both aligned and non-aligned API, also probe API is also tested along with prepare API. CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Use the newly introduced prepare/probe/commit API to replace page_frag with page_frag_cache for sk_page_frag(). CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Update documentation about design, implementation and API usages for page_frag. CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: NipaLocal <nipa@local>

After this patchset, page_frag is a small subsystem/library on its own, so add an entry in MAINTAINERS to indicate the new subsystem/library's maintainer, maillist, status and file lists of page_frag. Alexander is the original author of page_frag, add him in the MAINTAINERS too. CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: NipaLocal <nipa@local>

Use the core_stats rx_dropped counter to avoid the cost of atomic increments. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: NipaLocal <nipa@local>

The vlan_qos member is used to save the vlan qos, but we only save the priority. Also, we will get the priority in vlan netlink and proc. We can just save the vlan priority using vlan_prio, so we can use vlan_prio to get the priority directly. For flexibility, we introduced vlan_dev_get_egress_priority() helper function. After this patch, we will call vlan_dev_get_egress_priority() instead of vlan_dev_get_egress_qos_mask() in irdma.ko and rdma_cm.ko. Because we don't need the shift and mask operations anymore. There is no functional changes. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: NipaLocal <nipa@local>

rtl8152_system_resume() issues a synchronous usb reset if the device is inaccessible. __rtl8152_set_mac_address() is called via rtl8152_post_reset() and it tries to take the same mutex that was already taken in rtl8152_resume(). Move the call to reset usb in rtl8152_resume() outside mutex protection. Signed-off-by: George-Daniel Matei <danielgeorgem@chromium.org> Signed-off-by: NipaLocal <nipa@local>

This gets referenced by the ip6_mr_cleanup function, so it must not be discarded early: WARNING: modpost: vmlinux: section mismatch in reference: ip6_mr_cleanup+0x14 (section: .exit.text) -> ip6mr_rtnl_msg_handlers (section: .init.rodata) ERROR: modpost: Section mismatches detected. Set CONFIG_SECTION_MISMATCH_WARN_ONLY=y to allow them. Fixes: 3ac84e3 ("ipmr: Use rtnl_register_many().") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NipaLocal <nipa@local>

These were added with the wrong family in 4cdc55e, which seems to just have been a typo, but now ip6tables rules with --set-mark don't work anymore, which is pretty bad. Fixes: 4cdc55ec6222 ("netfilter: xtables: avoid NFPROTO_UNSPEC where needed") Signed-off-by: Ilya Katsnelson <me@0upti.me> Reviewed-by: Phil Sutter <phil@nwl.cc> Signed-off-by: NipaLocal <nipa@local>

For the nflog_tg_reg struct under the CONFIG_IP6_NF_IPTABLES switch family should probably be NFPROTO_IPV6 Fixes: 0bfcb7b ("netfilter: xtables: avoid NFPROTO_UNSPEC where needed") Cc: stable@vger.kernel.org Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: NipaLocal <nipa@local>

The well-known errata regarding EEE not being functional on various KSZ switches has been refactored a few times. Recently the refactoring has excluded several switches that the errata should also apply to. Disable EEE for additional switches with this errata and provide additional comments referring to the public errata document. The original workaround for the errata was applied with a register write to manually disable the EEE feature in MMD 7:60 which was being applied for KSZ9477/KSZ9897/KSZ9567 switch ID's. Then came commit 26dd297 ("net: phy: micrel: Move KSZ9477 errata fixes to PHY driver") and commit 6068e6d ("net: dsa: microchip: remove KSZ9477 PHY errata handling") which moved the errata from the switch driver to the PHY driver but only for PHY_ID_KSZ9477 (PHY ID) however that PHY code was dead code because an entry was never added for PHY_ID_KSZ9477 via MODULE_DEVICE_TABLE. This was apparently realized much later and commit 54a4e5c ("net: phy: micrel: add Microchip KSZ 9477 to the device table") added the PHY_ID_KSZ9477 to the PHY driver but as the errata was only being applied to PHY_ID_KSZ9477 it's not completely clear what switches that relates to. Later commit 6149db4 ("net: phy: micrel: fix KSZ9477 PHY issues after suspend/resume") breaks this again for all but KSZ9897 by only applying the errata for that PHY ID. Following that this was affected with commit 08c6d8b("net: phy: Provide Module 4 KSZ9477 errata (DS80000754C)") which removes the blatant register write to MMD 7:60 and replaces it by setting phydev->eee_broken_modes = -1 so that the generic phy-c45 code disables EEE but this is only done for the KSZ9477_CHIP_ID (Switch ID). Lastly commit 0411f73 ("net: dsa: microchip: disable EEE for KSZ8567/KSZ9567/KSZ9896/KSZ9897.") adds some additional switches that were missing to the errata due to the previous changes. This commit adds an additional set of switches. Fixes: 0411f73 ("net: dsa: microchip: disable EEE for KSZ8567/KSZ9567/KSZ9896/KSZ9897.") Signed-off-by: Tim Harvey <tharvey@gateworks.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: NipaLocal <nipa@local>

Link IRQs to NAPI instances via netdev-genl API so that users can query this information with netlink. Compare the output of /proc/interrupts (noting that IRQ 144 is the "other" IRQ which does not appear to have a NAPI instance): $ cat /proc/interrupts | grep enp86s0 | cut --delimiter=":" -f1 128 129 130 131 132 The output from netlink shows the mapping of NAPI IDs to IRQs (again noting that 144 is absent as it is the "other" IRQ): $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 2}' [{'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 8196, 'ifindex': 2, 'irq': 132}, {'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 8195, 'ifindex': 2, 'irq': 131}, {'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 8194, 'ifindex': 2, 'irq': 130}, {'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 8193, 'ifindex': 2, 'irq': 129}] Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: NipaLocal <nipa@local>

Link queues to NAPI instances via netdev-genl API so that users can query this information with netlink. Handle a few cases in the driver: 1. Link/unlink the NAPIs when XDP is enabled/disabled 2. Handle IGC_FLAG_QUEUE_PAIRS enabled and disabled Example output when IGC_FLAG_QUEUE_PAIRS is enabled: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump queue-get --json='{"ifindex": 2}' [{'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'}, {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'rx'}, {'id': 2, 'ifindex': 2, 'napi-id': 8195, 'type': 'rx'}, {'id': 3, 'ifindex': 2, 'napi-id': 8196, 'type': 'rx'}, {'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'tx'}, {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'tx'}, {'id': 2, 'ifindex': 2, 'napi-id': 8195, 'type': 'tx'}, {'id': 3, 'ifindex': 2, 'napi-id': 8196, 'type': 'tx'}] Since IGC_FLAG_QUEUE_PAIRS is enabled, you'll note that the same NAPI ID is present for both rx and tx queues at the same index, for example index 0: {'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'}, {'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'tx'}, To test IGC_FLAG_QUEUE_PAIRS disabled, a test system was booted using the grub command line option "maxcpus=2" to force igc_set_interrupt_capability to disable IGC_FLAG_QUEUE_PAIRS. Example output when IGC_FLAG_QUEUE_PAIRS is disabled: $ lscpu | grep "On-line CPU" On-line CPU(s) list: 0,2 $ ethtool -l enp86s0 | tail -5 Current hardware settings: RX: n/a TX: n/a Other: 1 Combined: 2 $ cat /proc/interrupts | grep enp 144: [...] enp86s0 145: [...] enp86s0-rx-0 146: [...] enp86s0-rx-1 147: [...] enp86s0-tx-0 148: [...] enp86s0-tx-1 1 "other" IRQ, and 2 IRQs for each of RX and Tx, so we expect netlink to report 4 IRQs with unique NAPI IDs: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 2}' [{'id': 8196, 'ifindex': 2, 'irq': 148}, {'id': 8195, 'ifindex': 2, 'irq': 147}, {'id': 8194, 'ifindex': 2, 'irq': 146}, {'id': 8193, 'ifindex': 2, 'irq': 145}] Now we examine which queues these NAPIs are associated with, expecting that since IGC_FLAG_QUEUE_PAIRS is disabled each RX and TX queue will have its own NAPI instance: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump queue-get --json='{"ifindex": 2}' [{'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'}, {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'rx'}, {'id': 0, 'ifindex': 2, 'napi-id': 8195, 'type': 'tx'}, {'id': 1, 'ifindex': 2, 'napi-id': 8196, 'type': 'tx'}] Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: NipaLocal <nipa@local>

Add support for reading SFP module info and digital diagnostic monitoring data if supported by the module. The only Aquantia controller without an integrated PHY is the AQC100 which belongs to the B0 revision, that's why it's only implemented there. The register information was extracted from a diagnostic tool made publicly available by Dell, but all code was written from scratch by me. This has been tested to work with a variety of both optical and direct attach modules I had lying around and seems to work fine with all of them, including the diagnostics if supported by an optical module. All tests have been done with an AQC100 on an TL-NT521F card on firmware version 3.1.121 (current at the time of this patch). Signed-off-by: Lorenz Brun <lorenz@brun.one> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

The existing code moves VF to the same namespace as the synthetic NIC during netvsc_register_vf(). But, if the synthetic device is moved to a new namespace after the VF registration, the VF won't be moved together. To make the behavior more consistent, add a namespace check for synthetic NIC's NETDEV_REGISTER event (generated during its move), and move the VF if it is not in the same namespace. Cc: stable@vger.kernel.org Fixes: c0a41b8 ("hv_netvsc: move VF to same namespace as netvsc device") Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Allows simplifying get_strings and avoids manual pointer manipulation. Signed-off-by: Rosen Penev <rosenp@gmail.com> Signed-off-by: NipaLocal <nipa@local>

…git/bpf/bpf BPF fixes: - Fix BPF verifier to not affect subreg_def marks in its range propagation, from Eduard Zingerman. - Fix a truncation bug in the BPF verifier's handling of coerce_reg_to_size_sx, from Dimitar Kanaliev. - Fix the BPF verifier's delta propagation between linked registers under 32-bit addition, from Daniel Borkmann. - Fix a NULL pointer dereference in BPF devmap due to missing rxq information, from Florian Kauer. - Fix a memory leak in bpf_core_apply, from Jiri Olsa. - Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for arrays of nested structs, from Hou Tao. - Fix build ID fetching where memory areas backing the file were created with memfd_secret, from Andrii Nakryiko. - Fix BPF task iterator tid filtering which was incorrectly using pid instead of tid, from Jordan Rome. - Several fixes for BPF sockmap and BPF sockhash redirection in combination with vsocks, from Michal Luczaj. - Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered, from Andrea Parri. - Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility of an infinite BPF tailcall, from Pu Lehui. - Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot be resolved, from Thomas Weißschuh. - Fix a bug in kfunc BTF caching for modules where the wrong BTF object was returned, from Toke Høiland-Jørgensen. - Fix a BPF selftest compilation error in cgroup-related tests with musl libc, from Tony Ambardar. - Several fixes to BPF link info dumps to fill missing fields, from Tyrone Wu. - Add BPF selftests for kfuncs from multiple modules, checking that the correct kfuncs are called, from Simon Sundberg. - Ensure that internal and user-facing bpf_redirect flags don't overlap, also from Toke Høiland-Jørgensen. - Switch to use kvzmalloc to allocate BPF verifier environment, from Rik van Riel. - Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat under RT, from Wander Lairson Costa. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: NipaLocal <nipa@local> # -----BEGIN PGP SIGNATURE----- # # iIsEABYIADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZxK4OhUcZGFuaWVsQGlv # Z2VhcmJveC5uZXQACgkQ2yufC7HISIOCrwEAib2kC5EEQn5+wKVE/bnZryVX2leT # YXdfItDCBU6zCYUA+wTU5hGGn9lcDUcZx72l/KZPDyPw7HdzNJ+6iR1zQqoM # =f9kv # -----END PGP SIGNATURE----- # gpg: Signature made Fri 18 Oct 2024 12:34:18 PM PDT # gpg: using EDDSA key C5A742358EA66B017FA13D15DB2B9F0BB1C84883 # gpg: issuer "daniel@iogearbox.net" # gpg: Can't check signature: No public key

Introduce new variables plat_stmmacenet_data::snps_id,dev_id to allow glue drivers to specify synopsys ID and device id respectively. These values take precedence over reading from HW register. This facility provides a mechansim to use setup function from stmmac core module and yet override MAC.VERSION CSR if the glue driver chooses to do so. Signed-off-by: Jitendra Vegiraju <jitendra.vegiraju@broadcom.com> Signed-off-by: NipaLocal <nipa@local>

The BCM8958x uses early adopter version of DWC_xgmac version 4.00a for Ethernet MAC. The DW25GMAC introduced in this version adds new DMA architecture called Hyper-DMA (HDMA) for virtualization scalability. This is realized by decoupling physical DMA channels(PDMA) from potentially large number of virtual DMA channels (VDMA). The VDMAs are software abstractions that map to PDMAs for frame transmission and reception. Define new macros DW25GMAC_CORE_4_00 and DW25GMAC_ID to identify DW25GMAC device. To support the new HDMA architecture, a new instance of stmmac_dma_ops dw25gmac400_dma_ops is added. To support the current needs, a simple one-to-one mapping of dw25gmac's logical VDMA (channel) to TC to PDMAs is used. Most of the other dma operation functions in existing dwxgamc2_dma.c file are reused where applicable. Added setup function for DW25GMAC's stmmac_hwif_entry in stmmac core. Signed-off-by: Jitendra Vegiraju <jitendra.vegiraju@broadcom.com> Signed-off-by: NipaLocal <nipa@local>

Integrate dw25gmac support into stmmac hardware interface handling. Added a new entry to the stmmac_hw table in hwif.c. Since BCM8958x is an early adopter device, the synopsis_id reported in HW is 0x32 and device_id is DWXGMAC_ID. Provide override support by giving preference to snps_id, dev_id values when initialized to non-zero values by glue driver. Signed-off-by: Jitendra Vegiraju <jitendra.vegiraju@broadcom.com> Signed-off-by: NipaLocal <nipa@local>

Add PCI ethernet driver support for Broadcom BCM8958x SoC devices used in automotive applications. This SoC device has PCIe ethernet MAC attached to an integrated ethernet switch using XGMII interface. The PCIe ethernet controller is presented to the Linux host as PCI network device. The following block diagram gives an overview of the application. +=================================+ | Host CPU/Linux | +=================================+ || PCIe || +==========================================+ | +--------------+ | | | PCIE Endpoint| | | | Ethernet | | | | Controller | | | | DMA | | | +--------------+ | | | MAC | BCM8958X | | +--------------+ SoC | | || XGMII | | || | | +--------------+ | | | Ethernet | | | | switch | | | +--------------+ | | || || || || | +==========================================+ || || || || More external interfaces The MAC IP block on BCM8958x is based on Synopsis XGMAC 4.00a core. This driver uses common dwxgmac2 code where applicable. Driver functionality specific to this MAC is implemented in dw25gmac.c. Management of integrated ethernet switch on this SoC is not handled by the PCIe interface. Since BCM8958x is an early adopter device, override the hardware reported synopsis versions with actual DW25MAC versions that support hdma. This SoC device has PCIe ethernet MAC directly attached to an integrated ethernet switch using XGMII interface. Since device tree support is not available on this platform, a software node is created to enable fixed-link support using phylink driver. Signed-off-by: Jitendra Vegiraju <jitendra.vegiraju@broadcom.com> Signed-off-by: NipaLocal <nipa@local>

Add PCI driver for BCM8958x to the linux build system and update MAINTAINERS file. Signed-off-by: Jitendra Vegiraju <jitendra.vegiraju@broadcom.com> Signed-off-by: NipaLocal <nipa@local>

Add Fibocom FG132 0x0112 composition: T: Bus=03 Lev=02 Prnt=06 Port=01 Cnt=02 Dev#= 10 Spd=12 MxCh= 0 D: Ver= 2.01 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1 P: Vendor=2cb7 ProdID=0112 Rev= 5.15 S: Manufacturer=Fibocom Wireless Inc. S: Product=Fibocom Module S: SerialNumber=xxxxxxxx C:* #Ifs= 4 Cfg#= 1 Atr=a0 MxPwr=500mA I:* If#= 0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan E: Ad=82(I) Atr=03(Int.) MxPS= 8 Ivl=32ms E: Ad=81(I) Atr=02(Bulk) MxPS= 64 Ivl=0ms E: Ad=01(O) Atr=02(Bulk) MxPS= 64 Ivl=0ms I:* If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option E: Ad=02(O) Atr=02(Bulk) MxPS= 64 Ivl=0ms E: Ad=83(I) Atr=02(Bulk) MxPS= 64 Ivl=0ms I:* If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option E: Ad=85(I) Atr=03(Int.) MxPS= 10 Ivl=32ms E: Ad=84(I) Atr=02(Bulk) MxPS= 64 Ivl=0ms E: Ad=03(O) Atr=02(Bulk) MxPS= 64 Ivl=0ms I:* If#= 3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=00 Prot=00 Driver=option E: Ad=86(I) Atr=02(Bulk) MxPS= 64 Ivl=0ms E: Ad=04(O) Atr=02(Bulk) MxPS= 64 Ivl=0ms Signed-off-by: Reinhard Speyerer <rspmn@arcor.de> Signed-off-by: NipaLocal <nipa@local>

Currently we disable PCS ANE when the link speed is 2.5Gbps. mac_link_up callback internally calls the fix_mac_speed which internally calls stmmac_pcs_ctrl_ane to disable the ANE for 2.5Gbps. We observed that the CPU utilization is pretty high. That is because we saw that the PCS interrupt status line for Link and AN always remain asserted. Since we are disabling the PCS ANE for 2.5Gbps it makes sense to also disable the PCS link status and AN complete in the interrupt enable register. Interrupt storm Issue:- [ 25.465754][ C2] stmmac_pcs: Link Down [ 25.469888][ C2] stmmac_pcs: Link Down [ 25.474030][ C2] stmmac_pcs: Link Down [ 25.478164][ C2] stmmac_pcs: Link Down [ 25.482305][ C2] stmmac_pcs: Link Down [ 25.486441][ C2] stmmac_pcs: Link Down [ 25.486635][ C4] watchdog0: pretimeout event [ 25.490585][ C2] stmmac_pcs: Link Down [ 25.499341][ C2] stmmac_pcs: Link Down [ 25.503484][ C2] stmmac_pcs: Link Down [ 25.507619][ C2] stmmac_pcs: Link Down [ 25.511760][ C2] stmmac_pcs: Link Down [ 25.515897][ C2] stmmac_pcs: Link Down [ 25.520038][ C2] stmmac_pcs: Link Down [ 25.524174][ C2] stmmac_pcs: Link Down [ 25.528316][ C2] stmmac_pcs: Link Down [ 25.532451][ C2] stmmac_pcs: Link Down [ 25.536591][ C2] stmmac_pcs: Link Down [ 25.540724][ C2] stmmac_pcs: Link Down [ 25.544866][ C2] stmmac_pcs: Link Down Once we disabled PCS ANE and Link Status interrupt issue disappears. Fixes: a818bd1 ("net: stmmac: dwmac-qcom-ethqos: Add support for 2.5G SGMII") Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Signed-off-by: NipaLocal <nipa@local>

DualPI2 provides L4S-type low latency & loss to traffic that uses a scalable congestion controller (e.g. TCP-Prague, DCTCP) without degrading the performance of 'classic' traffic (e.g. Reno, Cubic etc.). It is intended to be the reference implementation of the IETF's DualQ Coupled AQM. The qdisc provides two queues called low latency and classic. It classifies packets based on the ECN field in the IP headers. By default it directs non-ECN and ECT(0) into the classic queue and ECT(1) and CE into the low latency queue, as per the IETF spec. Each queue runs its own AQM: * The classic AQM is called PI2, which is similar to the PIE AQM but more responsive and simpler. Classic traffic requires a decent target queue (default 15ms for Internet deployment) to fully utilize the link and to avoid high drop rates. * The low latency AQM is, by default, a very shallow ECN marking threshold (1ms) similar to that used for DCTCP. The DualQ isolates the low queuing delay of the Low Latency queue from the larger delay of the 'Classic' queue. However, from a bandwidth perspective, flows in either queue will share out the link capacity as if there was just a single queue. This bandwidth pooling effect is achieved by coupling together the drop and ECN-marking probabilities of the two AQMs. The PI2 AQM has two main parameters in addition to its target delay. All the defaults are suitable for any Internet setting, but it can be reconfigured for a Data Centre setting. The integral gain factor alpha is used to slowly correct any persistent standing queue error from the target delay, while the proportional gain factor beta is used to quickly compensate for queue changes (growth or shrinkage). Either alpha and beta are given as a parameter, or they can be calculated by tc from alternative typical and maximum RTT parameters. Internally, the output of a linear Proportional Integral (PI) controller is used for both queues. This output is squared to calculate the drop or ECN-marking probability of the classic queue. This counterbalances the square-root rate equation of Reno/Cubic, which is the trick that balances flow rates across the queues. For the ECN-marking probability of the low latency queue, the output of the base AQM is multiplied by a coupling factor. This determines the balance between the flow rates in each queue. The default setting makes the flow rates roughly equal, which should be generally applicable. If DUALPI2 AQM has detected overload (due to excessive non-responsive traffic in either queue), it will switch to signaling congestion solely using drop, irrespective of the ECN field. Alternatively, it can be configured to limit the drop probability and let the queue grow and eventually overflow (like tail-drop). Additional details can be found in the draft: https://datatracker.ietf.org/doc/html/rfc9332 Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> Co-developed-by: Olga Albisser <olga@albisser.org> Signed-off-by: Olga Albisser <olga@albisser.org> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Co-developed-by: Henrik Steen <henrist@henrist.net> Signed-off-by: Henrik Steen <henrist@henrist.net> Signed-off-by: Bob Briscoe <research@bobbriscoe.net> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

- Move tcp_count_delivered() earlier and split tcp_count_delivered_ce() out of it - Move tcp_in_ack_event() later - While at it, remove the inline from tcp_in_ack_event() and let the compiler to decide Accurate ECN's heuristics does not know if there is going to be ACE field based CE counter increase or not until after rtx queue has been processed. Only then the number of ACKed bytes/pkts is available. As CE or not affects presence of FLAG_ECE, that information for tcp_in_ack_event is not yet available in the old location of the call to tcp_in_ack_event(). Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

Whenever timestamp advances, it declares progress which can be used by the other parts of the stack to decide that the ACK is the most recent one seen so far. AccECN will use this flag when deciding whether to use the ACK to update AccECN state or not. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

Use BIT() macro for TCP flags field and TCP congestion control flags that will be used by the congestion control algorithm. No functional changes. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: NipaLocal <nipa@local>

With AccECN, there's one additional TCP flag to be used (AE) and ACE field that overloads the definition of AE, CWR, and ECE flags. As tcp_flags was previously only 1 byte, the byte-order stuff needs to be added to it's handling. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

Prepare for AccECN that needs to have access here on IP ECN field value which is only available after INET_ECN_xmit(). No functional changes. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

Rename tcp_ecn_check_ce to tcp_data_ecn_check as it is called only for data segments, not for ACKs (with AccECN, also ACKs may get ECN bits). The extra "layer" in tcp_ecn_check_ce() function just checks for ECN being enabled, that can be moved into tcp_ecn_field_check rather than having the __ variant. No functional changes. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

Create helpers for TCP ECN modes. No functional changes. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

Handling the CWR flag differs between RFC 3168 ECN and AccECN. With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN) CWR flag is cleared starting from 2nd segment which is incompatible how AccECN handles the CWR flag. Such super-segments are indicated by SKB_GSO_TCP_ECN. With AccECN, CWR flag (or more accurately, the ACE field that also includes ECE & AE flags) changes only when new packet(s) with CE mark arrives so the flag should not be changed within a super-skb. The new skb/feature flags are necessary to prevent such TSO engines corrupting AccECN ACE counters by clearing the CWR flag (if the CWR handling feature cannot be turned off). If NIC is completely unaware of RFC3168 ECN (doesn't support NETIF_F_TSO_ECN) or its TSO engine can be set to not touch CWR flag despite supporting also NETIF_F_TSO_ECN, TSO could be safely used with AccECN on such NIC. This should be evaluated per NIC basis (not done in this patch series for any NICs). For the cases, where TSO cannot keep its hands off the CWR flag, a GSO fallback is provided by this patch. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

There are important differences in how the CWR field behaves in RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter and its changes are important so adjust the flags changed mask accordingly. Also, if CWR is there, set the Accurate ECN GSO flag to avoid corrupting CWR flag somewhere. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

AE flag needs to be preserved for AccECN. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

AccECN connection's last ACK cannot retain ECT(1) as the bits are always cleared causing the packet to switch into another service queue. This effectively adds a finer-grained filtering for ECN bits so that acceptable TW ACKs can retain the bits. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

Accurate ECN needs to send custom flags to handle IP-ECN field reflection during handshake. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

The following patch will use tcp_ecn_mode_accecn(), TCP_ACCECN_CEP_INIT_OFFSET, TCP_ACCECN_CEP_ACE_MASK in __tcp_fast_path_on() to make new flag for AccECN. No functional changes. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chai-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

Add SYSCTL_FIVE for new AccECN feedback modes of net.ipv4.tcp_ecn. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: NipaLocal <nipa@local>

KMSAN detects uninitialized memory stored to memory by bpf_clone_redirect(). Adding a check to the transmission path to find malformed headers prevents this issue. Specifically, we check if the length of the data stored in skb is less than the minimum device header length. If so, drop the packet since the skb cannot contain a valid device header. Also check if mac_header_len(skb) is outside the range provided of valid device header lengths. Testing this patch with syzbot removes the bug. Fixes: 8826498 ("Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext") Reported-by: syzbot+346474e3bf0b26bd3090@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=346474e3bf0b26bd3090 Signed-off-by: Daniel Yang <danielyangkang@gmail.com> Signed-off-by: NipaLocal <nipa@local>

'ports_fwnode' is initialized via device_get_named_child_node(), which requires a call to fwnode_handle_put() when the variable is no longer required to avoid leaking memory. Add the missing fwnode_handle_put() after 'ports_fwnode' has been used and is no longer required. Fixes: 94a2a84 ("net: dsa: mv88e6xxx: Support LED control") Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: NipaLocal <nipa@local>

Instead of having them as individual fields in ptp_ops, wrap the coefficients in a separate struct so they can be referenced together. Fixes: de776d0 ("net: dsa: mv88e6xxx: add support for mv88e6393x family") Signed-off-by: Shenghao Yang <me@shenghaoyang.info> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: NipaLocal <nipa@local>

Instead of relying on a fixed mapping of hardware family to cycle counter frequency, pull this information from the MV88E6XXX_TAI_CLOCK_PERIOD register. This lets us support switches whose cycle counter frequencies depend on board design. Fixes: de776d0 ("net: dsa: mv88e6xxx: add support for mv88e6393x family") Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Shenghao Yang <me@shenghaoyang.info> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: NipaLocal <nipa@local>

The MV88E6393X family of devices can run its cycle counter off an internal 250MHz clock instead of an external 125MHz one. Add support for this cycle counter period by adding another set of coefficients and lowering the periodic cycle counter read interval to compensate for faster overflows at the increased frequency. Otherwise, the PHC runs at 2x real time in userspace and cannot be synchronized. Fixes: de776d0 ("net: dsa: mv88e6xxx: add support for mv88e6393x family") Signed-off-by: Shenghao Yang <me@shenghaoyang.info> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: NipaLocal <nipa@local>

Remove hard-coded strings by using the helper function str_on_off(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: NipaLocal <nipa@local>

Replace the custom BOOL_TO_STR() macro with the str_true_false() helper function and remove the macro. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: NipaLocal <nipa@local>

…iver This patch adds support for another Lenovo Mini dock 0x17EF:0x3098 to the r8152 driver. The device has been tested on NixOS, hotplugging and sleep included. Signed-off-by: Benjamin Große <ste3ls@gmail.com> Signed-off-by: NipaLocal <nipa@local>

tc_actions.sh keeps hanging the forwarding tests. sdf@: tdc & tdc-dbg started intermittenly failing around Sep 25th Signed-off-by: NipaLocal <nipa@local>

Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: NipaLocal <nipa@local>

The buffers can easily get large and fail allocations. One of the networking tests running in a VM tries to run perf which occasionally ends with: perf: page allocation failure: order:6, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0 Even tho free memory is available: free:97464 free_pcp:321 free_cma:0 Switch to kvzalloc() to make large allocations less likely to fail. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: NipaLocal <nipa@local>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

netdev CI testing #6666

netdev CI testing #6666

Commits on Oct 19, 2024

Commits on Oct 20, 2024

netdev CI testing #6666

Are you sure you want to change the base?

netdev CI testing #6666

Commits on Oct 19, 2024

Commits on Oct 20, 2024