From f8a2da6ec2417cca169fa85a8ab15817bccbb109 Mon Sep 17 00:00:00 2001 From: Marc Kleine-Budde Date: Tue, 18 Jul 2023 11:43:54 +0200 Subject: [PATCH 01/49] can: gs_usb: gs_can_close(): add missing set of CAN state to CAN_STATE_STOPPED After an initial link up the CAN device is in ERROR-ACTIVE mode. Due to a missing CAN_STATE_STOPPED in gs_can_close() it doesn't change to STOPPED after a link down: | ip link set dev can0 up | ip link set dev can0 down | ip --details link show can0 | 13: can0: mtu 16 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 10 | link/can promiscuity 0 allmulti 0 minmtu 0 maxmtu 0 | can state ERROR-ACTIVE restart-ms 1000 Add missing assignment of CAN_STATE_STOPPED in gs_can_close(). Cc: stable@vger.kernel.org Fixes: d08e973a77d1 ("can: gs_usb: Added support for the GS_USB CAN devices") Link: https://lore.kernel.org/all/20230718-gs_usb-fix-can-state-v1-1-f19738ae2c23@pengutronix.de Signed-off-by: Marc Kleine-Budde --- drivers/net/can/usb/gs_usb.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/can/usb/gs_usb.c b/drivers/net/can/usb/gs_usb.c index f418066569fcc5..bd9eb066ecf15d 100644 --- a/drivers/net/can/usb/gs_usb.c +++ b/drivers/net/can/usb/gs_usb.c @@ -1030,6 +1030,8 @@ static int gs_can_close(struct net_device *netdev) usb_kill_anchored_urbs(&dev->tx_submitted); atomic_set(&dev->active_tx_urbs, 0); + dev->can.state = CAN_STATE_STOPPED; + /* reset the device */ rc = gs_cmd_reset(dev); if (rc < 0) From 11c9027c983e9e4b408ee5613b6504d24ebd85be Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 20 Jul 2023 11:44:38 +0000 Subject: [PATCH 02/49] can: raw: fix lockdep issue in raw_release() syzbot complained about a lockdep issue [1] Since raw_bind() and raw_setsockopt() first get RTNL before locking the socket, we must adopt the same order in raw_release() [1] WARNING: possible circular locking dependency detected 6.5.0-rc1-syzkaller-00192-g78adb4bcf99e #0 Not tainted ------------------------------------------------------ syz-executor.0/14110 is trying to acquire lock: ffff88804e4b6130 (sk_lock-AF_CAN){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1708 [inline] ffff88804e4b6130 (sk_lock-AF_CAN){+.+.}-{0:0}, at: raw_bind+0xb1/0xab0 net/can/raw.c:435 but task is already holding lock: ffffffff8e3df368 (rtnl_mutex){+.+.}-{3:3}, at: raw_bind+0xa7/0xab0 net/can/raw.c:434 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (rtnl_mutex){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:603 [inline] __mutex_lock+0x181/0x1340 kernel/locking/mutex.c:747 raw_release+0x1c6/0x9b0 net/can/raw.c:391 __sock_release+0xcd/0x290 net/socket.c:654 sock_close+0x1c/0x20 net/socket.c:1386 __fput+0x3fd/0xac0 fs/file_table.c:384 task_work_run+0x14d/0x240 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x210/0x240 kernel/entry/common.c:204 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline] syscall_exit_to_user_mode+0x1d/0x50 kernel/entry/common.c:297 do_syscall_64+0x44/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (sk_lock-AF_CAN){+.+.}-{0:0}: check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x2e3d/0x5de0 kernel/locking/lockdep.c:5144 lock_acquire kernel/locking/lockdep.c:5761 [inline] lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5726 lock_sock_nested+0x3a/0xf0 net/core/sock.c:3492 lock_sock include/net/sock.h:1708 [inline] raw_bind+0xb1/0xab0 net/can/raw.c:435 __sys_bind+0x1ec/0x220 net/socket.c:1792 __do_sys_bind net/socket.c:1803 [inline] __se_sys_bind net/socket.c:1801 [inline] __x64_sys_bind+0x72/0xb0 net/socket.c:1801 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(rtnl_mutex); lock(sk_lock-AF_CAN); lock(rtnl_mutex); lock(sk_lock-AF_CAN); *** DEADLOCK *** 1 lock held by syz-executor.0/14110: stack backtrace: CPU: 0 PID: 14110 Comm: syz-executor.0 Not tainted 6.5.0-rc1-syzkaller-00192-g78adb4bcf99e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106 check_noncircular+0x311/0x3f0 kernel/locking/lockdep.c:2195 check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x2e3d/0x5de0 kernel/locking/lockdep.c:5144 lock_acquire kernel/locking/lockdep.c:5761 [inline] lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5726 lock_sock_nested+0x3a/0xf0 net/core/sock.c:3492 lock_sock include/net/sock.h:1708 [inline] raw_bind+0xb1/0xab0 net/can/raw.c:435 __sys_bind+0x1ec/0x220 net/socket.c:1792 __do_sys_bind net/socket.c:1803 [inline] __se_sys_bind net/socket.c:1801 [inline] __x64_sys_bind+0x72/0xb0 net/socket.c:1801 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fd89007cb29 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fd890d2a0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000031 RAX: ffffffffffffffda RBX: 00007fd89019bf80 RCX: 00007fd89007cb29 RDX: 0000000000000010 RSI: 0000000020000040 RDI: 0000000000000003 RBP: 00007fd8900c847a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fd89019bf80 R15: 00007ffebf8124f8 Fixes: ee8b94c8510c ("can: raw: fix receiver memory leak") Reported-by: syzbot Signed-off-by: Eric Dumazet Cc: Ziyang Xuan Cc: Oliver Hartkopp Cc: stable@vger.kernel.org Cc: Marc Kleine-Budde Link: https://lore.kernel.org/all/20230720114438.172434-1-edumazet@google.com Signed-off-by: Marc Kleine-Budde --- net/can/raw.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/can/raw.c b/net/can/raw.c index 2302e488296773..ba6b52b1d7767f 100644 --- a/net/can/raw.c +++ b/net/can/raw.c @@ -386,9 +386,9 @@ static int raw_release(struct socket *sock) list_del(&ro->notifier); spin_unlock(&raw_notifier_lock); + rtnl_lock(); lock_sock(sk); - rtnl_lock(); /* remove current filters & unregister */ if (ro->bound) { if (ro->dev) @@ -405,12 +405,13 @@ static int raw_release(struct socket *sock) ro->dev = NULL; ro->count = 0; free_percpu(ro->uniq); - rtnl_unlock(); sock_orphan(sk); sock->sk = NULL; release_sock(sk); + rtnl_unlock(); + sock_put(sk); return 0; From 043b1f185fb0f3939b7427f634787706f45411c4 Mon Sep 17 00:00:00 2001 From: Wang Ming Date: Thu, 13 Jul 2023 09:42:39 +0800 Subject: [PATCH 03/49] i40e: Fix an NULL vs IS_ERR() bug for debugfs_create_dir() The debugfs_create_dir() function returns error pointers. It never returns NULL. Most incorrect error checks were fixed, but the one in i40e_dbg_init() was forgotten. Fix the remaining error check. Fixes: 02e9c290814c ("i40e: debugfs interface") Signed-off-by: Wang Ming Tested-by: Pucha Himasekhar Reddy (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/i40e/i40e_debugfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c index 9954493cd44893..62497f5565c59d 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c +++ b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c @@ -1839,7 +1839,7 @@ void i40e_dbg_pf_exit(struct i40e_pf *pf) void i40e_dbg_init(void) { i40e_dbg_root = debugfs_create_dir(i40e_driver_name, NULL); - if (!i40e_dbg_root) + if (IS_ERR(i40e_dbg_root)) pr_info("init of debugfs failed\n"); } From a2f054c10bef0b54600ec9cb776508443e941343 Mon Sep 17 00:00:00 2001 From: Jacob Keller Date: Mon, 10 Jul 2023 13:41:27 -0700 Subject: [PATCH 04/49] iavf: fix potential deadlock on allocation failure In iavf_adminq_task(), if kzalloc() fails to allocate the event.msg_buf, the function will exit without releasing the adapter->crit_lock. This is unlikely, but if it happens, the next access to that mutex will deadlock. Fix this by moving the unlock to the end of the function, and adding a new label to allow jumping to the unlock portion of the function exit flow. Fixes: fc2e6b3b132a ("iavf: Rework mutexes for better synchronisation") Signed-off-by: Jacob Keller Tested-by: Rafal Romanowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/iavf/iavf_main.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c index 3a88d413ddee36..939c8126eb437a 100644 --- a/drivers/net/ethernet/intel/iavf/iavf_main.c +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c @@ -3264,7 +3264,7 @@ static void iavf_adminq_task(struct work_struct *work) event.buf_len = IAVF_MAX_AQ_BUF_SIZE; event.msg_buf = kzalloc(event.buf_len, GFP_KERNEL); if (!event.msg_buf) - goto out; + goto unlock; do { ret = iavf_clean_arq_element(hw, &event, &pending); @@ -3279,7 +3279,6 @@ static void iavf_adminq_task(struct work_struct *work) if (pending != 0) memset(event.msg_buf, 0, IAVF_MAX_AQ_BUF_SIZE); } while (pending); - mutex_unlock(&adapter->crit_lock); if (iavf_is_reset_in_progress(adapter)) goto freedom; @@ -3323,6 +3322,8 @@ static void iavf_adminq_task(struct work_struct *work) freedom: kfree(event.msg_buf); +unlock: + mutex_unlock(&adapter->crit_lock); out: /* re-enable Admin queue interrupt cause */ iavf_misc_irq_enable(adapter); From 91896c8acce23d33ed078cffd46a9534b1f82be5 Mon Sep 17 00:00:00 2001 From: Jacob Keller Date: Mon, 10 Jul 2023 13:41:28 -0700 Subject: [PATCH 05/49] iavf: check for removal state before IAVF_FLAG_PF_COMMS_FAILED In iavf_adminq_task(), if the function can't acquire the adapter->crit_lock, it checks if the driver is removing. If so, it simply exits without re-enabling the interrupt. This is done to ensure that the task stops processing as soon as possible once the driver is being removed. However, if the IAVF_FLAG_PF_COMMS_FAILED is set, the function checks this before attempting to acquire the lock. In this case, the function exits early and re-enables the interrupt. This will happen even if the driver is already removing. Avoid this, by moving the check to after the adapter->crit_lock is acquired. This way, if the driver is removing, we will not re-enable the interrupt. Fixes: fc2e6b3b132a ("iavf: Rework mutexes for better synchronisation") Signed-off-by: Jacob Keller Tested-by: Rafal Romanowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/iavf/iavf_main.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c index 939c8126eb437a..9610ca770349e4 100644 --- a/drivers/net/ethernet/intel/iavf/iavf_main.c +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c @@ -3250,9 +3250,6 @@ static void iavf_adminq_task(struct work_struct *work) u32 val, oldval; u16 pending; - if (adapter->flags & IAVF_FLAG_PF_COMMS_FAILED) - goto out; - if (!mutex_trylock(&adapter->crit_lock)) { if (adapter->state == __IAVF_REMOVE) return; @@ -3261,6 +3258,9 @@ static void iavf_adminq_task(struct work_struct *work) goto out; } + if (adapter->flags & IAVF_FLAG_PF_COMMS_FAILED) + goto unlock; + event.buf_len = IAVF_MAX_AQ_BUF_SIZE; event.msg_buf = kzalloc(event.buf_len, GFP_KERNEL); if (!event.msg_buf) From 32ad45b76990ece9c5dd1fe7aae6e688c3baa647 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Thu, 20 Jul 2023 09:13:23 -0700 Subject: [PATCH 06/49] docs: net: clarify the NAPI rules around XDP Tx page pool and XDP should not be accessed from IRQ context which may happen if drivers try to clean up XDP TX with NAPI budget of 0. Link: https://lore.kernel.org/r/20230720161323.2025379-1-kuba@kernel.org Signed-off-by: Jakub Kicinski --- Documentation/networking/napi.rst | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst index a7a047742e9341..7bf7b95c4f7af3 100644 --- a/Documentation/networking/napi.rst +++ b/Documentation/networking/napi.rst @@ -65,15 +65,16 @@ argument - drivers can process completions for any number of Tx packets but should only process up to ``budget`` number of Rx packets. Rx processing is usually much more expensive. -In other words, it is recommended to ignore the budget argument when -performing TX buffer reclamation to ensure that the reclamation is not -arbitrarily bounded; however, it is required to honor the budget argument -for RX processing. +In other words for Rx processing the ``budget`` argument limits how many +packets driver can process in a single poll. Rx specific APIs like page +pool or XDP cannot be used at all when ``budget`` is 0. +skb Tx processing should happen regardless of the ``budget``, but if +the argument is 0 driver cannot call any XDP (or page pool) APIs. .. warning:: - The ``budget`` argument may be 0 if core tries to only process Tx completions - and no Rx packets. + The ``budget`` argument may be 0 if core tries to only process + skb Tx completions and no Rx or XDP packets. The poll method returns the amount of work done. If the driver still has outstanding work to do (e.g. ``budget`` was exhausted) From c7b75bea853daeb64fc831dbf39a6bbabcc402ac Mon Sep 17 00:00:00 2001 From: Jiawen Wu Date: Wed, 19 Jul 2023 17:22:33 +0800 Subject: [PATCH 07/49] net: phy: marvell10g: fix 88x3310 power up Clear MV_V2_PORT_CTRL_PWRDOWN bit to set power up for 88x3310 PHY, it sometimes does not take effect immediately. And a read of this register causes the bit not to clear. This will cause mv3310_reset() to time out, which will fail the config initialization. So add a delay before the next access. Fixes: c9cc1c815d36 ("net: phy: marvell10g: place in powersave mode at probe") Signed-off-by: Jiawen Wu Reviewed-by: Russell King (Oracle) Signed-off-by: David S. Miller --- drivers/net/phy/marvell10g.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/net/phy/marvell10g.c b/drivers/net/phy/marvell10g.c index 55d9d7acc32eb0..d4bb90d7688111 100644 --- a/drivers/net/phy/marvell10g.c +++ b/drivers/net/phy/marvell10g.c @@ -328,6 +328,13 @@ static int mv3310_power_up(struct phy_device *phydev) ret = phy_clear_bits_mmd(phydev, MDIO_MMD_VEND2, MV_V2_PORT_CTRL, MV_V2_PORT_CTRL_PWRDOWN); + /* Sometimes, the power down bit doesn't clear immediately, and + * a read of this register causes the bit not to clear. Delay + * 100us to allow the PHY to come out of power down mode before + * the next access. + */ + udelay(100); + if (phydev->drv->phy_id != MARVELL_PHY_ID_88X3310 || priv->firmware_ver < 0x00030000) return ret; From b27d0232e8897f7c896dc8ad80c9907dd57fd3f3 Mon Sep 17 00:00:00 2001 From: Hao Lan Date: Thu, 20 Jul 2023 10:05:07 +0800 Subject: [PATCH 08/49] net: hns3: fix the imp capability bit cannot exceed 32 bits issue Current only the first 32 bits of the capability flag bit are considered. When the matching capability flag bit is greater than 31 bits, it will get an error bit.This patch use bitmap to solve this issue. It can handle each capability bit whitout bit width limit. Fixes: da77aef9cc58 ("net: hns3: create common cmdq resource allocate/free/query APIs") Signed-off-by: Hao Lan Signed-off-by: Jijie Shao Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 3 ++- .../hns3/hns3_common/hclge_comm_cmd.c | 21 ++++++++++++++++--- 2 files changed, 20 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index b99d75260d5947..7de167329b1e01 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -407,7 +408,7 @@ struct hnae3_ae_dev { unsigned long hw_err_reset_req; struct hnae3_dev_specs dev_specs; u32 dev_version; - unsigned long caps[BITS_TO_LONGS(HNAE3_DEV_CAPS_MAX_NUM)]; + DECLARE_BITMAP(caps, HNAE3_DEV_CAPS_MAX_NUM); void *priv; }; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c index b85c412683ddc2..16ba98ff2c9b1a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c @@ -171,6 +171,20 @@ static const struct hclge_comm_caps_bit_map hclge_vf_cmd_caps[] = { {HCLGE_COMM_CAP_GRO_B, HNAE3_DEV_SUPPORT_GRO_B}, }; +static void +hclge_comm_capability_to_bitmap(unsigned long *bitmap, __le32 *caps) +{ + const unsigned int words = HCLGE_COMM_QUERY_CAP_LENGTH; + u32 val[HCLGE_COMM_QUERY_CAP_LENGTH]; + unsigned int i; + + for (i = 0; i < words; i++) + val[i] = __le32_to_cpu(caps[i]); + + bitmap_from_arr32(bitmap, val, + HCLGE_COMM_QUERY_CAP_LENGTH * BITS_PER_TYPE(u32)); +} + static void hclge_comm_parse_capability(struct hnae3_ae_dev *ae_dev, bool is_pf, struct hclge_comm_query_version_cmd *cmd) @@ -179,11 +193,12 @@ hclge_comm_parse_capability(struct hnae3_ae_dev *ae_dev, bool is_pf, is_pf ? hclge_pf_cmd_caps : hclge_vf_cmd_caps; u32 size = is_pf ? ARRAY_SIZE(hclge_pf_cmd_caps) : ARRAY_SIZE(hclge_vf_cmd_caps); - u32 caps, i; + DECLARE_BITMAP(caps, HCLGE_COMM_QUERY_CAP_LENGTH * BITS_PER_TYPE(u32)); + u32 i; - caps = __le32_to_cpu(cmd->caps[0]); + hclge_comm_capability_to_bitmap(caps, cmd->caps); for (i = 0; i < size; i++) - if (hnae3_get_bit(caps, caps_map[i].imp_bit)) + if (test_bit(caps_map[i].imp_bit, caps)) set_bit(caps_map[i].local_bit, ae_dev->caps); } From 6d2336120aa6e1a8a64fa5d6ee5c3f3d0809fe9b Mon Sep 17 00:00:00 2001 From: Hao Lan Date: Thu, 20 Jul 2023 10:05:08 +0800 Subject: [PATCH 09/49] net: hns3: add tm flush when setting tm When the tm module is configured with traffic, traffic may be abnormal. This patch fixes this problem. Before the tm module is configured, traffic processing should be stopped. After the tm module is configured, traffic processing is enabled. Signed-off-by: Hao Lan Signed-off-by: Jijie Shao Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 4 +++ .../hns3/hns3_common/hclge_comm_cmd.c | 1 + .../hns3/hns3_common/hclge_comm_cmd.h | 2 ++ .../ethernet/hisilicon/hns3/hns3_debugfs.c | 3 ++ .../hisilicon/hns3/hns3pf/hclge_dcb.c | 34 ++++++++++++++++--- .../ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 31 ++++++++++++++++- .../ethernet/hisilicon/hns3/hns3pf/hclge_tm.h | 4 +++ 7 files changed, 73 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 7de167329b1e01..514a20bce4f449 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -102,6 +102,7 @@ enum HNAE3_DEV_CAP_BITS { HNAE3_DEV_SUPPORT_FEC_STATS_B, HNAE3_DEV_SUPPORT_LANE_NUM_B, HNAE3_DEV_SUPPORT_WOL_B, + HNAE3_DEV_SUPPORT_TM_FLUSH_B, }; #define hnae3_ae_dev_fd_supported(ae_dev) \ @@ -173,6 +174,9 @@ enum HNAE3_DEV_CAP_BITS { #define hnae3_ae_dev_wol_supported(ae_dev) \ test_bit(HNAE3_DEV_SUPPORT_WOL_B, (ae_dev)->caps) +#define hnae3_ae_dev_tm_flush_supported(hdev) \ + test_bit(HNAE3_DEV_SUPPORT_TM_FLUSH_B, (hdev)->ae_dev->caps) + enum HNAE3_PF_CAP_BITS { HNAE3_PF_SUPPORT_VLAN_FLTR_MDF_B = 0, }; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c index 16ba98ff2c9b1a..dcecb23daac6e1 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c @@ -156,6 +156,7 @@ static const struct hclge_comm_caps_bit_map hclge_pf_cmd_caps[] = { {HCLGE_COMM_CAP_FEC_STATS_B, HNAE3_DEV_SUPPORT_FEC_STATS_B}, {HCLGE_COMM_CAP_LANE_NUM_B, HNAE3_DEV_SUPPORT_LANE_NUM_B}, {HCLGE_COMM_CAP_WOL_B, HNAE3_DEV_SUPPORT_WOL_B}, + {HCLGE_COMM_CAP_TM_FLUSH_B, HNAE3_DEV_SUPPORT_TM_FLUSH_B}, }; static const struct hclge_comm_caps_bit_map hclge_vf_cmd_caps[] = { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h index 18f1b4bf362da9..2b7197ce0ae8fc 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h @@ -153,6 +153,7 @@ enum hclge_opcode_type { HCLGE_OPC_TM_INTERNAL_STS = 0x0850, HCLGE_OPC_TM_INTERNAL_CNT = 0x0851, HCLGE_OPC_TM_INTERNAL_STS_1 = 0x0852, + HCLGE_OPC_TM_FLUSH = 0x0872, /* Packet buffer allocate commands */ HCLGE_OPC_TX_BUFF_ALLOC = 0x0901, @@ -349,6 +350,7 @@ enum HCLGE_COMM_CAP_BITS { HCLGE_COMM_CAP_FEC_STATS_B = 25, HCLGE_COMM_CAP_LANE_NUM_B = 27, HCLGE_COMM_CAP_WOL_B = 28, + HCLGE_COMM_CAP_TM_FLUSH_B = 31, }; enum HCLGE_COMM_API_CAP_BITS { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index 6546cfe7f7cc71..52546f625c8b0b 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -411,6 +411,9 @@ static struct hns3_dbg_cap_info hns3_dbg_cap[] = { }, { .name = "support wake on lan", .cap_bit = HNAE3_DEV_SUPPORT_WOL_B, + }, { + .name = "support tm flush", + .cap_bit = HNAE3_DEV_SUPPORT_TM_FLUSH_B, } }; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c index c4aded65e848bf..cda7e0d0ba56ef 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c @@ -216,6 +216,10 @@ static int hclge_notify_down_uinit(struct hclge_dev *hdev) if (ret) return ret; + ret = hclge_tm_flush_cfg(hdev, true); + if (ret) + return ret; + return hclge_notify_client(hdev, HNAE3_UNINIT_CLIENT); } @@ -227,6 +231,10 @@ static int hclge_notify_init_up(struct hclge_dev *hdev) if (ret) return ret; + ret = hclge_tm_flush_cfg(hdev, false); + if (ret) + return ret; + return hclge_notify_client(hdev, HNAE3_UP_CLIENT); } @@ -313,6 +321,7 @@ static int hclge_ieee_setpfc(struct hnae3_handle *h, struct ieee_pfc *pfc) struct net_device *netdev = h->kinfo.netdev; struct hclge_dev *hdev = vport->back; u8 i, j, pfc_map, *prio_tc; + int last_bad_ret = 0; int ret; if (!(hdev->dcbx_cap & DCB_CAP_DCBX_VER_IEEE)) @@ -350,13 +359,28 @@ static int hclge_ieee_setpfc(struct hnae3_handle *h, struct ieee_pfc *pfc) if (ret) return ret; - ret = hclge_buffer_alloc(hdev); - if (ret) { - hclge_notify_client(hdev, HNAE3_UP_CLIENT); + ret = hclge_tm_flush_cfg(hdev, true); + if (ret) return ret; - } - return hclge_notify_client(hdev, HNAE3_UP_CLIENT); + /* No matter whether the following operations are performed + * successfully or not, disabling the tm flush and notify + * the network status to up are necessary. + * Do not return immediately. + */ + ret = hclge_buffer_alloc(hdev); + if (ret) + last_bad_ret = ret; + + ret = hclge_tm_flush_cfg(hdev, false); + if (ret) + last_bad_ret = ret; + + ret = hclge_notify_client(hdev, HNAE3_UP_CLIENT); + if (ret) + last_bad_ret = ret; + + return last_bad_ret; } static int hclge_ieee_setapp(struct hnae3_handle *h, struct dcb_app *app) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index 922c0da3660c7b..c40ea6b8c8ec64 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -1484,7 +1484,11 @@ int hclge_tm_schd_setup_hw(struct hclge_dev *hdev) return ret; /* Cfg schd mode for each level schd */ - return hclge_tm_schd_mode_hw(hdev); + ret = hclge_tm_schd_mode_hw(hdev); + if (ret) + return ret; + + return hclge_tm_flush_cfg(hdev, false); } static int hclge_pause_param_setup_hw(struct hclge_dev *hdev) @@ -2113,3 +2117,28 @@ int hclge_tm_get_port_shaper(struct hclge_dev *hdev, return 0; } + +int hclge_tm_flush_cfg(struct hclge_dev *hdev, bool enable) +{ + struct hclge_desc desc; + int ret; + + if (!hnae3_ae_dev_tm_flush_supported(hdev)) + return 0; + + hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_TM_FLUSH, false); + + desc.data[0] = cpu_to_le32(enable ? HCLGE_TM_FLUSH_EN_MSK : 0); + + ret = hclge_cmd_send(&hdev->hw, &desc, 1); + if (ret) { + dev_err(&hdev->pdev->dev, + "failed to config tm flush, ret = %d\n", ret); + return ret; + } + + if (enable) + msleep(HCLGE_TM_FLUSH_TIME_MS); + + return ret; +} diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h index dd6f1fd486cf24..45dcfef3f90cca 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h @@ -33,6 +33,9 @@ enum hclge_opcode_type; #define HCLGE_DSCP_MAP_TC_BD_NUM 2 #define HCLGE_DSCP_TC_SHIFT(n) (((n) & 1) * 4) +#define HCLGE_TM_FLUSH_TIME_MS 10 +#define HCLGE_TM_FLUSH_EN_MSK BIT(0) + struct hclge_pg_to_pri_link_cmd { u8 pg_id; u8 rsvd1[3]; @@ -272,4 +275,5 @@ int hclge_tm_get_port_shaper(struct hclge_dev *hdev, struct hclge_tm_shaper_para *para); int hclge_up_to_tc_map(struct hclge_dev *hdev); int hclge_dscp_to_tc_map(struct hclge_dev *hdev); +int hclge_tm_flush_cfg(struct hclge_dev *hdev, bool enable); #endif From 116d9f732eef634abbd871f2c6f613a5b4677742 Mon Sep 17 00:00:00 2001 From: Jijie Shao Date: Thu, 20 Jul 2023 10:05:09 +0800 Subject: [PATCH 10/49] net: hns3: fix wrong tc bandwidth weight data issue Currently, the weight saved by the driver is used as the query result, which may be different from the actual weight in the register. Therefore, the register value read from the firmware is used as the query result Fixes: 0e32038dc856 ("net: hns3: refactor dump tc of debugfs") Signed-off-by: Jijie Shao Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c index 233c132dc513e0..409db2e7096518 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c @@ -693,8 +693,7 @@ static int hclge_dbg_dump_tc(struct hclge_dev *hdev, char *buf, int len) for (i = 0; i < HNAE3_MAX_TC; i++) { sch_mode_str = ets_weight->tc_weight[i] ? "dwrr" : "sp"; pos += scnprintf(buf + pos, len - pos, "%u %4s %3u\n", - i, sch_mode_str, - hdev->tm_info.pg_info[0].tc_dwrr[i]); + i, sch_mode_str, ets_weight->tc_weight[i]); } return 0; From 882481b1c55fc44861d7e2d54b4e0936b1b39f2c Mon Sep 17 00:00:00 2001 From: Jijie Shao Date: Thu, 20 Jul 2023 10:05:10 +0800 Subject: [PATCH 11/49] net: hns3: fix wrong bw weight of disabled tc issue In dwrr mode, the default bandwidth weight of disabled tc is set to 0. If the bandwidth weight is 0, the mode will change to sp. Therefore, disabled tc default bandwidth weight need changed to 1, and 0 is returned when query the bandwidth weight of disabled tc. In addition, driver need stop configure bandwidth weight if tc is disabled. Fixes: 848440544b41 ("net: hns3: Add support of TX Scheduler & Shaper to HNS3 driver") Signed-off-by: Jie Wang Signed-off-by: Jijie Shao Signed-off-by: David S. Miller --- .../ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c | 17 ++++++++++++++--- .../ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 3 ++- 2 files changed, 16 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c index cda7e0d0ba56ef..fad5a5ff3cda54 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c @@ -52,7 +52,10 @@ static void hclge_tm_info_to_ieee_ets(struct hclge_dev *hdev, for (i = 0; i < HNAE3_MAX_TC; i++) { ets->prio_tc[i] = hdev->tm_info.prio_tc[i]; - ets->tc_tx_bw[i] = hdev->tm_info.pg_info[0].tc_dwrr[i]; + if (i < hdev->tm_info.num_tc) + ets->tc_tx_bw[i] = hdev->tm_info.pg_info[0].tc_dwrr[i]; + else + ets->tc_tx_bw[i] = 0; if (hdev->tm_info.tc_info[i].tc_sch_mode == HCLGE_SCH_MODE_SP) @@ -123,7 +126,8 @@ static u8 hclge_ets_tc_changed(struct hclge_dev *hdev, struct ieee_ets *ets, } static int hclge_ets_sch_mode_validate(struct hclge_dev *hdev, - struct ieee_ets *ets, bool *changed) + struct ieee_ets *ets, bool *changed, + u8 tc_num) { bool has_ets_tc = false; u32 total_ets_bw = 0; @@ -137,6 +141,13 @@ static int hclge_ets_sch_mode_validate(struct hclge_dev *hdev, *changed = true; break; case IEEE_8021QAZ_TSA_ETS: + if (i >= tc_num) { + dev_err(&hdev->pdev->dev, + "tc%u is disabled, cannot set ets bw\n", + i); + return -EINVAL; + } + /* The hardware will switch to sp mode if bandwidth is * 0, so limit ets bandwidth must be greater than 0. */ @@ -176,7 +187,7 @@ static int hclge_ets_validate(struct hclge_dev *hdev, struct ieee_ets *ets, if (ret) return ret; - ret = hclge_ets_sch_mode_validate(hdev, ets, changed); + ret = hclge_ets_sch_mode_validate(hdev, ets, changed, tc_num); if (ret) return ret; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index c40ea6b8c8ec64..de509e5751a7c5 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -785,6 +785,7 @@ static void hclge_tm_tc_info_init(struct hclge_dev *hdev) static void hclge_tm_pg_info_init(struct hclge_dev *hdev) { #define BW_PERCENT 100 +#define DEFAULT_BW_WEIGHT 1 u8 i; @@ -806,7 +807,7 @@ static void hclge_tm_pg_info_init(struct hclge_dev *hdev) for (k = 0; k < hdev->tm_info.num_tc; k++) hdev->tm_info.pg_info[i].tc_dwrr[k] = BW_PERCENT; for (; k < HNAE3_MAX_TC; k++) - hdev->tm_info.pg_info[i].tc_dwrr[k] = 0; + hdev->tm_info.pg_info[i].tc_dwrr[k] = DEFAULT_BW_WEIGHT; } } From 94d166c5318c6edd1e079df8552233443e909c33 Mon Sep 17 00:00:00 2001 From: Jiri Benc Date: Thu, 20 Jul 2023 11:05:56 +0200 Subject: [PATCH 12/49] vxlan: calculate correct header length for GPE VXLAN-GPE does not add an extra inner Ethernet header. Take that into account when calculating header length. This causes problems in skb_tunnel_check_pmtu, where incorrect PMTU is cached. In the collect_md mode (which is the only mode that VXLAN-GPE supports), there's no magic auto-setting of the tunnel interface MTU. It can't be, since the destination and thus the underlying interface may be different for each packet. So, the administrator is responsible for setting the correct tunnel interface MTU. Apparently, the administrators are capable enough to calculate that the maximum MTU for VXLAN-GPE is (their_lower_MTU - 36). They set the tunnel interface MTU to 1464. If you run a TCP stream over such interface, it's then segmented according to the MTU 1464, i.e. producing 1514 bytes frames. Which is okay, this still fits the lower MTU. However, skb_tunnel_check_pmtu (called from vxlan_xmit_one) uses 50 as the header size and thus incorrectly calculates the frame size to be 1528. This leads to ICMP too big message being generated (locally), PMTU of 1450 to be cached and the TCP stream to be resegmented. The fix is to use the correct actual header size, especially for skb_tunnel_check_pmtu calculation. Fixes: e1e5314de08ba ("vxlan: implement GPE") Signed-off-by: Jiri Benc Reviewed-by: Simon Horman Signed-off-by: David S. Miller --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +- drivers/net/vxlan/vxlan_core.c | 23 ++++++++----------- include/net/vxlan.h | 13 +++++++---- 3 files changed, 20 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 1726297f2e0df0..8eb9839a3ca690 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -8479,7 +8479,7 @@ static void ixgbe_atr(struct ixgbe_ring *ring, struct ixgbe_adapter *adapter = q_vector->adapter; if (unlikely(skb_tail_pointer(skb) < hdr.network + - VXLAN_HEADROOM)) + vxlan_headroom(0))) return; /* verify the port is recognized as VXLAN */ diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c index 78744549c1b30d..42be8a26c17156 100644 --- a/drivers/net/vxlan/vxlan_core.c +++ b/drivers/net/vxlan/vxlan_core.c @@ -2516,7 +2516,7 @@ void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, } ndst = &rt->dst; - err = skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM, + err = skb_tunnel_check_pmtu(skb, ndst, vxlan_headroom(flags & VXLAN_F_GPE), netif_is_any_bridge_port(dev)); if (err < 0) { goto tx_error; @@ -2577,7 +2577,8 @@ void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, goto out_unlock; } - err = skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM, + err = skb_tunnel_check_pmtu(skb, ndst, + vxlan_headroom((flags & VXLAN_F_GPE) | VXLAN_F_IPV6), netif_is_any_bridge_port(dev)); if (err < 0) { goto tx_error; @@ -2989,14 +2990,12 @@ static int vxlan_change_mtu(struct net_device *dev, int new_mtu) struct vxlan_rdst *dst = &vxlan->default_dst; struct net_device *lowerdev = __dev_get_by_index(vxlan->net, dst->remote_ifindex); - bool use_ipv6 = !!(vxlan->cfg.flags & VXLAN_F_IPV6); /* This check is different than dev->max_mtu, because it looks at * the lowerdev->mtu, rather than the static dev->max_mtu */ if (lowerdev) { - int max_mtu = lowerdev->mtu - - (use_ipv6 ? VXLAN6_HEADROOM : VXLAN_HEADROOM); + int max_mtu = lowerdev->mtu - vxlan_headroom(vxlan->cfg.flags); if (new_mtu > max_mtu) return -EINVAL; } @@ -3644,11 +3643,11 @@ static void vxlan_config_apply(struct net_device *dev, struct vxlan_dev *vxlan = netdev_priv(dev); struct vxlan_rdst *dst = &vxlan->default_dst; unsigned short needed_headroom = ETH_HLEN; - bool use_ipv6 = !!(conf->flags & VXLAN_F_IPV6); int max_mtu = ETH_MAX_MTU; + u32 flags = conf->flags; if (!changelink) { - if (conf->flags & VXLAN_F_GPE) + if (flags & VXLAN_F_GPE) vxlan_raw_setup(dev); else vxlan_ether_setup(dev); @@ -3673,8 +3672,7 @@ static void vxlan_config_apply(struct net_device *dev, dev->needed_tailroom = lowerdev->needed_tailroom; - max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM : - VXLAN_HEADROOM); + max_mtu = lowerdev->mtu - vxlan_headroom(flags); if (max_mtu < ETH_MIN_MTU) max_mtu = ETH_MIN_MTU; @@ -3685,10 +3683,9 @@ static void vxlan_config_apply(struct net_device *dev, if (dev->mtu > max_mtu) dev->mtu = max_mtu; - if (use_ipv6 || conf->flags & VXLAN_F_COLLECT_METADATA) - needed_headroom += VXLAN6_HEADROOM; - else - needed_headroom += VXLAN_HEADROOM; + if (flags & VXLAN_F_COLLECT_METADATA) + flags |= VXLAN_F_IPV6; + needed_headroom += vxlan_headroom(flags); dev->needed_headroom = needed_headroom; memcpy(&vxlan->cfg, conf, sizeof(*conf)); diff --git a/include/net/vxlan.h b/include/net/vxlan.h index 0be91ca78d3abf..1648240c966801 100644 --- a/include/net/vxlan.h +++ b/include/net/vxlan.h @@ -386,10 +386,15 @@ static inline netdev_features_t vxlan_features_check(struct sk_buff *skb, return features; } -/* IP header + UDP + VXLAN + Ethernet header */ -#define VXLAN_HEADROOM (20 + 8 + 8 + 14) -/* IPv6 header + UDP + VXLAN + Ethernet header */ -#define VXLAN6_HEADROOM (40 + 8 + 8 + 14) +static inline int vxlan_headroom(u32 flags) +{ + /* VXLAN: IP4/6 header + UDP + VXLAN + Ethernet header */ + /* VXLAN-GPE: IP4/6 header + UDP + VXLAN */ + return (flags & VXLAN_F_IPV6 ? sizeof(struct ipv6hdr) : + sizeof(struct iphdr)) + + sizeof(struct udphdr) + sizeof(struct vxlanhdr) + + (flags & VXLAN_F_GPE ? 0 : ETH_HLEN); +} static inline struct vxlanhdr *vxlan_hdr(struct sk_buff *skb) { From 8d01da0a1db237c44c92859ce3612df7af8d3a53 Mon Sep 17 00:00:00 2001 From: Yuanjun Gong Date: Thu, 20 Jul 2023 22:42:08 +0800 Subject: [PATCH 13/49] ethernet: atheros: fix return value check in atl1c_tso_csum() in atl1c_tso_csum, it should check the return value of pskb_trim(), and return an error code if an unexpected value is returned by pskb_trim(). Signed-off-by: Yuanjun Gong Reviewed-by: Simon Horman Signed-off-by: David S. Miller --- drivers/net/ethernet/atheros/atl1c/atl1c_main.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c index 4a288799633f88..940c5d1ff9cfce 100644 --- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c +++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c @@ -2094,8 +2094,11 @@ static int atl1c_tso_csum(struct atl1c_adapter *adapter, real_len = (((unsigned char *)ip_hdr(skb) - skb->data) + ntohs(ip_hdr(skb)->tot_len)); - if (real_len < skb->len) - pskb_trim(skb, real_len); + if (real_len < skb->len) { + err = pskb_trim(skb, real_len); + if (err) + return err; + } hdr_len = skb_tcp_all_headers(skb); if (unlikely(skb->len == hdr_len)) { From 17a0a64448b568442a101de09575f81ffdc45d15 Mon Sep 17 00:00:00 2001 From: Jiri Benc Date: Fri, 21 Jul 2023 16:30:46 +0200 Subject: [PATCH 14/49] vxlan: generalize vxlan_parse_gpe_hdr and remove unused args The vxlan_parse_gpe_hdr function extracts the next protocol value from the GPE header and marks GPE bits as parsed. In order to be used in the next patch, split the function into protocol extraction and bit marking. The bit marking is meaningful only in vxlan_rcv; move it directly there. Rename the function to vxlan_parse_gpe_proto to reflect what it now does. Remove unused arguments skb and vxflags. Move the function earlier in the file to allow it to be called from more places in the next patch. Signed-off-by: Jiri Benc Signed-off-by: David S. Miller --- drivers/net/vxlan/vxlan_core.c | 58 ++++++++++++++++------------------ 1 file changed, 28 insertions(+), 30 deletions(-) diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c index 42be8a26c17156..6d82adf1f8b3d1 100644 --- a/drivers/net/vxlan/vxlan_core.c +++ b/drivers/net/vxlan/vxlan_core.c @@ -623,6 +623,32 @@ static int vxlan_fdb_append(struct vxlan_fdb *f, return 1; } +static bool vxlan_parse_gpe_proto(struct vxlanhdr *hdr, __be16 *protocol) +{ + struct vxlanhdr_gpe *gpe = (struct vxlanhdr_gpe *)hdr; + + /* Need to have Next Protocol set for interfaces in GPE mode. */ + if (!gpe->np_applied) + return false; + /* "The initial version is 0. If a receiver does not support the + * version indicated it MUST drop the packet. + */ + if (gpe->version != 0) + return false; + /* "When the O bit is set to 1, the packet is an OAM packet and OAM + * processing MUST occur." However, we don't implement OAM + * processing, thus drop the packet. + */ + if (gpe->oam_flag) + return false; + + *protocol = tun_p_to_eth_p(gpe->next_protocol); + if (!*protocol) + return false; + + return true; +} + static struct vxlanhdr *vxlan_gro_remcsum(struct sk_buff *skb, unsigned int off, struct vxlanhdr *vh, size_t hdrlen, @@ -1525,35 +1551,6 @@ static void vxlan_parse_gbp_hdr(struct vxlanhdr *unparsed, unparsed->vx_flags &= ~VXLAN_GBP_USED_BITS; } -static bool vxlan_parse_gpe_hdr(struct vxlanhdr *unparsed, - __be16 *protocol, - struct sk_buff *skb, u32 vxflags) -{ - struct vxlanhdr_gpe *gpe = (struct vxlanhdr_gpe *)unparsed; - - /* Need to have Next Protocol set for interfaces in GPE mode. */ - if (!gpe->np_applied) - return false; - /* "The initial version is 0. If a receiver does not support the - * version indicated it MUST drop the packet. - */ - if (gpe->version != 0) - return false; - /* "When the O bit is set to 1, the packet is an OAM packet and OAM - * processing MUST occur." However, we don't implement OAM - * processing, thus drop the packet. - */ - if (gpe->oam_flag) - return false; - - *protocol = tun_p_to_eth_p(gpe->next_protocol); - if (!*protocol) - return false; - - unparsed->vx_flags &= ~VXLAN_GPE_USED_BITS; - return true; -} - static bool vxlan_set_mac(struct vxlan_dev *vxlan, struct vxlan_sock *vs, struct sk_buff *skb, __be32 vni) @@ -1655,8 +1652,9 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff *skb) * used by VXLAN extensions if explicitly requested. */ if (vs->flags & VXLAN_F_GPE) { - if (!vxlan_parse_gpe_hdr(&unparsed, &protocol, skb, vs->flags)) + if (!vxlan_parse_gpe_proto(&unparsed, &protocol)) goto drop; + unparsed.vx_flags &= ~VXLAN_GPE_USED_BITS; raw_proto = true; } From b0b672c4d0957e5897685667fc848132b8bd2d71 Mon Sep 17 00:00:00 2001 From: Jiri Benc Date: Fri, 21 Jul 2023 16:30:47 +0200 Subject: [PATCH 15/49] vxlan: fix GRO with VXLAN-GPE In VXLAN-GPE, there may not be an Ethernet header following the VXLAN header. But in GRO, the vxlan driver calls eth_gro_receive unconditionally, which means the following header is incorrectly parsed as Ethernet. Introduce GPE specific GRO handling. For better performance, do not check for GPE during GRO but rather install a different set of functions at setup time. Fixes: e1e5314de08ba ("vxlan: implement GPE") Reported-by: Paolo Abeni Signed-off-by: Jiri Benc Signed-off-by: David S. Miller --- drivers/net/vxlan/vxlan_core.c | 84 ++++++++++++++++++++++++++++------ 1 file changed, 69 insertions(+), 15 deletions(-) diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c index 6d82adf1f8b3d1..c9a9373733c015 100644 --- a/drivers/net/vxlan/vxlan_core.c +++ b/drivers/net/vxlan/vxlan_core.c @@ -675,26 +675,24 @@ static struct vxlanhdr *vxlan_gro_remcsum(struct sk_buff *skb, return vh; } -static struct sk_buff *vxlan_gro_receive(struct sock *sk, - struct list_head *head, - struct sk_buff *skb) +static struct vxlanhdr *vxlan_gro_prepare_receive(struct sock *sk, + struct list_head *head, + struct sk_buff *skb, + struct gro_remcsum *grc) { - struct sk_buff *pp = NULL; struct sk_buff *p; struct vxlanhdr *vh, *vh2; unsigned int hlen, off_vx; - int flush = 1; struct vxlan_sock *vs = rcu_dereference_sk_user_data(sk); __be32 flags; - struct gro_remcsum grc; - skb_gro_remcsum_init(&grc); + skb_gro_remcsum_init(grc); off_vx = skb_gro_offset(skb); hlen = off_vx + sizeof(*vh); vh = skb_gro_header(skb, hlen, off_vx); if (unlikely(!vh)) - goto out; + return NULL; skb_gro_postpull_rcsum(skb, vh, sizeof(struct vxlanhdr)); @@ -702,12 +700,12 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk, if ((flags & VXLAN_HF_RCO) && (vs->flags & VXLAN_F_REMCSUM_RX)) { vh = vxlan_gro_remcsum(skb, off_vx, vh, sizeof(struct vxlanhdr), - vh->vx_vni, &grc, + vh->vx_vni, grc, !!(vs->flags & VXLAN_F_REMCSUM_NOPARTIAL)); if (!vh) - goto out; + return NULL; } skb_gro_pull(skb, sizeof(struct vxlanhdr)); /* pull vxlan header */ @@ -724,12 +722,48 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk, } } - pp = call_gro_receive(eth_gro_receive, head, skb); - flush = 0; + return vh; +} + +static struct sk_buff *vxlan_gro_receive(struct sock *sk, + struct list_head *head, + struct sk_buff *skb) +{ + struct sk_buff *pp = NULL; + struct gro_remcsum grc; + int flush = 1; -out: + if (vxlan_gro_prepare_receive(sk, head, skb, &grc)) { + pp = call_gro_receive(eth_gro_receive, head, skb); + flush = 0; + } skb_gro_flush_final_remcsum(skb, pp, flush, &grc); + return pp; +} +static struct sk_buff *vxlan_gpe_gro_receive(struct sock *sk, + struct list_head *head, + struct sk_buff *skb) +{ + const struct packet_offload *ptype; + struct sk_buff *pp = NULL; + struct gro_remcsum grc; + struct vxlanhdr *vh; + __be16 protocol; + int flush = 1; + + vh = vxlan_gro_prepare_receive(sk, head, skb, &grc); + if (vh) { + if (!vxlan_parse_gpe_proto(vh, &protocol)) + goto out; + ptype = gro_find_receive_by_type(protocol); + if (!ptype) + goto out; + pp = call_gro_receive(ptype->callbacks.gro_receive, head, skb); + flush = 0; + } +out: + skb_gro_flush_final_remcsum(skb, pp, flush, &grc); return pp; } @@ -741,6 +775,21 @@ static int vxlan_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff) return eth_gro_complete(skb, nhoff + sizeof(struct vxlanhdr)); } +static int vxlan_gpe_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff) +{ + struct vxlanhdr *vh = (struct vxlanhdr *)(skb->data + nhoff); + const struct packet_offload *ptype; + int err = -ENOSYS; + __be16 protocol; + + if (!vxlan_parse_gpe_proto(vh, &protocol)) + return err; + ptype = gro_find_complete_by_type(protocol); + if (ptype) + err = ptype->callbacks.gro_complete(skb, nhoff + sizeof(struct vxlanhdr)); + return err; +} + static struct vxlan_fdb *vxlan_fdb_alloc(struct vxlan_dev *vxlan, const u8 *mac, __u16 state, __be32 src_vni, __u16 ndm_flags) @@ -3376,8 +3425,13 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, bool ipv6, tunnel_cfg.encap_rcv = vxlan_rcv; tunnel_cfg.encap_err_lookup = vxlan_err_lookup; tunnel_cfg.encap_destroy = NULL; - tunnel_cfg.gro_receive = vxlan_gro_receive; - tunnel_cfg.gro_complete = vxlan_gro_complete; + if (vs->flags & VXLAN_F_GPE) { + tunnel_cfg.gro_receive = vxlan_gpe_gro_receive; + tunnel_cfg.gro_complete = vxlan_gpe_gro_complete; + } else { + tunnel_cfg.gro_receive = vxlan_gro_receive; + tunnel_cfg.gro_complete = vxlan_gro_complete; + } setup_udp_tunnel_sock(net, sock, &tunnel_cfg); From ed96824b71ed67664390890441b229423a25317f Mon Sep 17 00:00:00 2001 From: Yuanjun Gong Date: Sat, 22 Jul 2023 22:25:11 +0800 Subject: [PATCH 16/49] atheros: fix return value check in atl1_tso() in atl1_tso(), it should check the return value of pskb_trim(), and return an error code if an unexpected value is returned by pskb_trim(). Fixes: 401c0aabec4b ("atl1: simplify tx packet descriptor") Signed-off-by: Yuanjun Gong Link: https://lore.kernel.org/r/20230722142511.12448-1-ruc_gongyuanjun@163.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/atheros/atlx/atl1.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/atheros/atlx/atl1.c b/drivers/net/ethernet/atheros/atlx/atl1.c index c8444bcdf52707..02aa6fd8ebc2d4 100644 --- a/drivers/net/ethernet/atheros/atlx/atl1.c +++ b/drivers/net/ethernet/atheros/atlx/atl1.c @@ -2113,8 +2113,11 @@ static int atl1_tso(struct atl1_adapter *adapter, struct sk_buff *skb, real_len = (((unsigned char *)iph - skb->data) + ntohs(iph->tot_len)); - if (real_len < skb->len) - pskb_trim(skb, real_len); + if (real_len < skb->len) { + err = pskb_trim(skb, real_len); + if (err) + return err; + } hdr_len = skb_tcp_all_headers(skb); if (skb->len == hdr_len) { iph->check = 0; From 69a184f7a372aac588babfb0bd681aaed9779f5b Mon Sep 17 00:00:00 2001 From: Yuanjun Gong Date: Thu, 20 Jul 2023 22:42:19 +0800 Subject: [PATCH 17/49] ethernet: atheros: fix return value check in atl1e_tso_csum() in atl1e_tso_csum, it should check the return value of pskb_trim(), and return an error code if an unexpected value is returned by pskb_trim(). Fixes: a6a5325239c2 ("atl1e: Atheros L1E Gigabit Ethernet driver") Signed-off-by: Yuanjun Gong Reviewed-by: Simon Horman Link: https://lore.kernel.org/r/20230720144219.39285-1-ruc_gongyuanjun@163.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/atheros/atl1e/atl1e_main.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c index 5db0f3495a32e6..5935be190b9e22 100644 --- a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c +++ b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c @@ -1641,8 +1641,11 @@ static int atl1e_tso_csum(struct atl1e_adapter *adapter, real_len = (((unsigned char *)ip_hdr(skb) - skb->data) + ntohs(ip_hdr(skb)->tot_len)); - if (real_len < skb->len) - pskb_trim(skb, real_len); + if (real_len < skb->len) { + err = pskb_trim(skb, real_len); + if (err) + return err; + } hdr_len = skb_tcp_all_headers(skb); if (unlikely(skb->len == hdr_len)) { From 69172f0bcb6a09110c5d2a6d792627f5095a9018 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20=C5=BBenczykowski?= Date: Thu, 20 Jul 2023 09:00:22 -0700 Subject: [PATCH 18/49] ipv6 addrconf: fix bug where deleting a mngtmpaddr can create a new temporary address MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit currently on 6.4 net/main: # ip link add dummy1 type dummy # echo 1 > /proc/sys/net/ipv6/conf/dummy1/use_tempaddr # ip link set dummy1 up # ip -6 addr add 2000::1/64 mngtmpaddr dev dummy1 # ip -6 addr show dev dummy1 11: dummy1: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 inet6 2000::44f3:581c:8ca:3983/64 scope global temporary dynamic valid_lft 604800sec preferred_lft 86172sec inet6 2000::1/64 scope global mngtmpaddr valid_lft forever preferred_lft forever inet6 fe80::e8a8:a6ff:fed5:56d4/64 scope link valid_lft forever preferred_lft forever # ip -6 addr del 2000::44f3:581c:8ca:3983/64 dev dummy1 (can wait a few seconds if you want to, the above delete isn't [directly] the problem) # ip -6 addr show dev dummy1 11: dummy1: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 inet6 2000::1/64 scope global mngtmpaddr valid_lft forever preferred_lft forever inet6 fe80::e8a8:a6ff:fed5:56d4/64 scope link valid_lft forever preferred_lft forever # ip -6 addr del 2000::1/64 mngtmpaddr dev dummy1 # ip -6 addr show dev dummy1 11: dummy1: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 inet6 2000::81c9:56b7:f51a:b98f/64 scope global temporary dynamic valid_lft 604797sec preferred_lft 86169sec inet6 fe80::e8a8:a6ff:fed5:56d4/64 scope link valid_lft forever preferred_lft forever This patch prevents this new 'global temporary dynamic' address from being created by the deletion of the related (same subnet prefix) 'mngtmpaddr' (which is triggered by there already being no temporary addresses). Cc: Jiri Pirko Fixes: 53bd67491537 ("ipv6 addrconf: introduce IFA_F_MANAGETEMPADDR to tell kernel to manage temporary addresses") Reported-by: Xiao Ma Signed-off-by: Maciej Żenczykowski Reviewed-by: David Ahern Link: https://lore.kernel.org/r/20230720160022.1887942-1-maze@google.com Signed-off-by: Jakub Kicinski --- net/ipv6/addrconf.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index e5213e598a0408..94cec2075eee84 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2561,12 +2561,18 @@ static void manage_tempaddrs(struct inet6_dev *idev, ipv6_ifa_notify(0, ift); } - if ((create || list_empty(&idev->tempaddr_list)) && - idev->cnf.use_tempaddr > 0) { + /* Also create a temporary address if it's enabled but no temporary + * address currently exists. + * However, we get called with valid_lft == 0, prefered_lft == 0, create == false + * as part of cleanup (ie. deleting the mngtmpaddr). + * We don't want that to result in creating a new temporary ip address. + */ + if (list_empty(&idev->tempaddr_list) && (valid_lft || prefered_lft)) + create = true; + + if (create && idev->cnf.use_tempaddr > 0) { /* When a new public address is created as described * in [ADDRCONF], also create a new temporary address. - * Also create a temporary address if it's enabled but - * no temporary address currently exists. */ read_unlock_bh(&idev->lock); ipv6_create_tempaddr(ifp, false); From bb7a0156365dffe2fcd63e2051145fbe4f8908b4 Mon Sep 17 00:00:00 2001 From: Wei Fang Date: Fri, 21 Jul 2023 16:35:59 +0800 Subject: [PATCH 19/49] net: fec: avoid tx queue timeout when XDP is enabled According to the implementation of XDP of FEC driver, the XDP path shares the transmit queues with the kernel network stack, so it is possible to lead to a tx timeout event when XDP uses the tx queue pretty much exclusively. And this event will cause the reset of the FEC hardware. To avoid timeout in this case, we use the txq_trans_cond_update() interface to update txq->trans_start to jiffies so that watchdog won't generate a transmit timeout warning. Fixes: 6d6b39f180b8 ("net: fec: add initial XDP support") Signed-off-by: Wei Fang Link: https://lore.kernel.org/r/20230721083559.2857312-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/freescale/fec_main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c index ec9e4bdb0c06be..073d616193363b 100644 --- a/drivers/net/ethernet/freescale/fec_main.c +++ b/drivers/net/ethernet/freescale/fec_main.c @@ -3916,6 +3916,8 @@ static int fec_enet_xdp_xmit(struct net_device *dev, __netif_tx_lock(nq, cpu); + /* Avoid tx timeout as XDP shares the queue with kernel stack */ + txq_trans_cond_update(nq); for (i = 0; i < num_frames; i++) { if (fec_enet_txq_xmit_frame(fep, txq, frames[i]) < 0) break; From d11b0df7ddf1831f3e170972f43186dad520bfcc Mon Sep 17 00:00:00 2001 From: Stewart Smith Date: Fri, 21 Jul 2023 15:24:10 -0700 Subject: [PATCH 20/49] tcp: Reduce chance of collisions in inet6_hashfn(). For both IPv4 and IPv6 incoming TCP connections are tracked in a hash table with a hash over the source & destination addresses and ports. However, the IPv6 hash is insufficient and can lead to a high rate of collisions. The IPv6 hash used an XOR to fit everything into the 96 bits for the fast jenkins hash, meaning it is possible for an external entity to ensure the hash collides, thus falling back to a linear search in the bucket, which is slow. We take the approach of hash the full length of IPv6 address in __ipv6_addr_jhash() so that all users can benefit from a more secure version. While this may look like it adds overhead, the reality of modern CPUs means that this is unmeasurable in real world scenarios. In simulating with llvm-mca, the increase in cycles for the hashing code was ~16 cycles on Skylake (from a base of ~155), and an extra ~9 on Nehalem (base of ~173). In commit dd6d2910c5e0 ("netfilter: conntrack: switch to siphash") netfilter switched from a jenkins hash to a siphash, but even the faster hsiphash is a more significant overhead (~20-30%) in some preliminary testing. So, in this patch, we keep to the more conservative approach to ensure we don't add much overhead per SYN. In testing, this results in a consistently even spread across the connection buckets. In both testing and real-world scenarios, we have not found any measurable performance impact. Fixes: 08dcdbf6a7b9 ("ipv6: use a stronger hash for tcp") Signed-off-by: Stewart Smith Signed-off-by: Samuel Mendoza-Jonas Suggested-by: Eric Dumazet Signed-off-by: Kuniyuki Iwashima Reviewed-by: Eric Dumazet Link: https://lore.kernel.org/r/20230721222410.17914-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski --- include/net/ipv6.h | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/include/net/ipv6.h b/include/net/ipv6.h index 7332296eca44b8..2acc4c808d45d1 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -752,12 +752,8 @@ static inline u32 ipv6_addr_hash(const struct in6_addr *a) /* more secured version of ipv6_addr_hash() */ static inline u32 __ipv6_addr_jhash(const struct in6_addr *a, const u32 initval) { - u32 v = (__force u32)a->s6_addr32[0] ^ (__force u32)a->s6_addr32[1]; - - return jhash_3words(v, - (__force u32)a->s6_addr32[2], - (__force u32)a->s6_addr32[3], - initval); + return jhash2((__force const u32 *)a->s6_addr32, + ARRAY_SIZE(a->s6_addr32), initval); } static inline bool ipv6_addr_loopback(const struct in6_addr *a) From a3336056504d780590ac6d6ac94fbba829994594 Mon Sep 17 00:00:00 2001 From: Jedrzej Jagielski Date: Fri, 21 Jul 2023 08:58:54 -0700 Subject: [PATCH 21/49] ice: Fix memory management in ice_ethtool_fdir.c Fix ethtool FDIR logic to not use memory after its release. In the ice_ethtool_fdir.c file there are 2 spots where code can refer to pointers which may be missing. In the ice_cfg_fdir_xtrct_seq() function seg may be freed but even then may be still used by memcpy(&tun_seg[1], seg, sizeof(*seg)). In the ice_add_fdir_ethtool() function struct ice_fdir_fltr *input may first fail to be added via ice_fdir_update_list_entry() but then may be deleted by ice_fdir_update_list_entry. Terminate in both cases when the returned value of the previous operation is other than 0, free memory and don't use it anymore. Reported-by: Michal Schmidt Link: https://bugzilla.redhat.com/show_bug.cgi?id=2208423 Fixes: cac2a27cd9ab ("ice: Support IPv4 Flow Director filters") Reviewed-by: Przemek Kitszel Signed-off-by: Jedrzej Jagielski Reviewed-by: Leon Romanovsky Tested-by: Pucha Himasekhar Reddy (A Contingent worker at Intel) Signed-off-by: Tony Nguyen Link: https://lore.kernel.org/r/20230721155854.1292805-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski --- .../net/ethernet/intel/ice/ice_ethtool_fdir.c | 26 ++++++++++--------- 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool_fdir.c b/drivers/net/ethernet/intel/ice/ice_ethtool_fdir.c index ead6d50fc0adca..8c6e13f87b7d3f 100644 --- a/drivers/net/ethernet/intel/ice/ice_ethtool_fdir.c +++ b/drivers/net/ethernet/intel/ice/ice_ethtool_fdir.c @@ -1281,16 +1281,21 @@ ice_cfg_fdir_xtrct_seq(struct ice_pf *pf, struct ethtool_rx_flow_spec *fsp, ICE_FLOW_FLD_OFF_INVAL); } - /* add filter for outer headers */ fltr_idx = ice_ethtool_flow_to_fltr(fsp->flow_type & ~FLOW_EXT); + + assign_bit(fltr_idx, hw->fdir_perfect_fltr, perfect_filter); + + /* add filter for outer headers */ ret = ice_fdir_set_hw_fltr_rule(pf, seg, fltr_idx, ICE_FD_HW_SEG_NON_TUN); - if (ret == -EEXIST) - /* Rule already exists, free memory and continue */ - devm_kfree(dev, seg); - else if (ret) + if (ret == -EEXIST) { + /* Rule already exists, free memory and count as success */ + ret = 0; + goto err_exit; + } else if (ret) { /* could not write filter, free memory */ goto err_exit; + } /* make tunneled filter HW entries if possible */ memcpy(&tun_seg[1], seg, sizeof(*seg)); @@ -1305,18 +1310,13 @@ ice_cfg_fdir_xtrct_seq(struct ice_pf *pf, struct ethtool_rx_flow_spec *fsp, devm_kfree(dev, tun_seg); } - if (perfect_filter) - set_bit(fltr_idx, hw->fdir_perfect_fltr); - else - clear_bit(fltr_idx, hw->fdir_perfect_fltr); - return ret; err_exit: devm_kfree(dev, tun_seg); devm_kfree(dev, seg); - return -EOPNOTSUPP; + return ret; } /** @@ -1914,7 +1914,9 @@ int ice_add_fdir_ethtool(struct ice_vsi *vsi, struct ethtool_rxnfc *cmd) input->comp_report = ICE_FXD_FLTR_QW0_COMP_REPORT_SW_FAIL; /* input struct is added to the HW filter list */ - ice_fdir_update_list_entry(pf, input, fsp->location); + ret = ice_fdir_update_list_entry(pf, input, fsp->location); + if (ret) + goto release_lock; ret = ice_fdir_write_all_fltr(pf, input, true); if (ret) From da19a2b967cf1e2c426f50d28550d1915214a81d Mon Sep 17 00:00:00 2001 From: Hangbin Liu Date: Fri, 21 Jul 2023 12:03:55 +0800 Subject: [PATCH 22/49] bonding: reset bond's flags when down link is P2P device When adding a point to point downlink to the bond, we neglected to reset the bond's flags, which were still using flags like BROADCAST and MULTICAST. Consequently, this would initiate ARP/DAD for P2P downlink interfaces, such as when adding a GRE device to the bonding. To address this issue, let's reset the bond's flags for P2P interfaces. Before fix: 7: gre0@NONE: mtu 1500 qdisc noqueue master bond0 state UNKNOWN group default qlen 1000 link/gre6 2006:70:10::1 peer 2006:70:10::2 permaddr 167f:18:f188:: 8: bond0: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/gre6 2006:70:10::1 brd 2006:70:10::2 inet6 fe80::200:ff:fe00:0/64 scope link valid_lft forever preferred_lft forever After fix: 7: gre0@NONE: mtu 1500 qdisc noqueue master bond2 state UNKNOWN group default qlen 1000 link/gre6 2006:70:10::1 peer 2006:70:10::2 permaddr c29e:557a:e9d9:: 8: bond0: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/gre6 2006:70:10::1 peer 2006:70:10::2 inet6 fe80::1/64 scope link valid_lft forever preferred_lft forever Reported-by: Liang Li Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2221438 Fixes: 872254dd6b1f ("net/bonding: Enable bonding to enslave non ARPHRD_ETHER") Signed-off-by: Hangbin Liu Signed-off-by: Paolo Abeni --- drivers/net/bonding/bond_main.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 7a0f25301f7ec2..484c9e3e5e8252 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1508,6 +1508,11 @@ static void bond_setup_by_slave(struct net_device *bond_dev, memcpy(bond_dev->broadcast, slave_dev->broadcast, slave_dev->addr_len); + + if (slave_dev->flags & IFF_POINTOPOINT) { + bond_dev->flags &= ~(IFF_BROADCAST | IFF_MULTICAST); + bond_dev->flags |= (IFF_POINTOPOINT | IFF_NOARP); + } } /* On bonding slaves other than the currently active slave, suppress From fa532bee17d15acf8bba4bc8e2062b7a093ba801 Mon Sep 17 00:00:00 2001 From: Hangbin Liu Date: Fri, 21 Jul 2023 12:03:56 +0800 Subject: [PATCH 23/49] team: reset team's flags when down link is P2P device When adding a point to point downlink to team device, we neglected to reset the team's flags, which were still using flags like BROADCAST and MULTICAST. Consequently, this would initiate ARP/DAD for P2P downlink interfaces, such as when adding a GRE device to team device. Fix this by remove multicast/broadcast flags and add p2p and noarp flags. After removing the none ethernet interface and adding an ethernet interface to team, we need to reset team interface flags. Unlike bonding interface, team do not need restore IFF_MASTER, IFF_SLAVE flags. Reported-by: Liang Li Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2221438 Fixes: 1d76efe1577b ("team: add support for non-ethernet devices") Signed-off-by: Hangbin Liu Signed-off-by: Paolo Abeni --- drivers/net/team/team.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c index 555b0b1e9a7893..d3dc22509ea582 100644 --- a/drivers/net/team/team.c +++ b/drivers/net/team/team.c @@ -2135,6 +2135,15 @@ static void team_setup_by_port(struct net_device *dev, dev->mtu = port_dev->mtu; memcpy(dev->broadcast, port_dev->broadcast, port_dev->addr_len); eth_hw_addr_inherit(dev, port_dev); + + if (port_dev->flags & IFF_POINTOPOINT) { + dev->flags &= ~(IFF_BROADCAST | IFF_MULTICAST); + dev->flags |= (IFF_POINTOPOINT | IFF_NOARP); + } else if ((port_dev->flags & (IFF_BROADCAST | IFF_MULTICAST)) == + (IFF_BROADCAST | IFF_MULTICAST)) { + dev->flags |= (IFF_BROADCAST | IFF_MULTICAST); + dev->flags &= ~(IFF_POINTOPOINT | IFF_NOARP); + } } static int team_dev_type_check_change(struct net_device *dev, From 4e62c99d71e56817c934caa2a709a775c8cee078 Mon Sep 17 00:00:00 2001 From: Suman Ghosh Date: Fri, 21 Jul 2023 11:42:22 +0530 Subject: [PATCH 24/49] octeontx2-af: Fix hash extraction enable configuration As of today, hash extraction support is enabled for all the silicons. Because of which we are facing initialization issues when the silicon does not support hash extraction. During creation of the hardware parsing table for IPv6 address, we need to consider if hash extraction is enabled then extract only 32 bit, otherwise 128 bit needs to be extracted. This patch fixes the issue and configures the hardware parser based on the availability of the feature. Fixes: a95ab93550d3 ("octeontx2-af: Use hashed field in MCAM key") Signed-off-by: Suman Ghosh Reviewed-by: Simon Horman Link: https://lore.kernel.org/r/20230721061222.2632521-1-sumang@marvell.com Signed-off-by: Paolo Abeni --- .../marvell/octeontx2/af/rvu_npc_hash.c | 43 ++++++++++++++++++- .../marvell/octeontx2/af/rvu_npc_hash.h | 8 ++-- 2 files changed, 46 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_hash.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_hash.c index 6fe67f3a7f6f18..7e20282c12d00b 100644 --- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_hash.c +++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_hash.c @@ -218,13 +218,54 @@ void npc_config_secret_key(struct rvu *rvu, int blkaddr) void npc_program_mkex_hash(struct rvu *rvu, int blkaddr) { + struct npc_mcam_kex_hash *mh = rvu->kpu.mkex_hash; struct hw_cap *hwcap = &rvu->hw->cap; + u8 intf, ld, hdr_offset, byte_len; struct rvu_hwinfo *hw = rvu->hw; - u8 intf; + u64 cfg; + /* Check if hardware supports hash extraction */ if (!hwcap->npc_hash_extract) return; + /* Check if IPv6 source/destination address + * should be hash enabled. + * Hashing reduces 128bit SIP/DIP fields to 32bit + * so that 224 bit X2 key can be used for IPv6 based filters as well, + * which in turn results in more number of MCAM entries available for + * use. + * + * Hashing of IPV6 SIP/DIP is enabled in below scenarios + * 1. If the silicon variant supports hashing feature + * 2. If the number of bytes of IP addr being extracted is 4 bytes ie + * 32bit. The assumption here is that if user wants 8bytes of LSB of + * IP addr or full 16 bytes then his intention is not to use 32bit + * hash. + */ + for (intf = 0; intf < hw->npc_intfs; intf++) { + for (ld = 0; ld < NPC_MAX_LD; ld++) { + cfg = rvu_read64(rvu, blkaddr, + NPC_AF_INTFX_LIDX_LTX_LDX_CFG(intf, + NPC_LID_LC, + NPC_LT_LC_IP6, + ld)); + hdr_offset = FIELD_GET(NPC_HDR_OFFSET, cfg); + byte_len = FIELD_GET(NPC_BYTESM, cfg); + /* Hashing of IPv6 source/destination address should be + * enabled if, + * hdr_offset == 8 (offset of source IPv6 address) or + * hdr_offset == 24 (offset of destination IPv6) + * address) and the number of byte to be + * extracted is 4. As per hardware configuration + * byte_len should be == actual byte_len - 1. + * Hence byte_len is checked against 3 but nor 4. + */ + if ((hdr_offset == 8 || hdr_offset == 24) && byte_len == 3) + mh->lid_lt_ld_hash_en[intf][NPC_LID_LC][NPC_LT_LC_IP6][ld] = true; + } + } + + /* Update hash configuration if the field is hash enabled */ for (intf = 0; intf < hw->npc_intfs; intf++) { npc_program_mkex_hash_rx(rvu, blkaddr, intf); npc_program_mkex_hash_tx(rvu, blkaddr, intf); diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_hash.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_hash.h index a1c3d987b8044e..57a09328d46b5a 100644 --- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_hash.h +++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_hash.h @@ -70,8 +70,8 @@ static struct npc_mcam_kex_hash npc_mkex_hash_default __maybe_unused = { [NIX_INTF_RX] = { [NPC_LID_LC] = { [NPC_LT_LC_IP6] = { - true, - true, + false, + false, }, }, }, @@ -79,8 +79,8 @@ static struct npc_mcam_kex_hash npc_mkex_hash_default __maybe_unused = { [NIX_INTF_TX] = { [NPC_LID_LC] = { [NPC_LT_LC_IP6] = { - true, - true, + false, + false, }, }, }, From 284779dbf4e98753458708783af8c35630674a21 Mon Sep 17 00:00:00 2001 From: Vincent Whitchurch Date: Fri, 21 Jul 2023 15:39:20 +0200 Subject: [PATCH 25/49] net: stmmac: Apply redundant write work around on 4.xx too commit a3a57bf07de23fe1ff779e0fdf710aa581c3ff73 ("net: stmmac: work around sporadic tx issue on link-up") worked around a problem with TX sometimes not working after a link-up by avoiding a redundant write to MAC_CTRL_REG (aka GMAC_CONFIG), since the IP appeared to have problems with handling multiple writes to that register in some cases. That commit however only added the work around to dwmac_lib.c (apart from the common code in stmmac_main.c), but my systems with version 4.21a of the IP exhibit the same problem, so add the work around to dwmac4_lib.c too. Fixes: a3a57bf07de2 ("net: stmmac: work around sporadic tx issue on link-up") Signed-off-by: Vincent Whitchurch Reviewed-by: Simon Horman Link: https://lore.kernel.org/r/20230721-stmmac-tx-workaround-v1-1-9411cbd5ee07@axis.com Signed-off-by: Paolo Abeni --- drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c index df41eac54058f7..03ceb6a940732d 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c @@ -240,13 +240,15 @@ void stmmac_dwmac4_set_mac_addr(void __iomem *ioaddr, const u8 addr[6], void stmmac_dwmac4_set_mac(void __iomem *ioaddr, bool enable) { u32 value = readl(ioaddr + GMAC_CONFIG); + u32 old_val = value; if (enable) value |= GMAC_CONFIG_RE | GMAC_CONFIG_TE; else value &= ~(GMAC_CONFIG_TE | GMAC_CONFIG_RE); - writel(value, ioaddr + GMAC_CONFIG); + if (value != old_val) + writel(value, ioaddr + GMAC_CONFIG); } void stmmac_dwmac4_get_mac_addr(void __iomem *ioaddr, unsigned char *addr, From 55cef78c244d0d076f5a75a35530ca63c92f4426 Mon Sep 17 00:00:00 2001 From: Lin Ma Date: Sun, 23 Jul 2023 16:02:05 +0800 Subject: [PATCH 26/49] macvlan: add forgotten nla_policy for IFLA_MACVLAN_BC_CUTOFF The previous commit 954d1fa1ac93 ("macvlan: Add netlink attribute for broadcast cutoff") added one additional attribute named IFLA_MACVLAN_BC_CUTOFF to allow broadcast cutfoff. However, it forgot to describe the nla_policy at macvlan_policy (drivers/net/macvlan.c). Hence, this suppose NLA_S32 (4 bytes) integer can be faked as empty (0 bytes) by a malicious user, which could leads to OOB in heap just like CVE-2023-3773. To fix it, this commit just completes the nla_policy description for IFLA_MACVLAN_BC_CUTOFF. This enforces the length check and avoids the potential OOB read. Fixes: 954d1fa1ac93 ("macvlan: Add netlink attribute for broadcast cutoff") Signed-off-by: Lin Ma Reviewed-by: Simon Horman Link: https://lore.kernel.org/r/20230723080205.3715164-1-linma@zju.edu.cn Signed-off-by: Jakub Kicinski --- drivers/net/macvlan.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 4a53debf9d7c48..ed908165a8b4ea 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -1746,6 +1746,7 @@ static const struct nla_policy macvlan_policy[IFLA_MACVLAN_MAX + 1] = { [IFLA_MACVLAN_MACADDR_COUNT] = { .type = NLA_U32 }, [IFLA_MACVLAN_BC_QUEUE_LEN] = { .type = NLA_U32 }, [IFLA_MACVLAN_BC_QUEUE_LEN_USED] = { .type = NLA_REJECT }, + [IFLA_MACVLAN_BC_CUTOFF] = { .type = NLA_S32 }, }; int macvlan_link_register(struct rtnl_link_ops *ops) From 06d4c8a80836d76ffec4f71ddb3de94ee93ab40d Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Mon, 24 Jul 2023 14:34:24 -0700 Subject: [PATCH 27/49] af_unix: Fix fortify_panic() in unix_bind_bsd(). syzkaller found a bug in unix_bind_bsd() [0]. We can reproduce it by bind()ing a socket on a path with length 108. 108 is the size of sun_addr of struct sockaddr_un and is the maximum valid length for the pathname socket. When calling bind(), we use struct sockaddr_storage as the actual buffer size, so terminating sun_addr[108] with null is legitimate as done in unix_mkname_bsd(). However, strlen(sunaddr) for such a case causes fortify_panic() if CONFIG_FORTIFY_SOURCE=y. __fortify_strlen() has no idea about the actual buffer size and see the string as unterminated. Let's use strnlen() to allow sun_addr to be unterminated at 107. [0]: detected buffer overflow in __fortify_strlen kernel BUG at lib/string_helpers.c:1031! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 255 Comm: syz-executor296 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #4 Hardware name: linux,dummy-virt (DT) pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : fortify_panic+0x1c/0x20 lib/string_helpers.c:1030 lr : fortify_panic+0x1c/0x20 lib/string_helpers.c:1030 sp : ffff800089817af0 x29: ffff800089817af0 x28: ffff800089817b40 x27: 1ffff00011302f68 x26: 000000000000006e x25: 0000000000000012 x24: ffff800087e60140 x23: dfff800000000000 x22: ffff800089817c20 x21: ffff800089817c8e x20: 000000000000006c x19: ffff00000c323900 x18: ffff800086ab1630 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000001 x14: 1ffff00011302eb8 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : 64a26b65474d2a00 x8 : 64a26b65474d2a00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff800089817438 x4 : ffff800086ac99e0 x3 : ffff800080f19e8c x2 : 0000000000000001 x1 : 0000000100000000 x0 : 000000000000002c Call trace: fortify_panic+0x1c/0x20 lib/string_helpers.c:1030 _Z16__fortify_strlenPKcU25pass_dynamic_object_size1 include/linux/fortify-string.h:217 [inline] unix_bind_bsd net/unix/af_unix.c:1212 [inline] unix_bind+0xba8/0xc58 net/unix/af_unix.c:1326 __sys_bind+0x1ac/0x248 net/socket.c:1792 __do_sys_bind net/socket.c:1803 [inline] __se_sys_bind net/socket.c:1801 [inline] __arm64_sys_bind+0x7c/0x94 net/socket.c:1801 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall+0x98/0x2c0 arch/arm64/kernel/syscall.c:52 el0_svc_common+0x134/0x240 arch/arm64/kernel/syscall.c:139 do_el0_svc+0x64/0x198 arch/arm64/kernel/syscall.c:188 el0_svc+0x2c/0x7c arch/arm64/kernel/entry-common.c:647 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:665 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591 Code: aa0003e1 d0000e80 91030000 97ffc91a (d4210000) Fixes: df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3") Reported-by: syzkaller Suggested-by: Kees Cook Signed-off-by: Kuniyuki Iwashima Link: https://lore.kernel.org/r/20230724213425.22920-2-kuniyu@amazon.com Reviewed-by: Simon Horman Reviewed-by: Kees Cook Signed-off-by: Jakub Kicinski --- net/unix/af_unix.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 123b35ddfd71fa..bbacf4c60fe3ac 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -1208,10 +1208,8 @@ static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr, struct path parent; int err; - unix_mkname_bsd(sunaddr, addr_len); - addr_len = strlen(sunaddr->sun_path) + - offsetof(struct sockaddr_un, sun_path) + 1; - + addr_len = strnlen(sunaddr->sun_path, sizeof(sunaddr->sun_path)) + + offsetof(struct sockaddr_un, sun_path) + 1; addr = unix_create_addr(sunaddr, addr_len); if (!addr) return -ENOMEM; From a0ade8404c3bc2bf2631cb0f20d372eed22d9d96 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Mon, 24 Jul 2023 14:34:25 -0700 Subject: [PATCH 28/49] af_packet: Fix warning of fortified memcpy() in packet_getname(). syzkaller found a warning in packet_getname() [0], where we try to copy 16 bytes to sockaddr_ll.sll_addr[8]. Some devices (ip6gre, vti6, ip6tnl) have 16 bytes address expressed by struct in6_addr. Also, Infiniband has 32 bytes as MAX_ADDR_LEN. The write seems to overflow, but actually not since we use struct sockaddr_storage defined in __sys_getsockname() and its size is 128 (_K_SS_MAXSIZE) bytes. Thus, we have sufficient room after sll_addr[] as __data[]. To avoid the warning, let's add a flex array member union-ed with sll_addr. Another option would be to use strncpy() and limit the copied length to sizeof(sll_addr), but it will return the partial address and break an application that passes sockaddr_storage to getsockname(). [0]: memcpy: detected field-spanning write (size 16) of single field "sll->sll_addr" at net/packet/af_packet.c:3604 (size 8) WARNING: CPU: 0 PID: 255 at net/packet/af_packet.c:3604 packet_getname+0x25c/0x3a0 net/packet/af_packet.c:3604 Modules linked in: CPU: 0 PID: 255 Comm: syz-executor750 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #4 Hardware name: linux,dummy-virt (DT) pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : packet_getname+0x25c/0x3a0 net/packet/af_packet.c:3604 lr : packet_getname+0x25c/0x3a0 net/packet/af_packet.c:3604 sp : ffff800089887bc0 x29: ffff800089887bc0 x28: ffff000010f80f80 x27: 0000000000000003 x26: dfff800000000000 x25: ffff700011310f80 x24: ffff800087d55000 x23: dfff800000000000 x22: ffff800089887c2c x21: 0000000000000010 x20: ffff00000de08310 x19: ffff800089887c20 x18: ffff800086ab1630 x17: 20646c6569662065 x16: 6c676e697320666f x15: 0000000000000001 x14: 1fffe0000d56d7ca x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : 3e60944c3da92b00 x8 : 3e60944c3da92b00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8000898874f8 x4 : ffff800086ac99e0 x3 : ffff8000803f8808 x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000000 Call trace: packet_getname+0x25c/0x3a0 net/packet/af_packet.c:3604 __sys_getsockname+0x168/0x24c net/socket.c:2042 __do_sys_getsockname net/socket.c:2057 [inline] __se_sys_getsockname net/socket.c:2054 [inline] __arm64_sys_getsockname+0x7c/0x94 net/socket.c:2054 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall+0x98/0x2c0 arch/arm64/kernel/syscall.c:52 el0_svc_common+0x134/0x240 arch/arm64/kernel/syscall.c:139 do_el0_svc+0x64/0x198 arch/arm64/kernel/syscall.c:188 el0_svc+0x2c/0x7c arch/arm64/kernel/entry-common.c:647 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:665 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591 Fixes: df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3") Reported-by: syzkaller Suggested-by: Kees Cook Signed-off-by: Kuniyuki Iwashima Link: https://lore.kernel.org/r/20230724213425.22920-3-kuniyu@amazon.com Reviewed-by: Simon Horman Reviewed-by: Kees Cook Signed-off-by: Jakub Kicinski --- include/uapi/linux/if_packet.h | 6 +++++- net/packet/af_packet.c | 2 +- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h index 9efc42382fdb98..4d0ad22f83b56b 100644 --- a/include/uapi/linux/if_packet.h +++ b/include/uapi/linux/if_packet.h @@ -18,7 +18,11 @@ struct sockaddr_ll { unsigned short sll_hatype; unsigned char sll_pkttype; unsigned char sll_halen; - unsigned char sll_addr[8]; + union { + unsigned char sll_addr[8]; + /* Actual length is in sll_halen. */ + __DECLARE_FLEX_ARRAY(unsigned char, sll_addr_flex); + }; }; /* Packet types */ diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 85ff90a03b0c34..8e3ddec4c3d57e 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -3601,7 +3601,7 @@ static int packet_getname(struct socket *sock, struct sockaddr *uaddr, if (dev) { sll->sll_hatype = dev->type; sll->sll_halen = dev->addr_len; - memcpy(sll->sll_addr, dev->dev_addr, dev->addr_len); + memcpy(sll->sll_addr_flex, dev->dev_addr, dev->addr_len); } else { sll->sll_hatype = 0; /* Bad: we have no ARPHRD_UNSPEC */ sll->sll_halen = 0; From e11ec2b868af2b351c6c1e2e50eb711cc5423a10 Mon Sep 17 00:00:00 2001 From: Alex Elder Date: Mon, 24 Jul 2023 17:40:55 -0500 Subject: [PATCH 29/49] net: ipa: only reset hashed tables when supported Last year, the code that manages GSI channel transactions switched from using spinlock-protected linked lists to using indexes into the ring buffer used for a channel. Recently, Google reported seeing transaction reference count underflows occasionally during shutdown. Doug Anderson found a way to reproduce the issue reliably, and bisected the issue to the commit that eliminated the linked lists and the lock. The root cause was ultimately determined to be related to unused transactions being committed as part of the modem shutdown cleanup activity. Unused transactions are not normally expected (except in error cases). The modem uses some ranges of IPA-resident memory, and whenever it shuts down we zero those ranges. In ipa_filter_reset_table() a transaction is allocated to zero modem filter table entries. If hashing is not supported, hashed table memory should not be zeroed. But currently nothing prevents that, and the result is an unused transaction. Something similar occurs when we zero routing table entries for the modem. By preventing any attempt to clear hashed tables when hashing is not supported, the reference count underflow is avoided in this case. Note that there likely remains an issue with properly freeing unused transactions (if they occur due to errors). This patch addresses only the underflows that Google originally reported. Cc: # 6.1.x Fixes: d338ae28d8a8 ("net: ipa: kill all other transaction lists") Tested-by: Douglas Anderson Signed-off-by: Alex Elder Link: https://lore.kernel.org/r/20230724224055.1688854-1-elder@linaro.org Signed-off-by: Jakub Kicinski --- drivers/net/ipa/ipa_table.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/drivers/net/ipa/ipa_table.c b/drivers/net/ipa/ipa_table.c index f0529c31d0b6e5..7b637bb8b41c87 100644 --- a/drivers/net/ipa/ipa_table.c +++ b/drivers/net/ipa/ipa_table.c @@ -273,16 +273,15 @@ static int ipa_filter_reset(struct ipa *ipa, bool modem) if (ret) return ret; - ret = ipa_filter_reset_table(ipa, true, false, modem); - if (ret) + ret = ipa_filter_reset_table(ipa, false, true, modem); + if (ret || !ipa_table_hash_support(ipa)) return ret; - ret = ipa_filter_reset_table(ipa, false, true, modem); + ret = ipa_filter_reset_table(ipa, true, false, modem); if (ret) return ret; - ret = ipa_filter_reset_table(ipa, true, true, modem); - return ret; + return ipa_filter_reset_table(ipa, true, true, modem); } /* The AP routes and modem routes are each contiguous within the @@ -291,12 +290,13 @@ static int ipa_filter_reset(struct ipa *ipa, bool modem) * */ static int ipa_route_reset(struct ipa *ipa, bool modem) { + bool hash_support = ipa_table_hash_support(ipa); u32 modem_route_count = ipa->modem_route_count; struct gsi_trans *trans; u16 first; u16 count; - trans = ipa_cmd_trans_alloc(ipa, 4); + trans = ipa_cmd_trans_alloc(ipa, hash_support ? 4 : 2); if (!trans) { dev_err(&ipa->pdev->dev, "no transaction for %s route reset\n", @@ -313,10 +313,12 @@ static int ipa_route_reset(struct ipa *ipa, bool modem) } ipa_table_reset_add(trans, false, false, false, first, count); - ipa_table_reset_add(trans, false, true, false, first, count); - ipa_table_reset_add(trans, false, false, true, first, count); - ipa_table_reset_add(trans, false, true, true, first, count); + + if (hash_support) { + ipa_table_reset_add(trans, false, true, false, first, count); + ipa_table_reset_add(trans, false, true, true, first, count); + } gsi_trans_commit_wait(trans); From 2c39dd025da489cf87d26469d9f5ff19715324a0 Mon Sep 17 00:00:00 2001 From: Christian Marangi Date: Mon, 24 Jul 2023 05:25:28 +0200 Subject: [PATCH 30/49] net: dsa: qca8k: enable use_single_write for qca8xxx The qca8xxx switch supports 2 way to write reg values, a slow way using mdio and a fast way by sending specially crafted mgmt packet to read/write reg. The fast way can support up to 32 bytes of data as eth packet are used to send/receive. This correctly works for almost the entire regmap of the switch but with the use of some kernel selftests for dsa drivers it was found a funny and interesting hw defect/limitation. For some specific reg, bulk write won't work and will result in writing only part of the requested regs resulting in half data written. This was especially hard to track and discover due to the total strangeness of the problem and also by the specific regs where this occurs. This occurs in the specific regs of the ATU table, where multiple entry needs to be written to compose the entire entry. It was discovered that with a bulk write of 12 bytes on QCA8K_REG_ATU_DATA0 only QCA8K_REG_ATU_DATA0 and QCA8K_REG_ATU_DATA2 were written, but QCA8K_REG_ATU_DATA1 was always zero. Tcpdump was used to make sure the specially crafted packet was correct and this was confirmed. The problem was hard to track as the lack of QCA8K_REG_ATU_DATA1 resulted in an entry somehow possible as the first bytes of the mac address are set in QCA8K_REG_ATU_DATA0 and the entry type is set in QCA8K_REG_ATU_DATA2. Funlly enough writing QCA8K_REG_ATU_DATA1 results in the same problem with QCA8K_REG_ATU_DATA2 empty and QCA8K_REG_ATU_DATA1 and QCA8K_REG_ATU_FUNC correctly written. A speculation on the problem might be that there are some kind of indirection internally when accessing these regs and they can't be accessed all together, due to the fact that it's really a table mapped somewhere in the switch SRAM. Even more funny is the fact that every other reg was tested with all kind of combination and they are not affected by this problem. Read operation was also tested and always worked so it's not affected by this problem. The problem is not present if we limit writing a single reg at times. To handle this hardware defect, enable use_single_write so that bulk api can correctly split the write in multiple different operation effectively reverting to a non-bulk write. Cc: Mark Brown Fixes: c766e077d927 ("net: dsa: qca8k: convert to regmap read/write API") Signed-off-by: Christian Marangi Cc: stable@vger.kernel.org Signed-off-by: David S. Miller --- drivers/net/dsa/qca/qca8k-8xxx.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/dsa/qca/qca8k-8xxx.c b/drivers/net/dsa/qca/qca8k-8xxx.c index 09b80644c11bd6..efe9380d4a15d2 100644 --- a/drivers/net/dsa/qca/qca8k-8xxx.c +++ b/drivers/net/dsa/qca/qca8k-8xxx.c @@ -576,8 +576,11 @@ static struct regmap_config qca8k_regmap_config = { .rd_table = &qca8k_readable_table, .disable_locking = true, /* Locking is handled by qca8k read/write */ .cache_type = REGCACHE_NONE, /* Explicitly disable CACHE */ - .max_raw_read = 32, /* mgmt eth can read/write up to 8 registers at time */ - .max_raw_write = 32, + .max_raw_read = 32, /* mgmt eth can read up to 8 registers at time */ + /* ATU regs suffer from a bug where some data are not correctly + * written. Disable bulk write to correctly write ATU entry. + */ + .use_single_write = true, }; static int From 80248d4160894d7e40b04111bdbaa4ff93fc4bd7 Mon Sep 17 00:00:00 2001 From: Christian Marangi Date: Mon, 24 Jul 2023 05:25:29 +0200 Subject: [PATCH 31/49] net: dsa: qca8k: fix search_and_insert wrong handling of new rule On inserting a mdb entry, fdb_search_and_insert is used to add a port to the qca8k target entry in the FDB db. A FDB entry can't be modified so it needs to be removed and insert again with the new values. To detect if an entry already exist, the SEARCH operation is used and we check the aging of the entry. If the entry is not 0, the entry exist and we proceed to delete it. Current code have 2 main problem: - The condition to check if the FDB entry exist is wrong and should be the opposite. - When a FDB entry doesn't exist, aging was never actually set to the STATIC value resulting in allocating an invalid entry. Fix both problem by adding aging support to the function, calling the function with STATIC as aging by default and finally by correct the condition to check if the entry actually exist. Fixes: ba8f870dfa63 ("net: dsa: qca8k: add support for mdb_add/del") Signed-off-by: Christian Marangi Cc: stable@vger.kernel.org Signed-off-by: David S. Miller --- drivers/net/dsa/qca/qca8k-common.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/net/dsa/qca/qca8k-common.c b/drivers/net/dsa/qca/qca8k-common.c index 8c2dc0e48ff4b9..4909603d07c8e9 100644 --- a/drivers/net/dsa/qca/qca8k-common.c +++ b/drivers/net/dsa/qca/qca8k-common.c @@ -244,7 +244,7 @@ void qca8k_fdb_flush(struct qca8k_priv *priv) } static int qca8k_fdb_search_and_insert(struct qca8k_priv *priv, u8 port_mask, - const u8 *mac, u16 vid) + const u8 *mac, u16 vid, u8 aging) { struct qca8k_fdb fdb = { 0 }; int ret; @@ -261,10 +261,12 @@ static int qca8k_fdb_search_and_insert(struct qca8k_priv *priv, u8 port_mask, goto exit; /* Rule exist. Delete first */ - if (!fdb.aging) { + if (fdb.aging) { ret = qca8k_fdb_access(priv, QCA8K_FDB_PURGE, -1); if (ret) goto exit; + } else { + fdb.aging = aging; } /* Add port to fdb portmask */ @@ -810,7 +812,8 @@ int qca8k_port_mdb_add(struct dsa_switch *ds, int port, const u8 *addr = mdb->addr; u16 vid = mdb->vid; - return qca8k_fdb_search_and_insert(priv, BIT(port), addr, vid); + return qca8k_fdb_search_and_insert(priv, BIT(port), addr, vid, + QCA8K_ATU_STATUS_STATIC); } int qca8k_port_mdb_del(struct dsa_switch *ds, int port, From ae70dcb9d9ecaf7d9836d3e1b5bef654d7ef5680 Mon Sep 17 00:00:00 2001 From: Christian Marangi Date: Mon, 24 Jul 2023 05:25:30 +0200 Subject: [PATCH 32/49] net: dsa: qca8k: fix broken search_and_del On deleting an MDB entry for a port, fdb_search_and_del is used. An FDB entry can't be modified so it needs to be deleted and readded again with the new portmap (and the port deleted as requested) We use the SEARCH operator to search the entry to edit by vid and mac address and then we check the aging if we actually found an entry. Currently the code suffer from a bug where the searched fdb entry is never read again with the found values (if found) resulting in the code always returning -EINVAL as aging was always 0. Fix this by correctly read the fdb entry after it was searched. Fixes: ba8f870dfa63 ("net: dsa: qca8k: add support for mdb_add/del") Signed-off-by: Christian Marangi Cc: stable@vger.kernel.org Signed-off-by: David S. Miller --- drivers/net/dsa/qca/qca8k-common.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/net/dsa/qca/qca8k-common.c b/drivers/net/dsa/qca/qca8k-common.c index 4909603d07c8e9..b644c05337c513 100644 --- a/drivers/net/dsa/qca/qca8k-common.c +++ b/drivers/net/dsa/qca/qca8k-common.c @@ -293,6 +293,10 @@ static int qca8k_fdb_search_and_del(struct qca8k_priv *priv, u8 port_mask, if (ret < 0) goto exit; + ret = qca8k_fdb_read(priv, &fdb); + if (ret < 0) + goto exit; + /* Rule doesn't exist. Why delete? */ if (!fdb.aging) { ret = -EINVAL; From dfd739f182b00b02bd7470ed94d112684cc04fa2 Mon Sep 17 00:00:00 2001 From: Christian Marangi Date: Mon, 24 Jul 2023 05:25:31 +0200 Subject: [PATCH 33/49] net: dsa: qca8k: fix mdb add/del case with 0 VID The qca8k switch doesn't support using 0 as VID and require a default VID to be always set. MDB add/del function doesn't currently handle this and are currently setting the default VID. Fix this by correctly handling this corner case and internally use the default VID for VID 0 case. Fixes: ba8f870dfa63 ("net: dsa: qca8k: add support for mdb_add/del") Signed-off-by: Christian Marangi Cc: stable@vger.kernel.org Signed-off-by: David S. Miller --- drivers/net/dsa/qca/qca8k-common.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/dsa/qca/qca8k-common.c b/drivers/net/dsa/qca/qca8k-common.c index b644c05337c513..13b8452ce5b2bc 100644 --- a/drivers/net/dsa/qca/qca8k-common.c +++ b/drivers/net/dsa/qca/qca8k-common.c @@ -816,6 +816,9 @@ int qca8k_port_mdb_add(struct dsa_switch *ds, int port, const u8 *addr = mdb->addr; u16 vid = mdb->vid; + if (!vid) + vid = QCA8K_PORT_VID_DEF; + return qca8k_fdb_search_and_insert(priv, BIT(port), addr, vid, QCA8K_ATU_STATUS_STATIC); } @@ -828,6 +831,9 @@ int qca8k_port_mdb_del(struct dsa_switch *ds, int port, const u8 *addr = mdb->addr; u16 vid = mdb->vid; + if (!vid) + vid = QCA8K_PORT_VID_DEF; + return qca8k_fdb_search_and_del(priv, BIT(port), addr, vid); } From d4a7ce642100765119a872d4aba1bf63e3a22c8a Mon Sep 17 00:00:00 2001 From: Muhammad Husaini Zulkifli Date: Mon, 24 Jul 2023 09:12:50 -0700 Subject: [PATCH 34/49] igc: Fix Kernel Panic during ndo_tx_timeout callback The Xeon validation group has been carrying out some loaded tests with various HW configurations, and they have seen some transmit queue time out happening during the test. This will cause the reset adapter function to be called by igc_tx_timeout(). Similar race conditions may arise when the interface is being brought down and up in igc_reinit_locked(), an interrupt being generated, and igc_clean_tx_irq() being called to complete the TX. When the igc_tx_timeout() function is invoked, this patch will turn off all TX ring HW queues during igc_down() process. TX ring HW queues will be activated again during the igc_configure_tx_ring() process when performing the igc_up() procedure later. This patch also moved existing igc_disable_tx_ring_hw() to avoid using forward declaration. Kernel trace: [ 7678.747813] ------------[ cut here ]------------ [ 7678.757914] NETDEV WATCHDOG: enp1s0 (igc): transmit queue 2 timed out [ 7678.770117] WARNING: CPU: 0 PID: 13 at net/sched/sch_generic.c:525 dev_watchdog+0x1ae/0x1f0 [ 7678.784459] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 7678.784496] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [ 7679.200403] RIP: 0010:dev_watchdog+0x1ae/0x1f0 [ 7679.210201] Code: 28 e9 53 ff ff ff 4c 89 e7 c6 05 06 42 b9 00 01 e8 17 d1 fb ff 44 89 e9 4c 89 e6 48 c7 c7 40 ad fb 81 48 89 c2 e8 52 62 82 ff <0f> 0b e9 72 ff ff ff 65 8b 05 80 7d 7c 7e 89 c0 48 0f a3 05 0a c1 [ 7679.245438] RSP: 0018:ffa00000001f7d90 EFLAGS: 00010282 [ 7679.256021] RAX: 0000000000000000 RBX: ff11000109938440 RCX: 0000000000000000 [ 7679.268710] RDX: ff11000361e26cd8 RSI: ff11000361e1b880 RDI: ff11000361e1b880 [ 7679.281314] RBP: ffa00000001f7da8 R08: ff1100035f8fffe8 R09: 0000000000027ffb [ 7679.293840] R10: 0000000000001f0a R11: ff1100035f840000 R12: ff11000109938000 [ 7679.306276] R13: 0000000000000002 R14: dead000000000122 R15: ffa00000001f7e18 [ 7679.318648] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 7679.332064] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7679.342757] CR2: 00007ffff7fca168 CR3: 000000013b08a006 CR4: 0000000000471ef8 [ 7679.354984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 7679.367207] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 7679.379370] PKRU: 55555554 [ 7679.386446] Call Trace: [ 7679.393152] [ 7679.399363] ? __pfx_dev_watchdog+0x10/0x10 [ 7679.407870] call_timer_fn+0x31/0x110 [ 7679.415698] expire_timers+0xb2/0x120 [ 7679.423403] run_timer_softirq+0x179/0x1e0 [ 7679.431532] ? __schedule+0x2b1/0x820 [ 7679.439078] __do_softirq+0xd1/0x295 [ 7679.446426] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 7679.454867] run_ksoftirqd+0x22/0x30 [ 7679.462058] smpboot_thread_fn+0xb7/0x160 [ 7679.469670] kthread+0xcd/0xf0 [ 7679.476097] ? __pfx_kthread+0x10/0x10 [ 7679.483211] ret_from_fork+0x29/0x50 [ 7679.490047] [ 7679.495204] ---[ end trace 0000000000000000 ]--- [ 7679.503179] igc 0000:01:00.0 enp1s0: Register Dump [ 7679.511230] igc 0000:01:00.0 enp1s0: Register Name Value [ 7679.519892] igc 0000:01:00.0 enp1s0: CTRL 181c0641 [ 7679.528782] igc 0000:01:00.0 enp1s0: STATUS 40280683 [ 7679.537551] igc 0000:01:00.0 enp1s0: CTRL_EXT 10000040 [ 7679.546284] igc 0000:01:00.0 enp1s0: MDIC 180a3800 [ 7679.554942] igc 0000:01:00.0 enp1s0: ICR 00000081 [ 7679.563503] igc 0000:01:00.0 enp1s0: RCTL 04408022 [ 7679.571963] igc 0000:01:00.0 enp1s0: RDLEN[0-3] 00001000 00001000 00001000 00001000 [ 7679.583075] igc 0000:01:00.0 enp1s0: RDH[0-3] 00000068 000000b6 0000000f 00000031 [ 7679.594162] igc 0000:01:00.0 enp1s0: RDT[0-3] 00000066 000000b2 0000000e 00000030 [ 7679.605174] igc 0000:01:00.0 enp1s0: RXDCTL[0-3] 02040808 02040808 02040808 02040808 [ 7679.616196] igc 0000:01:00.0 enp1s0: RDBAL[0-3] 1bb7c000 1bb7f000 1bb82000 0ef33000 [ 7679.627242] igc 0000:01:00.0 enp1s0: RDBAH[0-3] 00000001 00000001 00000001 00000001 [ 7679.638256] igc 0000:01:00.0 enp1s0: TCTL a503f0fa [ 7679.646607] igc 0000:01:00.0 enp1s0: TDBAL[0-3] 2ba4a000 1bb6f000 1bb74000 1bb79000 [ 7679.657609] igc 0000:01:00.0 enp1s0: TDBAH[0-3] 00000001 00000001 00000001 00000001 [ 7679.668551] igc 0000:01:00.0 enp1s0: TDLEN[0-3] 00001000 00001000 00001000 00001000 [ 7679.679470] igc 0000:01:00.0 enp1s0: TDH[0-3] 000000a7 0000002d 000000bf 000000d9 [ 7679.690406] igc 0000:01:00.0 enp1s0: TDT[0-3] 000000a7 0000002d 000000bf 000000d9 [ 7679.701264] igc 0000:01:00.0 enp1s0: TXDCTL[0-3] 02100108 02100108 02100108 02100108 [ 7679.712123] igc 0000:01:00.0 enp1s0: Reset adapter [ 7683.085967] igc 0000:01:00.0 enp1s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX [ 8086.945561] ------------[ cut here ]------------ Entering kdb (current=0xffffffff8220b200, pid 0) on processor 0 Oops: (null) due to oops @ 0xffffffff81573888 RIP: 0010:dql_completed+0x148/0x160 Code: c9 00 48 89 57 58 e9 46 ff ff ff 45 85 e4 41 0f 95 c4 41 39 db 0f 95 c1 41 84 cc 74 05 45 85 ed 78 0a 44 89 c1 e9 27 ff ff ff <0f> 0b 01 f6 44 89 c1 29 f1 0f 48 ca eb 8c cc cc cc cc cc cc cc cc RSP: 0018:ffa0000000003e00 EFLAGS: 00010287 RAX: 000000000000006c RBX: ffa0000003eb0f78 RCX: ff11000109938000 RDX: 0000000000000003 RSI: 0000000000000160 RDI: ff110001002e9480 RBP: ffa0000000003ed8 R08: ff110001002e93c0 R09: ffa0000000003d28 R10: 0000000000007cc0 R11: 0000000000007c54 R12: 00000000ffffffd9 R13: ff1100037039cb00 R14: 00000000ffffffd9 R15: ff1100037039c048 FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: ? igc_poll+0x1a9/0x14d0 [igc] __napi_poll+0x2e/0x1b0 net_rx_action+0x126/0x250 __do_softirq+0xd1/0x295 irq_exit_rcu+0xc5/0xf0 common_interrupt+0x86/0xa0 asm_common_interrupt+0x27/0x40 RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 cpuidle_enter+0x2e/0x50 call_cpuidle+0x23/0x40 do_idle+0x1be/0x220 cpu_startup_entry+0x20/0x30 rest_init+0xb5/0xc0 arch_call_rest_init+0xe/0x30 start_kernel+0x448/0x760 x86_64_start_kernel+0x109/0x150 secondary_startup_64_no_verify+0xe0/0xeb more> [0]kdb> [0]kdb> [0]kdb> go Catastrophic error detected kdb_continue_catastrophic=0, type go a second time if you really want to continue [0]kdb> go Catastrophic error detected kdb_continue_catastrophic=0, attempting to continue [ 8086.955689] refcount_t: underflow; use-after-free. [ 8086.955697] WARNING: CPU: 0 PID: 0 at lib/refcount.c:28 refcount_warn_saturate+0xc2/0x110 [ 8086.955706] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 8086.955751] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [ 8086.955784] RIP: 0010:refcount_warn_saturate+0xc2/0x110 [ 8086.955788] Code: 01 e8 82 e7 b4 ff 0f 0b 5d c3 cc cc cc cc 80 3d 68 c6 eb 00 00 75 81 48 c7 c7 a0 87 f6 81 c6 05 58 c6 eb 00 01 e8 5e e7 b4 ff <0f> 0b 5d c3 cc cc cc cc 80 3d 42 c6 eb 00 00 0f 85 59 ff ff ff 48 [ 8086.955790] RSP: 0018:ffa0000000003da0 EFLAGS: 00010286 [ 8086.955793] RAX: 0000000000000000 RBX: ff1100011da40ee0 RCX: ff11000361e1b888 [ 8086.955794] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ff11000361e1b880 [ 8086.955795] RBP: ffa0000000003da0 R08: 80000000ffff9f45 R09: ffa0000000003d28 [ 8086.955796] R10: ff1100035f840000 R11: 0000000000000028 R12: ff11000319ff8000 [ 8086.955797] R13: ff1100011bb79d60 R14: 00000000ffffffd6 R15: ff1100037039cb00 [ 8086.955798] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 8086.955800] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8086.955801] CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 [ 8086.955803] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8086.955803] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 8086.955804] PKRU: 55555554 [ 8086.955805] Call Trace: [ 8086.955806] [ 8086.955808] tcp_wfree+0x112/0x130 [ 8086.955814] skb_release_head_state+0x24/0xa0 [ 8086.955818] napi_consume_skb+0x9c/0x160 [ 8086.955821] igc_poll+0x5d8/0x14d0 [igc] [ 8086.955835] __napi_poll+0x2e/0x1b0 [ 8086.955839] net_rx_action+0x126/0x250 [ 8086.955843] __do_softirq+0xd1/0x295 [ 8086.955846] irq_exit_rcu+0xc5/0xf0 [ 8086.955851] common_interrupt+0x86/0xa0 [ 8086.955857] [ 8086.955857] [ 8086.955858] asm_common_interrupt+0x27/0x40 [ 8086.955862] RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 [ 8086.955866] Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d [ 8086.955867] RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 [ 8086.955869] RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f [ 8086.955870] RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 [ 8086.955871] RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 [ 8086.955872] R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 [ 8086.955873] R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 [ 8086.955875] cpuidle_enter+0x2e/0x50 [ 8086.955880] call_cpuidle+0x23/0x40 [ 8086.955884] do_idle+0x1be/0x220 [ 8086.955887] cpu_startup_entry+0x20/0x30 [ 8086.955889] rest_init+0xb5/0xc0 [ 8086.955892] arch_call_rest_init+0xe/0x30 [ 8086.955895] start_kernel+0x448/0x760 [ 8086.955898] x86_64_start_kernel+0x109/0x150 [ 8086.955900] secondary_startup_64_no_verify+0xe0/0xeb [ 8086.955904] [ 8086.955904] ---[ end trace 0000000000000000 ]--- [ 8086.955912] ------------[ cut here ]------------ [ 8086.955913] kernel BUG at lib/dynamic_queue_limits.c:27! [ 8086.955918] invalid opcode: 0000 [#1] SMP [ 8086.955922] RIP: 0010:dql_completed+0x148/0x160 [ 8086.955925] Code: c9 00 48 89 57 58 e9 46 ff ff ff 45 85 e4 41 0f 95 c4 41 39 db 0f 95 c1 41 84 cc 74 05 45 85 ed 78 0a 44 89 c1 e9 27 ff ff ff <0f> 0b 01 f6 44 89 c1 29 f1 0f 48 ca eb 8c cc cc cc cc cc cc cc cc [ 8086.955927] RSP: 0018:ffa0000000003e00 EFLAGS: 00010287 [ 8086.955928] RAX: 000000000000006c RBX: ffa0000003eb0f78 RCX: ff11000109938000 [ 8086.955929] RDX: 0000000000000003 RSI: 0000000000000160 RDI: ff110001002e9480 [ 8086.955930] RBP: ffa0000000003ed8 R08: ff110001002e93c0 R09: ffa0000000003d28 [ 8086.955931] R10: 0000000000007cc0 R11: 0000000000007c54 R12: 00000000ffffffd9 [ 8086.955932] R13: ff1100037039cb00 R14: 00000000ffffffd9 R15: ff1100037039c048 [ 8086.955933] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 8086.955934] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8086.955935] CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 [ 8086.955936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8086.955937] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 8086.955938] PKRU: 55555554 [ 8086.955939] Call Trace: [ 8086.955939] [ 8086.955940] ? igc_poll+0x1a9/0x14d0 [igc] [ 8086.955949] __napi_poll+0x2e/0x1b0 [ 8086.955952] net_rx_action+0x126/0x250 [ 8086.955956] __do_softirq+0xd1/0x295 [ 8086.955958] irq_exit_rcu+0xc5/0xf0 [ 8086.955961] common_interrupt+0x86/0xa0 [ 8086.955964] [ 8086.955965] [ 8086.955965] asm_common_interrupt+0x27/0x40 [ 8086.955968] RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 [ 8086.955971] Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d [ 8086.955972] RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 [ 8086.955973] RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f [ 8086.955974] RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 [ 8086.955974] RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 [ 8086.955975] R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 [ 8086.955976] R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 [ 8086.955978] cpuidle_enter+0x2e/0x50 [ 8086.955981] call_cpuidle+0x23/0x40 [ 8086.955984] do_idle+0x1be/0x220 [ 8086.955985] cpu_startup_entry+0x20/0x30 [ 8086.955987] rest_init+0xb5/0xc0 [ 8086.955990] arch_call_rest_init+0xe/0x30 [ 8086.955992] start_kernel+0x448/0x760 [ 8086.955994] x86_64_start_kernel+0x109/0x150 [ 8086.955996] secondary_startup_64_no_verify+0xe0/0xeb [ 8086.955998] [ 8086.955999] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 8086.956029] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [16762.543675] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.593 msecs [16762.543678] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.595 msecs [16762.543673] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.495 msecs [16762.543679] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.599 msecs [16762.543678] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.598 msecs [16762.543690] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.605 msecs [16762.543684] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.599 msecs [16762.543693] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.613 msecs [16762.543784] ---[ end trace 0000000000000000 ]--- [16762.849099] RIP: 0010:dql_completed+0x148/0x160 PANIC: Fatal exception in interrupt Fixes: 9b275176270e ("igc: Add ndo_tx_timeout support") Tested-by: Alejandra Victoria Alcaraz Signed-off-by: Muhammad Husaini Zulkifli Acked-by: Sasha Neftin Tested-by: Naama Meir Signed-off-by: Tony Nguyen Reviewed-by: Simon Horman Signed-off-by: David S. Miller --- drivers/net/ethernet/intel/igc/igc_main.c | 40 ++++++++++++++++------- 1 file changed, 28 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c index f36bc2a1849ad8..bdeb36790d774d 100644 --- a/drivers/net/ethernet/intel/igc/igc_main.c +++ b/drivers/net/ethernet/intel/igc/igc_main.c @@ -316,6 +316,33 @@ static void igc_clean_all_tx_rings(struct igc_adapter *adapter) igc_clean_tx_ring(adapter->tx_ring[i]); } +static void igc_disable_tx_ring_hw(struct igc_ring *ring) +{ + struct igc_hw *hw = &ring->q_vector->adapter->hw; + u8 idx = ring->reg_idx; + u32 txdctl; + + txdctl = rd32(IGC_TXDCTL(idx)); + txdctl &= ~IGC_TXDCTL_QUEUE_ENABLE; + txdctl |= IGC_TXDCTL_SWFLUSH; + wr32(IGC_TXDCTL(idx), txdctl); +} + +/** + * igc_disable_all_tx_rings_hw - Disable all transmit queue operation + * @adapter: board private structure + */ +static void igc_disable_all_tx_rings_hw(struct igc_adapter *adapter) +{ + int i; + + for (i = 0; i < adapter->num_tx_queues; i++) { + struct igc_ring *tx_ring = adapter->tx_ring[i]; + + igc_disable_tx_ring_hw(tx_ring); + } +} + /** * igc_setup_tx_resources - allocate Tx resources (Descriptors) * @tx_ring: tx descriptor ring (for a specific queue) to setup @@ -5058,6 +5085,7 @@ void igc_down(struct igc_adapter *adapter) /* clear VLAN promisc flag so VFTA will be updated if necessary */ adapter->flags &= ~IGC_FLAG_VLAN_PROMISC; + igc_disable_all_tx_rings_hw(adapter); igc_clean_all_tx_rings(adapter); igc_clean_all_rx_rings(adapter); } @@ -7290,18 +7318,6 @@ void igc_enable_rx_ring(struct igc_ring *ring) igc_alloc_rx_buffers(ring, igc_desc_unused(ring)); } -static void igc_disable_tx_ring_hw(struct igc_ring *ring) -{ - struct igc_hw *hw = &ring->q_vector->adapter->hw; - u8 idx = ring->reg_idx; - u32 txdctl; - - txdctl = rd32(IGC_TXDCTL(idx)); - txdctl &= ~IGC_TXDCTL_QUEUE_ENABLE; - txdctl |= IGC_TXDCTL_SWFLUSH; - wr32(IGC_TXDCTL(idx), txdctl); -} - void igc_disable_tx_ring(struct igc_ring *ring) { igc_disable_tx_ring_hw(ring); From f718863aca469a109895cb855e6b81fff4827d71 Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Thu, 20 Jul 2023 21:30:05 +0200 Subject: [PATCH 35/49] netfilter: nft_set_rbtree: fix overlap expiration walk The lazy gc on insert that should remove timed-out entries fails to release the other half of the interval, if any. Can be reproduced with tests/shell/testcases/sets/0044interval_overlap_0 in nftables.git and kmemleak enabled kernel. Second bug is the use of rbe_prev vs. prev pointer. If rbe_prev() returns NULL after at least one iteration, rbe_prev points to element that is not an end interval, hence it should not be removed. Lastly, check the genmask of the end interval if this is active in the current generation. Fixes: c9e6978e2725 ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection") Signed-off-by: Florian Westphal --- net/netfilter/nft_set_rbtree.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index 5c05c9b990fba3..8d73fffd2d09da 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -217,29 +217,37 @@ static void *nft_rbtree_get(const struct net *net, const struct nft_set *set, static int nft_rbtree_gc_elem(const struct nft_set *__set, struct nft_rbtree *priv, - struct nft_rbtree_elem *rbe) + struct nft_rbtree_elem *rbe, + u8 genmask) { struct nft_set *set = (struct nft_set *)__set; struct rb_node *prev = rb_prev(&rbe->node); - struct nft_rbtree_elem *rbe_prev = NULL; + struct nft_rbtree_elem *rbe_prev; struct nft_set_gc_batch *gcb; gcb = nft_set_gc_batch_check(set, NULL, GFP_ATOMIC); if (!gcb) return -ENOMEM; - /* search for expired end interval coming before this element. */ + /* search for end interval coming before this element. + * end intervals don't carry a timeout extension, they + * are coupled with the interval start element. + */ while (prev) { rbe_prev = rb_entry(prev, struct nft_rbtree_elem, node); - if (nft_rbtree_interval_end(rbe_prev)) + if (nft_rbtree_interval_end(rbe_prev) && + nft_set_elem_active(&rbe_prev->ext, genmask)) break; prev = rb_prev(prev); } - if (rbe_prev) { + if (prev) { + rbe_prev = rb_entry(prev, struct nft_rbtree_elem, node); + rb_erase(&rbe_prev->node, &priv->root); atomic_dec(&set->nelems); + nft_set_gc_batch_add(gcb, rbe_prev); } rb_erase(&rbe->node, &priv->root); @@ -321,7 +329,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, /* perform garbage collection to avoid bogus overlap reports. */ if (nft_set_elem_expired(&rbe->ext)) { - err = nft_rbtree_gc_elem(set, priv, rbe); + err = nft_rbtree_gc_elem(set, priv, rbe, genmask); if (err < 0) return err; From 0a771f7b266b02d262900c75f1e175c7fe76fec2 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Sun, 23 Jul 2023 16:24:46 +0200 Subject: [PATCH 36/49] netfilter: nf_tables: skip immediate deactivate in _PREPARE_ERROR On error when building the rule, the immediate expression unbinds the chain, hence objects can be deactivated by the transaction records. Otherwise, it is possible to trigger the following warning: WARNING: CPU: 3 PID: 915 at net/netfilter/nf_tables_api.c:2013 nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] CPU: 3 PID: 915 Comm: chain-bind-err- Not tainted 6.1.39 #1 RIP: 0010:nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] Fixes: 4bedf9eee016 ("netfilter: nf_tables: fix chain binding transaction logic") Reported-by: Kevin Rich Signed-off-by: Pablo Neira Ayuso Signed-off-by: Florian Westphal --- net/netfilter/nft_immediate.c | 27 ++++++++++++++++++--------- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/net/netfilter/nft_immediate.c b/net/netfilter/nft_immediate.c index 407d7197f75bb3..fccb3cf7749c1d 100644 --- a/net/netfilter/nft_immediate.c +++ b/net/netfilter/nft_immediate.c @@ -125,15 +125,27 @@ static void nft_immediate_activate(const struct nft_ctx *ctx, return nft_data_hold(&priv->data, nft_dreg_to_type(priv->dreg)); } +static void nft_immediate_chain_deactivate(const struct nft_ctx *ctx, + struct nft_chain *chain, + enum nft_trans_phase phase) +{ + struct nft_ctx chain_ctx; + struct nft_rule *rule; + + chain_ctx = *ctx; + chain_ctx.chain = chain; + + list_for_each_entry(rule, &chain->rules, list) + nft_rule_expr_deactivate(&chain_ctx, rule, phase); +} + static void nft_immediate_deactivate(const struct nft_ctx *ctx, const struct nft_expr *expr, enum nft_trans_phase phase) { const struct nft_immediate_expr *priv = nft_expr_priv(expr); const struct nft_data *data = &priv->data; - struct nft_ctx chain_ctx; struct nft_chain *chain; - struct nft_rule *rule; if (priv->dreg == NFT_REG_VERDICT) { switch (data->verdict.code) { @@ -143,20 +155,17 @@ static void nft_immediate_deactivate(const struct nft_ctx *ctx, if (!nft_chain_binding(chain)) break; - chain_ctx = *ctx; - chain_ctx.chain = chain; - - list_for_each_entry(rule, &chain->rules, list) - nft_rule_expr_deactivate(&chain_ctx, rule, phase); - switch (phase) { case NFT_TRANS_PREPARE_ERROR: nf_tables_unbind_chain(ctx, chain); - fallthrough; + nft_deactivate_next(ctx->net, chain); + break; case NFT_TRANS_PREPARE: + nft_immediate_chain_deactivate(ctx, chain, phase); nft_deactivate_next(ctx->net, chain); break; default: + nft_immediate_chain_deactivate(ctx, chain, phase); nft_chain_del(chain); chain->bound = false; nft_use_dec(&chain->table->use); From 0ebc1064e4874d5987722a2ddbc18f94aa53b211 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Sun, 23 Jul 2023 16:41:48 +0200 Subject: [PATCH 37/49] netfilter: nf_tables: disallow rule addition to bound chain via NFTA_RULE_CHAIN_ID Bail out with EOPNOTSUPP when adding rule to bound chain via NFTA_RULE_CHAIN_ID. The following warning splat is shown when adding a rule to a deleted bound chain: WARNING: CPU: 2 PID: 13692 at net/netfilter/nf_tables_api.c:2013 nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] CPU: 2 PID: 13692 Comm: chain-bound-rul Not tainted 6.1.39 #1 RIP: 0010:nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] Fixes: d0e2c7de92c7 ("netfilter: nf_tables: add NFT_CHAIN_BINDING") Reported-by: Kevin Rich Signed-off-by: Pablo Neira Ayuso Signed-off-by: Florian Westphal --- net/netfilter/nf_tables_api.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index b9a4d3fd1d3487..d3c6ecd1f5a680 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3811,8 +3811,6 @@ static int nf_tables_newrule(struct sk_buff *skb, const struct nfnl_info *info, NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_CHAIN]); return PTR_ERR(chain); } - if (nft_chain_is_bound(chain)) - return -EOPNOTSUPP; } else if (nla[NFTA_RULE_CHAIN_ID]) { chain = nft_chain_lookup_byid(net, table, nla[NFTA_RULE_CHAIN_ID], @@ -3825,6 +3823,9 @@ static int nf_tables_newrule(struct sk_buff *skb, const struct nfnl_info *info, return -EINVAL; } + if (nft_chain_is_bound(chain)) + return -EOPNOTSUPP; + if (nla[NFTA_RULE_HANDLE]) { handle = be64_to_cpu(nla_get_be64(nla[NFTA_RULE_HANDLE])); rule = __nft_rule_lookup(chain, handle); From d7ddf5f4269fcaf19aafe971e635d91897423a3a Mon Sep 17 00:00:00 2001 From: Arkadiusz Kubalewski Date: Tue, 25 Jul 2023 12:16:41 +0200 Subject: [PATCH 38/49] tools: ynl-gen: fix enum index in _decode_enum(..) Remove wrong index adjustment, which is leftover from adding support for sparse enums. enum.entries_by_val() function shall not subtract the start-value, as it is indexed with real enum value. Fixes: c311aaa74ca1 ("tools: ynl: fix enum-as-flags in the generic CLI") Signed-off-by: Arkadiusz Kubalewski Reviewed-by: Donald Hunter Link: https://lore.kernel.org/r/20230725101642.267248-2-arkadiusz.kubalewski@intel.com Reviewed-by: Jakub Kicinski Signed-off-by: Jakub Kicinski --- tools/net/ynl/lib/ynl.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/net/ynl/lib/ynl.py b/tools/net/ynl/lib/ynl.py index 1b3a36fbb1c376..027b1c0aecb497 100644 --- a/tools/net/ynl/lib/ynl.py +++ b/tools/net/ynl/lib/ynl.py @@ -420,8 +420,8 @@ def _add_attr(self, space, name, value): def _decode_enum(self, rsp, attr_spec): raw = rsp[attr_spec['name']] enum = self.consts[attr_spec['enum']] - i = attr_spec.get('value-start', 0) if 'enum-as-flags' in attr_spec and attr_spec['enum-as-flags']: + i = 0 value = set() while raw: if raw & 1: @@ -429,7 +429,7 @@ def _decode_enum(self, rsp, attr_spec): raw >>= 1 i += 1 else: - value = enum.entries_by_val[raw - i].name + value = enum.entries_by_val[raw].name rsp[attr_spec['name']] = value def _decode_binary(self, attr, attr_spec): From df15c15e6c987a32ecbafe011ffae8a53c84cb4f Mon Sep 17 00:00:00 2001 From: Arkadiusz Kubalewski Date: Tue, 25 Jul 2023 12:16:42 +0200 Subject: [PATCH 39/49] tools: ynl-gen: fix parse multi-attr enum attribute When attribute is enum type and marked as multi-attr, the netlink respond is not parsed, fails with stack trace: Traceback (most recent call last): File "/net-next/tools/net/ynl/./test.py", line 520, in main() File "/net-next/tools/net/ynl/./test.py", line 488, in main dplls=dplls_get(282574471561216) File "/net-next/tools/net/ynl/./test.py", line 48, in dplls_get reply=act(args) File "/net-next/tools/net/ynl/./test.py", line 41, in act reply = ynl.dump(args.dump, attrs) File "/net-next/tools/net/ynl/lib/ynl.py", line 598, in dump return self._op(method, vals, dump=True) File "/net-next/tools/net/ynl/lib/ynl.py", line 584, in _op rsp_msg = self._decode(gm.raw_attrs, op.attr_set.name) File "/net-next/tools/net/ynl/lib/ynl.py", line 451, in _decode self._decode_enum(rsp, attr_spec) File "/net-next/tools/net/ynl/lib/ynl.py", line 408, in _decode_enum value = enum.entries_by_val[raw].name TypeError: unhashable type: 'list' error: 1 Redesign _decode_enum(..) to take a enum int value and translate it to either a bitmask or enum name as expected. Signed-off-by: Arkadiusz Kubalewski Reviewed-by: Donald Hunter Link: https://lore.kernel.org/r/20230725101642.267248-3-arkadiusz.kubalewski@intel.com Reviewed-by: Jakub Kicinski Signed-off-by: Jakub Kicinski --- tools/net/ynl/lib/ynl.py | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/tools/net/ynl/lib/ynl.py b/tools/net/ynl/lib/ynl.py index 027b1c0aecb497..3ca28d4bcb1887 100644 --- a/tools/net/ynl/lib/ynl.py +++ b/tools/net/ynl/lib/ynl.py @@ -417,8 +417,7 @@ def _add_attr(self, space, name, value): pad = b'\x00' * ((4 - len(attr_payload) % 4) % 4) return struct.pack('HH', len(attr_payload) + 4, nl_type) + attr_payload + pad - def _decode_enum(self, rsp, attr_spec): - raw = rsp[attr_spec['name']] + def _decode_enum(self, raw, attr_spec): enum = self.consts[attr_spec['enum']] if 'enum-as-flags' in attr_spec and attr_spec['enum-as-flags']: i = 0 @@ -430,7 +429,7 @@ def _decode_enum(self, rsp, attr_spec): i += 1 else: value = enum.entries_by_val[raw].name - rsp[attr_spec['name']] = value + return value def _decode_binary(self, attr, attr_spec): if attr_spec.struct_name: @@ -438,7 +437,7 @@ def _decode_binary(self, attr, attr_spec): decoded = attr.as_struct(members) for m in members: if m.enum: - self._decode_enum(decoded, m) + decoded[m.name] = self._decode_enum(decoded[m.name], m) elif attr_spec.sub_type: decoded = attr.as_c_array(attr_spec.sub_type) else: @@ -466,6 +465,9 @@ def _decode(self, attrs, space): else: raise Exception(f'Unknown {attr_spec["type"]} with name {attr_spec["name"]}') + if 'enum' in attr_spec: + decoded = self._decode_enum(decoded, attr_spec) + if not attr_spec.is_multi: rsp[attr_spec['name']] = decoded elif attr_spec.name in rsp: @@ -473,8 +475,6 @@ def _decode(self, attrs, space): else: rsp[attr_spec.name] = [decoded] - if 'enum' in attr_spec: - self._decode_enum(rsp, attr_spec) return rsp def _decode_extack_path(self, attrs, attr_set, offset, target): From 016e7ba47f33064fbef8c4307a2485d2669dfd03 Mon Sep 17 00:00:00 2001 From: Matthieu Baerts Date: Tue, 25 Jul 2023 11:34:55 -0700 Subject: [PATCH 40/49] selftests: mptcp: join: only check for ip6tables if needed If 'iptables-legacy' is available, 'ip6tables-legacy' command will be used instead of 'ip6tables'. So no need to look if 'ip6tables' is available in this case. Cc: stable@vger.kernel.org Fixes: 0c4cd3f86a40 ("selftests: mptcp: join: use 'iptables-legacy' if available") Acked-by: Paolo Abeni Signed-off-by: Matthieu Baerts Signed-off-by: Mat Martineau Link: https://lore.kernel.org/r/20230725-send-net-20230725-v1-1-6f60fe7137a9@kernel.org Signed-off-by: Jakub Kicinski --- tools/testing/selftests/net/mptcp/mptcp_join.sh | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh index e6c9d5451c5b50..3c2096ac97ef48 100755 --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh @@ -162,9 +162,7 @@ check_tools() elif ! iptables -V &> /dev/null; then echo "SKIP: Could not run all tests without iptables tool" exit $ksft_skip - fi - - if ! ip6tables -V &> /dev/null; then + elif ! ip6tables -V &> /dev/null; then echo "SKIP: Could not run all tests without ip6tables tool" exit $ksft_skip fi From 21d9b73a7d5241905367098d260a3c68b811da32 Mon Sep 17 00:00:00 2001 From: Paolo Abeni Date: Tue, 25 Jul 2023 11:34:56 -0700 Subject: [PATCH 41/49] mptcp: more accurate NL event generation Currently the mptcp code generate a "new listener" event even if the actual listen() syscall fails. Address the issue moving the event generation call under the successful branch. Cc: stable@vger.kernel.org Fixes: f8c9dfbd875b ("mptcp: add pm listener events") Reviewed-by: Mat Martineau Signed-off-by: Paolo Abeni Signed-off-by: Mat Martineau Link: https://lore.kernel.org/r/20230725-send-net-20230725-v1-2-6f60fe7137a9@kernel.org Signed-off-by: Jakub Kicinski --- net/mptcp/protocol.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 3613489eb6e3b0..3317d1cca15688 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -3723,10 +3723,9 @@ static int mptcp_listen(struct socket *sock, int backlog) if (!err) { sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1); mptcp_copy_inaddrs(sk, ssock->sk); + mptcp_event_pm_listener(ssock->sk, MPTCP_EVENT_LISTENER_CREATED); } - mptcp_event_pm_listener(ssock->sk, MPTCP_EVENT_LISTENER_CREATED); - unlock: release_sock(sk); return err; From 15cec633fc7bfe4cd69aa012c3b35b31acfc86f2 Mon Sep 17 00:00:00 2001 From: Wei Fang Date: Tue, 25 Jul 2023 15:41:48 +0800 Subject: [PATCH 42/49] net: fec: tx processing does not call XDP APIs if budget is 0 According to the clarification [1] in the latest napi.rst, the tx processing cannot call any XDP (or page pool) APIs if the "budget" is 0. Because NAPI is called with the budget of 0 (such as netpoll) indicates we may be in an IRQ context, however, we cannot use the page pool from IRQ context. [1] https://lore.kernel.org/all/20230720161323.2025379-1-kuba@kernel.org/ Fixes: 20f797399035 ("net: fec: recycle pages for transmitted XDP frames") Signed-off-by: Wei Fang Suggested-by: Jakub Kicinski Link: https://lore.kernel.org/r/20230725074148.2936402-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/freescale/fec_main.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c index 073d616193363b..66b5cbdb43b9e3 100644 --- a/drivers/net/ethernet/freescale/fec_main.c +++ b/drivers/net/ethernet/freescale/fec_main.c @@ -1372,7 +1372,7 @@ fec_enet_hwtstamp(struct fec_enet_private *fep, unsigned ts, } static void -fec_enet_tx_queue(struct net_device *ndev, u16 queue_id) +fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget) { struct fec_enet_private *fep; struct xdp_frame *xdpf; @@ -1416,6 +1416,14 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id) if (!skb) goto tx_buf_done; } else { + /* Tx processing cannot call any XDP (or page pool) APIs if + * the "budget" is 0. Because NAPI is called with budget of + * 0 (such as netpoll) indicates we may be in an IRQ context, + * however, we can't use the page pool from IRQ context. + */ + if (unlikely(!budget)) + break; + xdpf = txq->tx_buf[index].xdp; if (bdp->cbd_bufaddr) dma_unmap_single(&fep->pdev->dev, @@ -1508,14 +1516,14 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id) writel(0, txq->bd.reg_desc_active); } -static void fec_enet_tx(struct net_device *ndev) +static void fec_enet_tx(struct net_device *ndev, int budget) { struct fec_enet_private *fep = netdev_priv(ndev); int i; /* Make sure that AVB queues are processed first. */ for (i = fep->num_tx_queues - 1; i >= 0; i--) - fec_enet_tx_queue(ndev, i); + fec_enet_tx_queue(ndev, i, budget); } static void fec_enet_update_cbd(struct fec_enet_priv_rx_q *rxq, @@ -1858,7 +1866,7 @@ static int fec_enet_rx_napi(struct napi_struct *napi, int budget) do { done += fec_enet_rx(ndev, budget - done); - fec_enet_tx(ndev); + fec_enet_tx(ndev, budget); } while ((done < budget) && fec_enet_collect_events(fep)); if (done < budget) { From 0f0fa27b871b53a62c3cd2b054ebcce199277310 Mon Sep 17 00:00:00 2001 From: Jan Stancek Date: Mon, 24 Jul 2023 19:39:04 +0200 Subject: [PATCH 43/49] splice, net: Fix splice_to_socket() for O_NONBLOCK socket LTP sendfile07 [1], which expects sendfile() to return EAGAIN when transferring data from regular file to a "full" O_NONBLOCK socket, started failing after commit 2dc334f1a63a ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()"). sendfile() no longer immediately returns, but now blocks. Removed sock_sendpage() handled this case by setting a MSG_DONTWAIT flag, fix new splice_to_socket() to do the same for O_NONBLOCK sockets. [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/sendfile/sendfile07.c Fixes: 2dc334f1a63a ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()") Acked-by: David Howells Signed-off-by: Jan Stancek Tested-by: Xi Ruoyao Link: https://lore.kernel.org/r/023c0e21e595e00b93903a813bc0bfb9a5d7e368.1690219914.git.jstancek@redhat.com Signed-off-by: Jakub Kicinski --- fs/splice.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/splice.c b/fs/splice.c index 004eb1c4ce318c..3e2a31e1ce6a8f 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -876,6 +876,8 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, msg.msg_flags |= MSG_MORE; if (remain && pipe_occupancy(pipe->head, tail) > 0) msg.msg_flags |= MSG_MORE; + if (out->f_flags & O_NONBLOCK) + msg.msg_flags |= MSG_DONTWAIT; iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, bc, len - remain); From 6c58c8816abb7b93b21fa3b1d0c1726402e5e568 Mon Sep 17 00:00:00 2001 From: Lin Ma Date: Tue, 25 Jul 2023 10:42:27 +0800 Subject: [PATCH 44/49] net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64 The nla_for_each_nested parsing in function mqprio_parse_nlattr() does not check the length of the nested attribute. This can lead to an out-of-attribute read and allow a malformed nlattr (e.g., length 0) to be viewed as 8 byte integer and passed to priv->max_rate/min_rate. This patch adds the check based on nla_len() when check the nla_type(), which ensures that the length of these two attribute must equals sizeof(u64). Fixes: 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and shaper in mqprio") Reviewed-by: Victor Nogueira Signed-off-by: Lin Ma Link: https://lore.kernel.org/r/20230725024227.426561-1-linma@zju.edu.cn Signed-off-by: Jakub Kicinski --- net/sched/sch_mqprio.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c index ab69ff7577fc7d..793009f445c03b 100644 --- a/net/sched/sch_mqprio.c +++ b/net/sched/sch_mqprio.c @@ -290,6 +290,13 @@ static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt, "Attribute type expected to be TCA_MQPRIO_MIN_RATE64"); return -EINVAL; } + + if (nla_len(attr) != sizeof(u64)) { + NL_SET_ERR_MSG_ATTR(extack, attr, + "Attribute TCA_MQPRIO_MIN_RATE64 expected to have 8 bytes length"); + return -EINVAL; + } + if (i >= qopt->num_tc) break; priv->min_rate[i] = nla_get_u64(attr); @@ -312,6 +319,13 @@ static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt, "Attribute type expected to be TCA_MQPRIO_MAX_RATE64"); return -EINVAL; } + + if (nla_len(attr) != sizeof(u64)) { + NL_SET_ERR_MSG_ATTR(extack, attr, + "Attribute TCA_MQPRIO_MAX_RATE64 expected to have 8 bytes length"); + return -EINVAL; + } + if (i >= qopt->num_tc) break; priv->max_rate[i] = nla_get_u64(attr); From 25266128fe16d5632d43ada34c847d7b8daba539 Mon Sep 17 00:00:00 2001 From: Jason Wang Date: Tue, 25 Jul 2023 03:20:49 -0400 Subject: [PATCH 45/49] virtio-net: fix race between set queues and probe A race were found where set_channels could be called after registering but before virtnet_set_queues() in virtnet_probe(). Fixing this by moving the virtnet_set_queues() before netdevice registering. While at it, use _virtnet_set_queues() to avoid holding rtnl as the device is not even registered at that time. Cc: stable@vger.kernel.org Fixes: a220871be66f ("virtio-net: correctly enable multiqueue") Signed-off-by: Jason Wang Acked-by: Michael S. Tsirkin Reviewed-by: Xuan Zhuo Link: https://lore.kernel.org/r/20230725072049.617289-1-jasowang@redhat.com Signed-off-by: Jakub Kicinski --- drivers/net/virtio_net.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 0db14f6b87d3ed..1270c8d23463fa 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -4219,6 +4219,8 @@ static int virtnet_probe(struct virtio_device *vdev) if (vi->has_rss || vi->has_rss_hash_report) virtnet_init_default_rss(vi); + _virtnet_set_queues(vi, vi->curr_queue_pairs); + /* serialize netdev register + virtio_device_ready() with ndo_open() */ rtnl_lock(); @@ -4257,8 +4259,6 @@ static int virtnet_probe(struct virtio_device *vdev) goto free_unregister_netdev; } - virtnet_set_queues(vi, vi->curr_queue_pairs); - /* Assume link up if device can't report link status, otherwise get link status from config. */ netif_carrier_off(dev); From 5c85f7065718a949902b238a6abd8fc907c5d3e0 Mon Sep 17 00:00:00 2001 From: Yuanjun Gong Date: Tue, 25 Jul 2023 11:27:26 +0800 Subject: [PATCH 46/49] benet: fix return value check in be_lancer_xmit_workarounds() in be_lancer_xmit_workarounds(), it should go to label 'tx_drop' if an unexpected value is returned by pskb_trim(). Fixes: 93040ae5cc8d ("be2net: Fix to trim skb for padded vlan packets to workaround an ASIC Bug") Signed-off-by: Yuanjun Gong Link: https://lore.kernel.org/r/20230725032726.15002-1-ruc_gongyuanjun@163.com Signed-off-by: Paolo Abeni --- drivers/net/ethernet/emulex/benet/be_main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c index 18c2fc880d093d..0616b5fe241cbe 100644 --- a/drivers/net/ethernet/emulex/benet/be_main.c +++ b/drivers/net/ethernet/emulex/benet/be_main.c @@ -1138,7 +1138,8 @@ static struct sk_buff *be_lancer_xmit_workarounds(struct be_adapter *adapter, (lancer_chip(adapter) || BE3_chip(adapter) || skb_vlan_tag_present(skb)) && is_ipv4_pkt(skb)) { ip = (struct iphdr *)ip_hdr(skb); - pskb_trim(skb, eth_hdr_len + ntohs(ip->tot_len)); + if (unlikely(pskb_trim(skb, eth_hdr_len + ntohs(ip->tot_len)))) + goto tx_drop; } /* If vlan tag is already inlined in the packet, skip HW VLAN From e46e06ffc6d667a89b979701288e2264f45e6a7b Mon Sep 17 00:00:00 2001 From: Yuanjun Gong Date: Tue, 25 Jul 2023 14:48:10 +0800 Subject: [PATCH 47/49] tipc: check return value of pskb_trim() goto free_skb if an unexpected result is returned by pskb_tirm() in tipc_crypto_rcv_complete(). Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication") Signed-off-by: Yuanjun Gong Reviewed-by: Tung Nguyen Link: https://lore.kernel.org/r/20230725064810.5820-1-ruc_gongyuanjun@163.com Signed-off-by: Paolo Abeni --- net/tipc/crypto.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/tipc/crypto.c b/net/tipc/crypto.c index 577fa5af33ec7b..302fd749c4249a 100644 --- a/net/tipc/crypto.c +++ b/net/tipc/crypto.c @@ -1960,7 +1960,8 @@ static void tipc_crypto_rcv_complete(struct net *net, struct tipc_aead *aead, skb_reset_network_header(*skb); skb_pull(*skb, tipc_ehdr_size(ehdr)); - pskb_trim(*skb, (*skb)->len - aead->authsize); + if (pskb_trim(*skb, (*skb)->len - aead->authsize)) + goto free_skb; /* Validate TIPCv2 message */ if (unlikely(!tipc_msg_validate(skb))) { From ecb4534b6a1c8a9c01d4d1d532d58fc18f23c8da Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Wed, 26 Jul 2023 12:08:28 -0700 Subject: [PATCH 48/49] af_unix: Terminate sun_path when bind()ing pathname socket. kernel test robot reported slab-out-of-bounds access in strlen(). [0] Commit 06d4c8a80836 ("af_unix: Fix fortify_panic() in unix_bind_bsd().") removed unix_mkname_bsd() call in unix_bind_bsd(). If sunaddr->sun_path is not terminated by user and we don't enable CONFIG_INIT_STACK_ALL_ZERO=y, strlen() will do the out-of-bounds access during file creation. Let's go back to strlen()-with-sockaddr_storage way and pack all 108 trickiness into unix_mkname_bsd() with bold comments. [0]: BUG: KASAN: slab-out-of-bounds in strlen (lib/string.c:?) Read of size 1 at addr ffff000015492777 by task fortify_strlen_/168 CPU: 0 PID: 168 Comm: fortify_strlen_ Not tainted 6.5.0-rc1-00333-g3329b603ebba #16 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace (arch/arm64/kernel/stacktrace.c:235) show_stack (arch/arm64/kernel/stacktrace.c:242) dump_stack_lvl (lib/dump_stack.c:107) print_report (mm/kasan/report.c:365 mm/kasan/report.c:475) kasan_report (mm/kasan/report.c:590) __asan_report_load1_noabort (mm/kasan/report_generic.c:378) strlen (lib/string.c:?) getname_kernel (./include/linux/fortify-string.h:? fs/namei.c:226) kern_path_create (fs/namei.c:3926) unix_bind (net/unix/af_unix.c:1221 net/unix/af_unix.c:1324) __sys_bind (net/socket.c:1792) __arm64_sys_bind (net/socket.c:1801) invoke_syscall (arch/arm64/kernel/syscall.c:? arch/arm64/kernel/syscall.c:52) el0_svc_common (./include/linux/thread_info.h:127 arch/arm64/kernel/syscall.c:147) do_el0_svc (arch/arm64/kernel/syscall.c:189) el0_svc (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:648) el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:?) el0t_64_sync (arch/arm64/kernel/entry.S:591) Allocated by task 168: kasan_set_track (mm/kasan/common.c:45 mm/kasan/common.c:52) kasan_save_alloc_info (mm/kasan/generic.c:512) __kasan_kmalloc (mm/kasan/common.c:383) __kmalloc (mm/slab_common.c:? mm/slab_common.c:998) unix_bind (net/unix/af_unix.c:257 net/unix/af_unix.c:1213 net/unix/af_unix.c:1324) __sys_bind (net/socket.c:1792) __arm64_sys_bind (net/socket.c:1801) invoke_syscall (arch/arm64/kernel/syscall.c:? arch/arm64/kernel/syscall.c:52) el0_svc_common (./include/linux/thread_info.h:127 arch/arm64/kernel/syscall.c:147) do_el0_svc (arch/arm64/kernel/syscall.c:189) el0_svc (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:648) el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:?) el0t_64_sync (arch/arm64/kernel/entry.S:591) The buggy address belongs to the object at ffff000015492700 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 0 bytes to the right of allocated 119-byte region [ffff000015492700, ffff000015492777) The buggy address belongs to the physical page: page:00000000aeab52ba refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x55492 anon flags: 0x3fffc0000000200(slab|node=0|zone=0|lastcpupid=0xffff) page_type: 0xffffffff() raw: 03fffc0000000200 ffff0000084018c0 fffffc00003d0e00 0000000000000005 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff000015492600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff000015492680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff000015492700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 07 fc ^ ffff000015492780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff000015492800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 06d4c8a80836 ("af_unix: Fix fortify_panic() in unix_bind_bsd().") Reported-by: kernel test robot Closes: https://lore.kernel.org/netdev/202307262110.659e5e8-oliver.sang@intel.com/ Signed-off-by: Kuniyuki Iwashima Reviewed-by: Kees Cook Link: https://lore.kernel.org/r/20230726190828.47874-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni --- net/unix/af_unix.c | 21 ++++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index bbacf4c60fe3ac..78585217f61a69 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -289,17 +289,29 @@ static int unix_validate_addr(struct sockaddr_un *sunaddr, int addr_len) return 0; } -static void unix_mkname_bsd(struct sockaddr_un *sunaddr, int addr_len) +static int unix_mkname_bsd(struct sockaddr_un *sunaddr, int addr_len) { + struct sockaddr_storage *addr = (struct sockaddr_storage *)sunaddr; + short offset = offsetof(struct sockaddr_storage, __data); + + BUILD_BUG_ON(offset != offsetof(struct sockaddr_un, sun_path)); + /* This may look like an off by one error but it is a bit more * subtle. 108 is the longest valid AF_UNIX path for a binding. * sun_path[108] doesn't as such exist. However in kernel space * we are guaranteed that it is a valid memory location in our * kernel address buffer because syscall functions always pass * a pointer of struct sockaddr_storage which has a bigger buffer - * than 108. + * than 108. Also, we must terminate sun_path for strlen() in + * getname_kernel(). + */ + addr->__data[addr_len - offset] = 0; + + /* Don't pass sunaddr->sun_path to strlen(). Otherwise, 108 will + * cause panic if CONFIG_FORTIFY_SOURCE=y. Let __fortify_strlen() + * know the actual buffer. */ - ((char *)sunaddr)[addr_len] = 0; + return strlen(addr->__data) + offset + 1; } static void __unix_remove_socket(struct sock *sk) @@ -1208,8 +1220,7 @@ static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr, struct path parent; int err; - addr_len = strnlen(sunaddr->sun_path, sizeof(sunaddr->sun_path)) - + offsetof(struct sockaddr_un, sun_path) + 1; + addr_len = unix_mkname_bsd(sunaddr, addr_len); addr = unix_create_addr(sunaddr, addr_len); if (!addr) return -ENOMEM; From de52e17326c3e9a719c9ead4adb03467b8fae0ef Mon Sep 17 00:00:00 2001 From: Fedor Pchelkin Date: Wed, 26 Jul 2023 00:46:25 +0300 Subject: [PATCH 49/49] tipc: stop tipc crypto on failure in tipc_node_create If tipc_link_bc_create() fails inside tipc_node_create() for a newly allocated tipc node then we should stop its tipc crypto and free the resources allocated with a call to tipc_crypto_start(). As the node ref is initialized to one to that point, just put the ref on tipc_link_bc_create() error case that would lead to tipc_node_free() be eventually executed and properly clean the node and its crypto resources. Found by Linux Verification Center (linuxtesting.org). Fixes: cb8092d70a6f ("tipc: move bc link creation back to tipc_node_create") Suggested-by: Xin Long Signed-off-by: Fedor Pchelkin Reviewed-by: Xin Long Link: https://lore.kernel.org/r/20230725214628.25246-1-pchelkin@ispras.ru Signed-off-by: Paolo Abeni --- net/tipc/node.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/tipc/node.c b/net/tipc/node.c index 5e000fde806768..a9c5b6594889b4 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -583,7 +583,7 @@ struct tipc_node *tipc_node_create(struct net *net, u32 addr, u8 *peer_id, n->capabilities, &n->bc_entry.inputq1, &n->bc_entry.namedq, snd_l, &n->bc_entry.link)) { pr_warn("Broadcast rcv link creation failed, no memory\n"); - kfree(n); + tipc_node_put(n); n = NULL; goto exit; }