Commit 827e37b

wenjianhn authored and intel-lab-lkp committed
RDMA/mlx5: Use mlx5_cmd_is_down to detect PCIe Surprise Link Down
mlx5r_umr_post_send_wait() gets stuck when the PCIe link goes down, as shown in the call trace [1].

When pciehp detects that the link is down, it calls pci_dev_set_disconnected() before mlx5_ib_dereg_mr(). Thus we can use mlx5_cmd_is_down() to detect PCIe Surprise Link Down in mlx5r_umr_post_send().

[1]
pcieport 0000:b9:01.0: pciehp: Slot(2-4): Link Down
pcieport 0000:b9:01.0: pciehp: Slot(2-4): Card not present
mlx5_core 0000:bb:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
mlx5_core 0000:bb:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
mlx5_core 0000:bb:00.0: poll_health:1083:(pid 0): Fatal error 1 detected
mlx5_core 0000:bb:00.0: print_health_info:491:(pid 0): PCI slot is unavailable
INFO: task irq/105-pciehp:1246 blocked for more than 122 seconds.
      Tainted: G OE 6.8.0-54-generic #56-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:irq/105-pciehp state:D stack:0 pid:1246 tgid:1246 ppid:2 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x27c/0x6b0
 schedule+0x33/0x110
 schedule_timeout+0x157/0x170
 wait_for_completion+0x88/0x150
 mlx5r_umr_post_send_wait+0x15f/0x2d0 [mlx5_ib]
 ? __pfx_mlx5r_umr_done+0x10/0x10 [mlx5_ib]
 mlx5r_umr_revoke_mr+0x98/0xc0 [mlx5_ib]
 __mlx5_ib_dereg_mr+0x24a/0x740 [mlx5_ib]
 ? wait_for_completion+0x114/0x150
 mlx5_ib_dereg_mr+0x21/0xc0 [mlx5_ib]
 ? rdma_restrack_del+0x59/0x160 [ib_core]
 ib_dereg_mr_user+0x41/0xc0 [ib_core]
 uverbs_free_mr+0x15/0x30 [ib_uverbs]
 destroy_hw_idr_uobject+0x21/0x60 [ib_uverbs]
 uverbs_destroy_uobject+0x38/0x1d0 [ib_uverbs]
 __uverbs_cleanup_ufile+0xcf/0x150 [ib_uverbs]
 uverbs_destroy_ufile_hw+0x3f/0x100 [ib_uverbs]
 ib_uverbs_remove_one+0x147/0x1c0 [ib_uverbs]
 remove_client_context+0x95/0x100 [ib_core]
 disable_device+0x8f/0x180 [ib_core]
 __ib_unregister_device+0x108/0x170 [ib_core]
 ? __pfx_mlx5_ib_stage_ib_reg_cleanup+0x10/0x10 [mlx5_ib]
 ib_unregister_device+0x26/0x40 [ib_core]
 mlx5_ib_stage_ib_reg_cleanup+0xe/0x20 [mlx5_ib]
 mlx5r_remove+0x52/0xb0 [mlx5_ib]
 auxiliary_bus_remove+0x1c/0x40
 device_remove+0x40/0x80
 device_release_driver_internal+0x20b/0x270
 device_release_driver+0x12/0x20
 bus_remove_device+0xcb/0x140
 device_del+0x161/0x3e0
 ? is_ib_enabled+0x52/0x90 [mlx5_core]
 mlx5_rescan_drivers_locked+0xfe/0x350 [mlx5_core]
 mlx5_unregister_device+0x38/0x60 [mlx5_core]
 mlx5_uninit_one+0x39/0x160 [mlx5_core]
 remove_one+0x55/0x100 [mlx5_core]
 pci_device_remove+0x3e/0xb0
 device_remove+0x40/0x80
 device_release_driver_internal+0x20b/0x270
 device_release_driver+0x12/0x20
 pci_stop_bus_device+0x7a/0xb0
 pci_stop_and_remove_bus_device+0x12/0x30
 pciehp_unconfigure_device+0x98/0x170
 pciehp_disable_slot+0x69/0x130
 pciehp_handle_presence_or_link_change+0x71/0x220
 pciehp_ist+0x22e/0x260
 ? __pfx_irq_thread_fn+0x10/0x10
 irq_thread_fn+0x21/0x70
 irq_thread+0xf8/0x1c0
 ? __pfx_irq_thread_dtor+0x10/0x10
 ? __pfx_irq_thread+0x10/0x10
 kthread+0xef/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x44/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>

Fixes: 6f0689f ("RDMA/mlx5: Introduce mlx5_umr_post_send_wait()")
Signed-off-by: Jian Wen <wenjian1@xiaomi.com>
1 parent e6d736b commit 827e37b

1 file changed: +1 -1 lines changed
drivers/infiniband/hw/mlx5/umr.c

Lines changed: 1 addition & 1 deletion
@@ -254,7 +254,7 @@ static int mlx5r_umr_post_send(struct ib_qp *ibqp, u32 mkey, struct ib_cqe *cqe,
 	unsigned int idx;
 	int size, err;
 
-	if (unlikely(mdev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR))
+	if (unlikely(mlx5_cmd_is_down(mdev)))
 		return -EIO;
 
 	spin_lock_irqsave(&qp->sq.lock, flags);
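
The effect of swapping the guard can be illustrated with a small userspace model. This is only a sketch, not kernel code: every `model_*` name below is hypothetical, and the body of `model_new_is_down()` assumes that mlx5_cmd_is_down() takes the PCI channel state and the command interface state into account in addition to the driver's internal error state, which this commit does not show. The scenario modelled is a device that pciehp has already marked disconnected while the driver state has not yet moved to internal error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for kernel state; not the real mlx5 types. */
enum model_pci_channel { MODEL_PCI_NORMAL, MODEL_PCI_PERM_FAILURE };
enum model_cmdif { MODEL_CMDIF_UP, MODEL_CMDIF_DOWN };
enum model_dev_state { MODEL_DEV_UP, MODEL_DEV_INTERNAL_ERROR };

struct model_dev {
	enum model_pci_channel pci_channel; /* set to failure on disconnect */
	enum model_cmdif cmdif;             /* command interface state */
	enum model_dev_state state;         /* driver health state */
};

/* Old guard: only the driver's own health state is consulted. */
static bool model_old_is_down(const struct model_dev *dev)
{
	return dev->state == MODEL_DEV_INTERNAL_ERROR;
}

/*
 * New guard (assumed semantics of mlx5_cmd_is_down()): the device counts
 * as down if the PCI channel is offline, the command interface is not up,
 * or the driver is in internal error state.
 */
static bool model_new_is_down(const struct model_dev *dev)
{
	return dev->pci_channel != MODEL_PCI_NORMAL ||
	       dev->cmdif != MODEL_CMDIF_UP ||
	       dev->state == MODEL_DEV_INTERNAL_ERROR;
}

/*
 * Returns -EIO when the guard trips. A return of 0 stands for "the work
 * request would be posted", after which the caller would wait for a
 * completion that a disconnected device can never deliver.
 */
static int model_post_send(const struct model_dev *dev,
			   bool (*is_down)(const struct model_dev *))
{
	if (is_down(dev))
		return -EIO;
	return 0;
}

/* Disconnected device whose health state still reads "up". */
static void model_demo(void)
{
	struct model_dev dev = {
		MODEL_PCI_PERM_FAILURE, MODEL_CMDIF_UP, MODEL_DEV_UP
	};

	/* Old guard lets the request through, so the waiter would hang. */
	assert(model_post_send(&dev, model_old_is_down) == 0);
	/* New guard fails fast with -EIO. */
	assert(model_post_send(&dev, model_new_is_down) == -EIO);
}
```

Under these assumptions, the old guard only fires once the driver itself has entered the error state, whereas the new guard also fires as soon as the PCI layer reports the device as gone, so the dereg path gets -EIO instead of blocking in wait_for_completion().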
