Commit 827e37b
RDMA/mlx5: Use mlx5_cmd_is_down to detect PCIe Surprise Link Down
mlx5r_umr_post_send_wait() gets stuck when the PCIe link goes down,
as shown in the call trace [1].
When pciehp detects that the link is down, it calls
pci_dev_set_disconnected() before mlx5_ib_dereg_mr() runs. We can
therefore use mlx5_cmd_is_down() to detect a PCIe Surprise Link Down in
mlx5r_umr_post_send().
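The idea can be sketched as a small userspace model. This is not the
kernel code: every model_* name below is hypothetical, and the model
assumes a window in which the PCI device is already marked disconnected
while the driver's own state field still reads as up. Under that
assumption, a check that only looks at the driver state lets the request
through (and the caller would then wait for a completion that never
arrives), whereas a check that also looks at the disconnected flag fails
fast.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are the kernel's. */
enum model_dev_state { MODEL_DEV_UP, MODEL_DEV_INTERNAL_ERROR };

struct model_dev {
	enum model_dev_state state; /* driver's own view of device health */
	bool pci_disconnected;      /* set by hotplug when the link drops */
	int posted;                 /* requests handed to the "hardware" */
};

/* Narrow check: only consults the driver's own state field. */
static bool model_state_is_error(const struct model_dev *dev)
{
	return dev->state == MODEL_DEV_INTERNAL_ERROR;
}

/* Broader check: a disconnected PCI device also counts as down. */
static bool model_cmd_is_down(const struct model_dev *dev)
{
	return dev->pci_disconnected || model_state_is_error(dev);
}

/*
 * Post one request using the given "is the device down?" predicate.
 * Returns 0 if the request was posted (the caller would then wait for
 * its completion) or -EIO if the device was judged to be down.
 */
static int model_post_send(struct model_dev *dev,
			   bool (*is_down)(const struct model_dev *))
{
	assert(dev != NULL);
	if (is_down(dev))
		return -EIO; /* fail fast instead of waiting forever */
	dev->posted++;
	return 0;
}
```

With pci_disconnected set and state still MODEL_DEV_UP,
model_post_send() posts the request when given model_state_is_error and
returns -EIO when given model_cmd_is_down.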
[1]
pcieport 0000:b9:01.0: pciehp: Slot(2-4): Link Down
pcieport 0000:b9:01.0: pciehp: Slot(2-4): Card not present
mlx5_core 0000:bb:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
mlx5_core 0000:bb:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
mlx5_core 0000:bb:00.0: poll_health:1083:(pid 0): Fatal error 1 detected
mlx5_core 0000:bb:00.0: print_health_info:491:(pid 0): PCI slot is unavailable
INFO: task irq/105-pciehp:1246 blocked for more than 122 seconds.
Tainted: G OE 6.8.0-54-generic #56-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:irq/105-pciehp state:D stack:0 pid:1246 tgid:1246 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x27c/0x6b0
schedule+0x33/0x110
schedule_timeout+0x157/0x170
wait_for_completion+0x88/0x150
mlx5r_umr_post_send_wait+0x15f/0x2d0 [mlx5_ib]
? __pfx_mlx5r_umr_done+0x10/0x10 [mlx5_ib]
mlx5r_umr_revoke_mr+0x98/0xc0 [mlx5_ib]
__mlx5_ib_dereg_mr+0x24a/0x740 [mlx5_ib]
? wait_for_completion+0x114/0x150
mlx5_ib_dereg_mr+0x21/0xc0 [mlx5_ib]
? rdma_restrack_del+0x59/0x160 [ib_core]
ib_dereg_mr_user+0x41/0xc0 [ib_core]
uverbs_free_mr+0x15/0x30 [ib_uverbs]
destroy_hw_idr_uobject+0x21/0x60 [ib_uverbs]
uverbs_destroy_uobject+0x38/0x1d0 [ib_uverbs]
__uverbs_cleanup_ufile+0xcf/0x150 [ib_uverbs]
uverbs_destroy_ufile_hw+0x3f/0x100 [ib_uverbs]
ib_uverbs_remove_one+0x147/0x1c0 [ib_uverbs]
remove_client_context+0x95/0x100 [ib_core]
disable_device+0x8f/0x180 [ib_core]
__ib_unregister_device+0x108/0x170 [ib_core]
? __pfx_mlx5_ib_stage_ib_reg_cleanup+0x10/0x10 [mlx5_ib]
ib_unregister_device+0x26/0x40 [ib_core]
mlx5_ib_stage_ib_reg_cleanup+0xe/0x20 [mlx5_ib]
mlx5r_remove+0x52/0xb0 [mlx5_ib]
auxiliary_bus_remove+0x1c/0x40
device_remove+0x40/0x80
device_release_driver_internal+0x20b/0x270
device_release_driver+0x12/0x20
bus_remove_device+0xcb/0x140
device_del+0x161/0x3e0
? is_ib_enabled+0x52/0x90 [mlx5_core]
mlx5_rescan_drivers_locked+0xfe/0x350 [mlx5_core]
mlx5_unregister_device+0x38/0x60 [mlx5_core]
mlx5_uninit_one+0x39/0x160 [mlx5_core]
remove_one+0x55/0x100 [mlx5_core]
pci_device_remove+0x3e/0xb0
device_remove+0x40/0x80
device_release_driver_internal+0x20b/0x270
device_release_driver+0x12/0x20
pci_stop_bus_device+0x7a/0xb0
pci_stop_and_remove_bus_device+0x12/0x30
pciehp_unconfigure_device+0x98/0x170
pciehp_disable_slot+0x69/0x130
pciehp_handle_presence_or_link_change+0x71/0x220
pciehp_ist+0x22e/0x260
? __pfx_irq_thread_fn+0x10/0x10
irq_thread_fn+0x21/0x70
irq_thread+0xf8/0x1c0
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
kthread+0xef/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x44/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Fixes: 6f0689f ("RDMA/mlx5: Introduce mlx5_umr_post_send_wait()")
Signed-off-by: Jian Wen <wenjian1@xiaomi.com>
1 parent e6d736b, commit 827e37b
1 file changed, 1 insertion(+), 1 deletion(-): line 257 replaced, with context lines 254-256 and 258-260 unchanged