ucp_mem_map Failure with AMD 7900XTX GPU #10592
-
@chris-barnes-at-etherform-com Thank you for filing the issue. This is most likely not a UCX issue itself but a system setup question.
Unfortunately, the precise combination of Linux kernel and OFED versions required at the moment is a bit tricky: you need either versions that still allow the ib_peer_mem memory registration mechanism to work, or a kernel that supports dmabuf-based registration. (The ROCm runtime has a few additional requirements that are met in newer kernels and newer distros, but not necessarily in some older ones.)
-
To your other question: the kernel boot parameters have to include iommu=pt (passthrough) and amd_iommu=on.
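One quick way to confirm that both settings took effect is to inspect the command line of the running kernel. A minimal sketch, assuming a Linux host where /proc/cmdline is readable; the helper name missing_iommu_params is hypothetical and not part of UCX or ROCm:

```python
from pathlib import Path
from typing import List

# Kernel parameters named in the reply above.
REQUIRED_PARAMS = ("iommu=pt", "amd_iommu=on")


def missing_iommu_params(cmdline: str) -> List[str]:
    """Return the required parameters that are absent from a kernel command line."""
    tokens = cmdline.split()
    # Compare whole tokens so that "amd_iommu=on" is not mistaken for "iommu=on".
    return [param for param in REQUIRED_PARAMS if param not in tokens]


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        # /proc/cmdline holds the parameters the running kernel was booted with.
        missing = missing_iommu_params(proc_cmdline.read_text())
        print("missing:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, the parameters still need to be added to the bootloader configuration and the machine rebooted.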
-
@chris-barnes-at-etherform-com can you ping me individually? I think it will be easier to sort this out over email or chat, and then just post the solution here.
-
For anyone who comes here in the future: setting the environment variable UCX_ROCM_COPY_DMABUF=y fixed the issue. :) Thank you @edgargabriel. The kernel must be capable of supporting DMABuf; I noticed that "rocminfo" will tell you whether it does (towards the top of its output).
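A sketch of how that check and the workaround could be scripted. It assumes that rocminfo prints a line whose label contains "DMAbuf" followed by YES or NO (the exact label is an assumption and may differ between ROCm releases), and the helper names dmabuf_supported and ucx_env are hypothetical:

```python
import os
import shutil
import subprocess
from typing import Dict, Optional


def dmabuf_supported(rocminfo_output: str) -> Optional[bool]:
    """Return True/False from the DMAbuf line, or None if no such line is found."""
    for line in rocminfo_output.splitlines():
        label, sep, value = line.partition(":")
        # Assumed format: "<label containing DMAbuf>: YES" or "...: NO".
        if sep and "dmabuf" in label.lower():
            return value.strip().upper() == "YES"
    return None


def ucx_env(base: Dict[str, str]) -> Dict[str, str]:
    """Copy an environment and set the variable reported as the fix in this thread."""
    env = dict(base)
    env["UCX_ROCM_COPY_DMABUF"] = "y"
    return env


if __name__ == "__main__":
    if shutil.which("rocminfo"):
        out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
        print("DMAbuf support:", dmabuf_supported(out))
    # Pass ucx_env(dict(os.environ)) as env= to subprocess.run() when launching the UCX application.
    print("UCX_ROCM_COPY_DMABUF =", ucx_env(dict(os.environ))["UCX_ROCM_COPY_DMABUF"])
```

A result of None means that no DMAbuf line was found, which may simply indicate an older rocminfo rather than a kernel without support.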
-
Hello,
I am trying to get peer-to-peer working between a Mellanox/NVIDIA ConnectX-6 and an AMD 7900XTX GPU using UCX. I believe I have everything working correctly up to the point where I register the memory region to get the memory handle using ucp_mem_map().
When I call ucp_mem_map(), I see the following trace:
[1743444013.720397] [eng1:13991:0] rcache.c:368 UCX TRACE ucp_rcache: find regions in 0x7ffe8d600000..0x7ffe8fffffff
[1743444013.720400] [eng1:13991:0] rcache.c:368 UCX TRACE ucp_rcache: find regions in 0x7ffe94000000..0x7ffe955fffff
[1743444013.720402] [eng1:13991:0] rcache.c:368 UCX TRACE ucp_rcache: find regions in 0x7ffe8c000000..0x7ffe8fffffff
[1743444013.720403] [eng1:13991:0] rcache.c:368 UCX TRACE ucp_rcache: find regions in 0x7fffeb97e000..0x7fffeb9befff
[1743444013.720405] [eng1:13991:0] rcache.c:368 UCX TRACE ucp_rcache: find regions in 0x7ffe80000000..0x7ffe83ffffff
[1743444013.720407] [eng1:13991:0] rcache.c:368 UCX TRACE ucp_rcache: find regions in 0x7ffeea000000..0x7ffeea3efb1f
[1743444013.720408] [eng1:13991:0] rcache.c:368 UCX TRACE ucp_rcache: find regions in 0x7ffeea000000..0x7ffee9ffffff
[1743444013.720410] [eng1:13991:0] rcache.c:368 UCX TRACE ucp_rcache: find regions in 0x7ffeea3efb20..0x7ffeea3efb1f
[1743444013.721948] [eng1:13991:0] ib_md.c:287 UCX ERROR ibv_reg_mr(address=0x7ffeea000000, length=4127520, access=0xf) failed: Invalid argument
[1743444013.721955] [eng1:13991:0] ucp_mm.c:76 UCX ERROR failed to register address 0x7ffeea000000 (rocm) length 4127520 on md[0]=rxe_dev: Input/output error (md supports: host|rocm)
[1743444013.721958] [eng1:13991:0] rcache.c:1036 UCX DEBUG failed to register region 0x1314bf0 [0x7ffeea000000..0x7ffeea3efb20]: Input/output error
At the moment I am running on my development machine with two processes trying to communicate with each other. My simple test is to move memory from GPU memory to host memory through UCX. The GPU memory is allocated in HIP, then loaded with test data (using normal HIP calls), and then I register the memory using the device memory address. At this point I see the trace posted above.
UCX is compiled with ROCm support and IOMMU is enabled. I am currently using the soft-RDMA driver (could this be the issue?). I see that the GPU is in an IOMMU group all by itself (group 14).
In the trace, UCX appears to recognize that the memory is "rocm" and reports that the network device supports "host|rocm" memory types. The specific error in ibv_reg_mr() seems to indicate a problem registering that memory with remote read/write access. (Note that it shows "Invalid argument" as the error; according to the documentation, that message actually means that the "access" parameter is wrong.)
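To see what access=0xf in the ibv_reg_mr line actually requests, the mask can be decoded against the bit values of enum ibv_access_flags from libibverbs' verbs.h. A small sketch; the helper name decode_access is hypothetical:

```python
import re
from typing import List

# Bit values of enum ibv_access_flags in <infiniband/verbs.h>.
IBV_ACCESS_FLAGS = {
    0x01: "IBV_ACCESS_LOCAL_WRITE",
    0x02: "IBV_ACCESS_REMOTE_WRITE",
    0x04: "IBV_ACCESS_REMOTE_READ",
    0x08: "IBV_ACCESS_REMOTE_ATOMIC",
    0x10: "IBV_ACCESS_MW_BIND",
    0x20: "IBV_ACCESS_ZERO_BASED",
    0x40: "IBV_ACCESS_ON_DEMAND",
}


def decode_access(log_line: str) -> List[str]:
    """Extract access=0x... from a UCX log line and list the flags that are set."""
    match = re.search(r"access=(0x[0-9a-fA-F]+)", log_line)
    if match is None:
        raise ValueError("no access=0x... field in log line")
    mask = int(match.group(1), 16)
    return [name for bit, name in IBV_ACCESS_FLAGS.items() if mask & bit]


LINE = ("ibv_reg_mr(address=0x7ffeea000000, length=4127520, access=0xf) "
        "failed: Invalid argument")
print(decode_access(LINE))
```

For the line in the trace, this lists local write together with remote write, remote read, and remote atomic access.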
Does anyone have any suggestions on how to resolve this issue?
One other issue that I am seeing is that when I run "rocminfo" I see "IOMMU Support:: None" here, despite the fact that the device is in an iommu group. Is this important? Is there a way to resolve this?
Thank you!
Environment variables:
UCX_TLS: rc,ud,dc,rocm_copy,rocm_ipc
HSA_ENABLE_SDMA: 1
UCX_RNDV_SCHEME: put_zcopy
HSA_FORCE_FINE_GRAIN_PCIE: 1
UCX_LOG_LEVEL: TRACE
I am using UCX v1.9 (whatever is on the main branch as of about two weeks ago.)