Summary
The EFA provider's efa_mr_reg_ibv_mr() in prov/efa/src/efa_mr.c does not attempt ibv_reg_dmabuf_mr() for CUDA device memory, although it does for Neuron and ROCr. Instead it falls through to plain ibv_reg_mr() with the GPU virtual address, which fails with EFAULT because the kernel EFA driver cannot resolve GPU addresses without dmabuf.
There is already a TODO in the code acknowledging this gap (in the aws/libfabric fork, tag v2.4.0amzn1.0):
/*
* TODO: need such fallback for cuda as well when
* FI_CUDA_API_PERMITTED is true
*/
if (efa_mr_is_neuron(efa_mr) || efa_mr_is_rocr(efa_mr)) {
Environment
- Platform: AWS GB200 (p6e-gb200.36xlarge), aarch64 Grace Blackwell
- EFA SDK: v1.47.0, libfabric v2.4.0amzn1.0
- EFA kernel module: 2.15.3g
- NVIDIA driver: 580.95.05 (GPU Operator)
- CUDA: 13.x (dmabuf support confirmed: cuda dmabuf support status: 1)
- Workload: NIXL disaggregated KV cache transfer using the LIBFABRIC backend with FI_HMEM_CUDA
Reproduction
Register a large CUDA buffer (~11 GB KV cache) via fi_mr_regattr() with attr->iface = FI_HMEM_CUDA on the EFA provider.
Error output:
libfabric::efa:mr:efa_mr_reg_impl():893<warn> Unable to register MR of 11279546368 bytes: Bad address, flags 0
libfabric::efa:mr:efa_mr_regattr():1060<warn> Unable to register MR: Bad address
Root Cause
In efa_mr_reg_ibv_mr() (line ~549 of prov/efa/src/efa_mr.c), the dmabuf path via ofi_hmem_get_dmabuf_fd() + ibv_reg_dmabuf_mr() is only attempted for Neuron and ROCr interfaces. For CUDA, execution falls through to the default ibv_reg_mr() at the end of the function, which passes the GPU virtual address directly. The kernel returns EFAULT because GPU memory cannot be pinned via standard get_user_pages().
The efa_nv_peermem kernel module does not intercept this path; it is not an ib_core peer memory client in the upstream kernel sense.
Modern CUDA drivers (12.x+) support cuMemGetHandleForAddressRange() for dmabuf export, and libfabric's cuda_get_dmabuf_fd() already works (confirmed by cuda_hmem_detect_dmabuf_support() returning status 1 during init). The infrastructure is all in place; the condition just needs to include CUDA.
Fix
Add efa_mr_is_cuda(efa_mr) to the existing dmabuf condition:
// Before (line ~549):
if (efa_mr_is_neuron(efa_mr) || efa_mr_is_rocr(efa_mr)) {

// After:
if (efa_mr_is_neuron(efa_mr) || efa_mr_is_rocr(efa_mr) ||
    efa_mr_is_cuda(efa_mr)) {
This makes the EFA provider call ofi_hmem_get_dmabuf_fd(FI_HMEM_CUDA, ...) to obtain a dmabuf fd, then use ibv_reg_dmabuf_mr() for the registration. If dmabuf is not supported, it falls back to ibv_reg_mr() (same as the existing Neuron/ROCr behavior).
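To make the before/after behavior concrete, here is a small standalone model of the path selection described above. It is only a sketch: every name in it (model_iface, model_path, model_tries_dmabuf, model_select_path, model_self_check) is hypothetical and is not a libfabric or rdma-core symbol, and the dmabuf_fd_available flag simply stands in for whether obtaining a dmabuf fd succeeded. It does not call into the provider or the verbs library.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the HMEM interface of a memory region.
 * These are NOT libfabric types; they only model the decision. */
enum model_iface { MODEL_SYSTEM, MODEL_CUDA, MODEL_NEURON, MODEL_ROCR };

/* Which registration call the model ends up choosing. */
enum model_path { MODEL_PATH_REG_MR, MODEL_PATH_REG_DMABUF_MR };

/* Models the condition guarding the dmabuf attempt.
 * include_cuda == false: current condition (Neuron and ROCr only).
 * include_cuda == true:  proposed condition (CUDA added). */
bool model_tries_dmabuf(enum model_iface iface, bool include_cuda)
{
	return iface == MODEL_NEURON || iface == MODEL_ROCR ||
	       (include_cuda && iface == MODEL_CUDA);
}

/* Models the overall flow: use the dmabuf registration only when the
 * interface is eligible AND a dmabuf fd could be obtained; otherwise
 * fall back to the plain registration call. */
enum model_path model_select_path(enum model_iface iface, bool include_cuda,
				  bool dmabuf_fd_available)
{
	if (model_tries_dmabuf(iface, include_cuda) && dmabuf_fd_available)
		return MODEL_PATH_REG_DMABUF_MR;
	return MODEL_PATH_REG_MR;
}

/* Usage example: checks the cases discussed in this issue. */
void model_self_check(void)
{
	/* Current behavior: CUDA never reaches the dmabuf path. */
	assert(model_select_path(MODEL_CUDA, false, true) == MODEL_PATH_REG_MR);
	/* Proposed behavior: CUDA uses dmabuf when an fd is available. */
	assert(model_select_path(MODEL_CUDA, true, true) == MODEL_PATH_REG_DMABUF_MR);
	/* Proposed behavior: without dmabuf support, fall back as before. */
	assert(model_select_path(MODEL_CUDA, true, false) == MODEL_PATH_REG_MR);
}
```

In this model the change only affects CUDA regions for which a dmabuf fd is available; Neuron, ROCr, and system memory select the same path with either value of include_cuda.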
We have validated this fix on the GB200 platform. With the one-line change, 11 GB VRAM buffers register successfully and NIXL disaggregated inference runs end-to-end:
libfabric_rail_manager.cpp:480] Registered memory on rail 2 (mr=0x28653320, key=7340312)
libfabric_backend.cpp:811] Rail Manager successfully registered VRAM memory on 1 rails with GPU Direct RDMA support
Note: This issue is specific to the EFA provider in the aws/libfabric fork (issues are disabled on that repo). The EFA provider code is maintained by AWS. CC @shijin-aws @shuozhang-amzn