EFA provider: CUDA memory registration fails with EFAULT — missing dmabuf path for FI_HMEM_CUDA

## Summary

The EFA provider's `efa_mr_reg_ibv_mr()` in `prov/efa/src/efa_mr.c` attempts `ibv_reg_dmabuf_mr()` for Neuron and ROCr device memory but not for CUDA. CUDA registrations fall through to plain `ibv_reg_mr()` with the GPU virtual address, which returns `EFAULT` because the kernel EFA driver cannot resolve GPU addresses without dmabuf.

There is already a TODO in the code acknowledging this gap (in the [aws/libfabric](https://github.com/aws/libfabric) fork, tag `v2.4.0amzn1.0`):

```c
/*
 * TODO: need such fallback for cuda as well when
 * FI_CUDA_API_PERMITTED is true
 */
if (efa_mr_is_neuron(efa_mr) || efa_mr_is_rocr(efa_mr)) {
```

## Environment

- **Platform**: AWS GB200 (p6e-gb200.36xlarge), aarch64 Grace Blackwell
- **EFA SDK**: v1.47.0, libfabric v2.4.0amzn1.0
- **EFA kernel module**: 2.15.3g
- **NVIDIA driver**: 580.95.05 (GPU Operator)
- **CUDA**: 13.x (dmabuf support confirmed: `cuda dmabuf support status: 1`)
- **Workload**: NIXL disaggregated KV cache transfer using the LIBFABRIC backend with `FI_HMEM_CUDA`

## Reproduction

Register a large CUDA buffer (~11 GB KV cache) via `fi_mr_regattr()` with `attr->iface = FI_HMEM_CUDA` on the EFA provider.

**Error output:**
```
libfabric::efa:mr:efa_mr_reg_impl():893<warn> Unable to register MR of 11279546368 bytes: Bad address, flags 0
libfabric::efa:mr:efa_mr_regattr():1060<warn> Unable to register MR: Bad address
```

## Root Cause

In `efa_mr_reg_ibv_mr()` (line ~549 of `prov/efa/src/efa_mr.c`), the dmabuf path via `ofi_hmem_get_dmabuf_fd()` + `ibv_reg_dmabuf_mr()` is only attempted for Neuron and ROCr interfaces. For CUDA, execution falls through to the default `ibv_reg_mr()` at the end of the function, which passes the GPU virtual address directly. The kernel returns `EFAULT` because GPU memory cannot be pinned via standard `get_user_pages()`.

The `efa_nv_peermem` kernel module does not intercept this path — it is not an `ib_core` peer memory client in the upstream kernel sense.

Modern CUDA drivers (12.x+) support `cuMemGetHandleForAddressRange()` for dmabuf export, and libfabric's `cuda_get_dmabuf_fd()` already works (confirmed by `cuda_hmem_detect_dmabuf_support()` returning status 1 during init). The infrastructure is all in place; the condition just needs to include CUDA.

## Fix

Add `efa_mr_is_cuda(efa_mr)` to the existing dmabuf condition:

```c
// Before (line ~549):
if (efa_mr_is_neuron(efa_mr) || efa_mr_is_rocr(efa_mr)) {

// After:
if (efa_mr_is_neuron(efa_mr) || efa_mr_is_rocr(efa_mr) ||
    efa_mr_is_cuda(efa_mr)) {
```

This makes the EFA provider call `ofi_hmem_get_dmabuf_fd(FI_HMEM_CUDA, ...)` to obtain a dmabuf fd, then use `ibv_reg_dmabuf_mr()` for the registration. If dmabuf is not supported, it falls back to `ibv_reg_mr()` (same as the existing Neuron/ROCr behavior).
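The control flow described above can be modeled in a few lines of standalone C. This is only a sketch of the decision logic, not the provider code: every `model_*` name is hypothetical, the `with_fix` flag stands in for the added `efa_mr_is_cuda()` check, and the rule that plain registration fails only for CUDA is an assumption taken from the behavior reported in this issue.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the HMEM interfaces discussed in this issue. */
enum model_iface { MODEL_SYSTEM, MODEL_CUDA, MODEL_ROCR, MODEL_NEURON };

/* Which verbs call the model ends up using. */
enum model_path { MODEL_PATH_REG_MR, MODEL_PATH_REG_DMABUF_MR };

struct model_mr {
    enum model_iface iface;
    bool dmabuf_exportable; /* assumption: would the dmabuf fd export succeed? */
};

/* Mirrors the condition in the issue; with_fix adds CUDA to it. */
bool model_tries_dmabuf(const struct model_mr *mr, bool with_fix)
{
    return mr->iface == MODEL_NEURON || mr->iface == MODEL_ROCR ||
           (with_fix && mr->iface == MODEL_CUDA);
}

/* Returns 0 on success or -EFAULT; *path records the path that was taken. */
int model_reg(const struct model_mr *mr, bool with_fix, enum model_path *path)
{
    assert(mr != NULL && path != NULL);

    if (model_tries_dmabuf(mr, with_fix) && mr->dmabuf_exportable) {
        *path = MODEL_PATH_REG_DMABUF_MR;
        return 0;
    }

    /* Default and fallback: plain registration by virtual address. */
    *path = MODEL_PATH_REG_MR;

    /* Assumption from this report: plain registration of a CUDA address
     * fails with EFAULT on the affected platform. */
    return mr->iface == MODEL_CUDA ? -EFAULT : 0;
}
```

In this model, a CUDA buffer with `with_fix == false` takes the plain path and returns `-EFAULT`, while `with_fix == true` takes the dmabuf path whenever the export succeeds. If the export is unavailable, the model falls back to the plain path, which is the same shape as the Neuron/ROCr fallback.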

We have validated this fix on the GB200 platform. With the one-line change, 11 GB VRAM buffers register successfully and NIXL disaggregated inference runs end-to-end:

```
libfabric_rail_manager.cpp:480] Registered memory on rail 2 (mr=0x28653320, key=7340312)
libfabric_backend.cpp:811] Rail Manager successfully registered VRAM memory on 1 rails with GPU Direct RDMA support
```

**Note:** This issue is specific to the EFA provider in the [aws/libfabric](https://github.com/aws/libfabric) fork (issues are disabled on that repo). The EFA provider code is maintained by AWS. CC @shijin-aws @shuozhang-amzn
