[v2.4.x] prov/efa: use dmabuf for CUDA MR registration by sara4dev · Pull Request #12216 · ofiwg/libfabric

sara4dev · 2026-05-05T06:44:22Z

Summary

- Include CUDA in the EFA provider's implicit dmabuf MR registration path.
- Use the existing ofi_hmem_get_dmabuf_fd() and ibv_reg_dmabuf_mr() flow for FI_HMEM_CUDA.
- Remove the stale TODO that said CUDA still needed this fallback.

Root Cause

CUDA HMEM registrations without FI_MR_DMABUF were not included in the EFA provider's dmabuf path, unlike Neuron and ROCr. That made CUDA memory fall through to plain ibv_reg_mr() with a GPU virtual address, which can fail with EFAULT.
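The dispatch described above can be modelled in a few lines. This is a minimal Python sketch of the decision only, not the code in prov/efa/src/efa_mr.c: the `HmemIface` enum, the two `IMPLICIT_DMABUF_*` sets, and `choose_registration()` are hypothetical names introduced for illustration, and the model ignores any fallback the provider may apply when a dmabuf fd cannot be obtained.

```python
from enum import Enum, auto


class HmemIface(Enum):
    """Hypothetical stand-ins for libfabric's FI_HMEM_* interface identifiers."""
    SYSTEM = auto()
    CUDA = auto()
    NEURON = auto()
    ROCR = auto()


# Assumption: interfaces that take the implicit dmabuf path before and after this PR.
IMPLICIT_DMABUF_BEFORE = frozenset({HmemIface.NEURON, HmemIface.ROCR})
IMPLICIT_DMABUF_AFTER = IMPLICIT_DMABUF_BEFORE | {HmemIface.CUDA}


def choose_registration(iface, mr_dmabuf_flag, implicit_dmabuf_ifaces):
    """Return the name of the verbs call this model would use for one MR."""
    if mr_dmabuf_flag:
        # The caller passed FI_MR_DMABUF and already supplies a dmabuf descriptor.
        return "ibv_reg_dmabuf_mr"
    if iface in implicit_dmabuf_ifaces:
        # The provider fetches the fd itself (ofi_hmem_get_dmabuf_fd) and registers it.
        return "ibv_reg_dmabuf_mr"
    # Everything else is registered by virtual address.
    return "ibv_reg_mr"
```

With the "before" set, `choose_registration(HmemIface.CUDA, False, IMPLICIT_DMABUF_BEFORE)` yields `"ibv_reg_mr"`, which is the path reported to fail with EFAULT; with the "after" set the same call yields `"ibv_reg_dmabuf_mr"`.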

References #12019.

Validation

Tested on a GB200 cluster.

a-szegel · 2026-05-05T15:50:59Z

Can you please sign your commit? `git commit --amend -s`

shijin-aws · 2026-05-07T18:04:02Z

@sara4dev These fixes are already part of v2.5.x as part of c9e3c0c. Would you mind trying it, or do you prefer to stay on v2.4.x?

sara4dev · 2026-05-11T17:38:29Z

@shijin-aws - I see AWS EFA 1.48.0 still uses libfabric 2.4.0amzn3.0. Can we update libfabric alone to 2.5.x?

Also when can we expect AWS EFA to bundle libfabric 2.5.x?

Signed-off-by: Saravana Periyasamy <saperiyasamy@nvidia.com>

sara4dev · 2026-05-11T17:55:10Z

@a-szegel signed my commit now

The `--make-efa` aws stage currently installs only the stock AWS EFA userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric (2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when registering VRAM as an EFA memory region on GB200 hardware, fi_mr_reg falls through to ibv_reg_mr() with a GPU virtual address and returns EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend becomes unusable for VRAM-to-VRAM KV transfers, the primary EFA use case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR ofiwg/libfabric#12216). Until that lands in the EFA installer's default libfabric, every `--make-efa` image needs to build and overlay the patched version. This PR moves that overlay from a downstream post-process script (the AWS-team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1` path, which we've been running on top of dynamo images for the v9-efa fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

- container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is (1.47.0 empirically works on cuda-dl-base; defensive 1.46.0 pin not strictly needed because our patched libfabric overwrites the stock binary anyway). Add new keys `dynamo.patched_libfabric_repo` and `dynamo.patched_libfabric_ref` with defaults `aws/libfabric` and `v2.3.1amzn4.0`. Add a comment noting that 1.48.0 is broken on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is unsatisfiable on Ubuntu's 50.x) as defensive guidance for future bumpers.
- container/templates/args.Dockerfile: declare two new ARGs (gated by `{% if make_efa == true %}`).
- container/templates/aws.Dockerfile: after the existing EFA installer RUN, add a second RUN that installs build deps, clones the patched libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`, installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer can populate either; whichever ldconfig sees first wins), deletes stock `libfabric.so.1.30.*` binaries (defends against hardcoded RPATHs), runs `ldconfig`, and VALIDATES `fi_info --version` reports the patched libfabric. The build fails if validation fails, which turns a deployment-time error into a build-time error.

Also: while editing aws.Dockerfile, fix the related ofi-nccl rm path typo that's the subject of ai-dynamo#9703 (rm /opt/amazon/aws-ofi-nccl was a no-op because the EFA installer puts the plugin at /opt/amazon/ofi-nccl/). Doing it here in one commit since the surrounding RUN block changes substantially anyway; happy to drop this and rebase if ai-dynamo#9703 lands first.

Non-EFA images are completely unchanged: the patched-libfabric build is gated by `{% if make_efa == true %}` in args.Dockerfile and only renders into the aws stage.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

- `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2` (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
- `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2` (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds ~3-5 min of build time and ~200-300 MB of image size (build deps left installed for simplicity; separate PR can purge them if size is a concern). Non-EFA paths unaffected. The build deps added (autoconf, automake, libtool, make, build-essential, pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev, rdma-core, ca-certificates, git) are a superset of what's already pulled in by the EFA installer's own `apt install`, so the marginal cost is small. cuda-dl-base provides `/usr/local/cuda`, which the patched libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the `install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

- ai-dynamo#9703, `fix(container)`: ofi-nccl rm path (this PR includes the same fix; can rebase if needed)
- ai-dynamo#9704, `feat(container)`: render.py --has-trtllm-context flag
- ai-dynamo#9705, `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
- ai-dynamo#9706, `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil dep work as a follow-up)
- this PR, `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target runtime --platform linux/amd64 --cuda-version 13.1 --make-efa --has-trtllm-context` followed by `docker build ...` produces an EFA-correct image with no post-process script needed. nixlbench (the validation tool) is intentionally NOT included; it can live as a separate `--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout (separate builder stage to keep build deps out of the runtime, different default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>

The `--make-efa` aws stage currently installs only the stock AWS EFA userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric (2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when registering VRAM as an EFA memory region on GB200 hardware, fi_mr_reg falls through to ibv_reg_mr() with a GPU virtual address and returns EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend becomes unusable for VRAM-to-VRAM KV transfers, the primary EFA use case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR ofiwg/libfabric#12216). Until that lands in the EFA installer's default libfabric, every `--make-efa` image needs to build and overlay the patched version. This PR moves that overlay from a downstream post-process script (the AWS-team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1` path, which we've been running on top of dynamo images for the v9-efa fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

- container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is (1.47.0 empirically works on cuda-dl-base; our patched libfabric overwrites the stock binary regardless of underlying EFA installer version, so a defensive 1.46.0 pin is not strictly needed). Add new keys `dynamo.patched_libfabric_repo` and `dynamo.patched_libfabric_ref` with defaults `aws/libfabric` and `v2.3.1amzn4.0`. Add a defensive comment about 1.48.0 being broken on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is unsatisfiable on Ubuntu's 50.x) so future bumpers don't trip over it.
- container/templates/args.Dockerfile: declare two new ARGs (gated by `{% if make_efa == true %}`).
- container/templates/aws.Dockerfile: after the existing EFA installer RUN, add a second RUN that installs build deps, clones the patched libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`, installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer can populate either; whichever ldconfig sees first wins), deletes stock `libfabric.so.1.30.*` binaries (defends against hardcoded RPATHs), runs `ldconfig`, and VALIDATES `fi_info --version` reports the patched libfabric. The build fails if validation fails, which turns a deployment-time error into a build-time error.

Non-EFA images are completely unchanged: the patched-libfabric build is gated by `{% if make_efa == true %}` in args.Dockerfile and only renders into the aws stage.

The related ofi-nccl rm path fix (Gap B in our investigation; the existing `rm -rf /opt/amazon/aws-ofi-nccl` targets a path the EFA installer doesn't create) is intentionally NOT in this PR; it's owned by ai-dynamo#9703. If ai-dynamo#9703 lands after this PR, rebase ai-dynamo#9703 onto main; if before, this PR will need a trivial conflict resolution on the existing rm line.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

- `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2` (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
- `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2` (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds ~3-5 min of build time and ~200-300 MB of image size (build deps left installed for simplicity; separate PR can purge them if size is a concern). Non-EFA paths unaffected. The build deps added (autoconf, automake, libtool, make, build-essential, pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev, rdma-core, ca-certificates, git) are a superset of what's already pulled in by the EFA installer's own `apt install`, so the marginal cost is small. cuda-dl-base provides `/usr/local/cuda`, which the patched libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the `install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

- ai-dynamo#9703, `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
- ai-dynamo#9704, `feat(container)`: render.py --has-trtllm-context flag
- ai-dynamo#9705, `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
- ai-dynamo#9706, `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil dep work as a follow-up)
- this PR, `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target runtime --platform linux/amd64 --cuda-version 13.1 --make-efa --has-trtllm-context` followed by `docker build ...` produces an EFA-correct image with no post-process script needed. nixlbench (the validation tool) is intentionally NOT included; it can live as a separate `--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout (separate builder stage to keep build deps out of the runtime, different default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
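Two of the overlay steps described in the commit message, forcing the SONAME symlink and failing the build when the wrong libfabric is reported, can be illustrated with a small model. This is a hedged Python sketch, not the Dockerfile RUN block from that PR: `force_soname_symlink()` and `version_matches()` are hypothetical helpers, the choice to skip a directory that lacks the patched file is an assumption, and the exact text printed by `fi_info --version` is not reproduced here, so the check is a plain substring match on whatever output the caller captured.

```python
from pathlib import Path


def force_soname_symlink(prefix, patched_name, soname="libfabric.so.1"):
    """Point <prefix>/lib and <prefix>/lib64 SONAME links at the patched library.

    A directory is skipped when it does not contain `patched_name`
    (assumption: the installer may have populated only one of the two).
    Returns the list of links that were (re)created.
    """
    updated = []
    for libdir in ("lib", "lib64"):
        directory = Path(prefix) / libdir
        if not (directory / patched_name).is_file():
            continue
        link = directory / soname
        # Remove a stale link or file first; is_symlink() also catches dangling links.
        if link.is_symlink() or link.exists():
            link.unlink()
        # Relative target, so the link stays valid if the prefix is relocated.
        link.symlink_to(patched_name)
        updated.append(link)
    return updated


def version_matches(fi_info_output, expected_version):
    """True when the expected version string appears in the captured output."""
    return expected_version in fi_info_output
```

In a build script, a false result from `version_matches()` would be turned into a non-zero exit status so that the image build stops, which is the build-time failure the commit message describes.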

sara4dev marked this pull request as ready for review May 5, 2026 06:51

a-szegel reviewed May 5, 2026

Comment thread prov/efa/src/efa_mr.c

prov/efa: use dmabuf for CUDA MR registration

475e538

Signed-off-by: Saravana Periyasamy <saperiyasamy@nvidia.com>

sara4dev force-pushed the fix-libfabric-bug-12019-in-v2.4.x-branch branch from a98352b to 475e538 on May 11, 2026 17:54

yifjiang mentioned this pull request May 19, 2026

feat(container): configurable libfabric repo + v2.5.1 overlay for EFA ai-dynamo/dynamo#9727

Merged

5 tasks

sara4dev wants to merge 1 commit into ofiwg:v2.4.x from sara4dev:fix-libfabric-bug-12019-in-v2.4.x-branch
