[v2.4.x] prov/efa: use dmabuf for CUDA MR registration #12216

Open
sara4dev wants to merge 1 commit into ofiwg:v2.4.x from sara4dev:fix-libfabric-bug-12019-in-v2.4.x-branch

Conversation

@sara4dev

@sara4dev sara4dev commented May 5, 2026

Summary

  • include CUDA in the EFA provider's implicit dmabuf MR registration path
  • use the existing ofi_hmem_get_dmabuf_fd() and ibv_reg_dmabuf_mr() flow for FI_HMEM_CUDA
  • remove the stale TODO that said CUDA still needed this fallback

Root Cause

CUDA HMEM registrations without FI_MR_DMABUF were not included in the EFA provider's dmabuf path, unlike Neuron and ROCr. That made CUDA memory fall through to plain ibv_reg_mr() with a GPU virtual address, which can fail with EFAULT.

References #12019.
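
The path selection described above can be modelled with a toy sketch. This is not the code in `prov/efa/src/efa_mr.c`: the `HmemIface` enum, the two sets, and `registration_path()` are hypothetical names used only for illustration, and the model ignores the explicit `FI_MR_DMABUF` case and any fallback handling.

```python
from enum import Enum, auto


class HmemIface(Enum):
    """Hypothetical stand-ins for the FI_HMEM_* interface types."""
    SYSTEM = auto()
    CUDA = auto()
    ROCR = auto()
    NEURON = auto()


# Per the PR description, Neuron and ROCr already took the implicit dmabuf
# path; the change adds CUDA to that set.
DMABUF_IFACES_BEFORE = frozenset({HmemIface.NEURON, HmemIface.ROCR})
DMABUF_IFACES_AFTER = DMABUF_IFACES_BEFORE | {HmemIface.CUDA}


def registration_path(iface, dmabuf_ifaces):
    """Return the name of the verbs call this simplified model would pick."""
    if iface in dmabuf_ifaces:
        # dmabuf path: obtain an fd for the buffer, then register that fd.
        return "ibv_reg_dmabuf_mr"
    # Plain path: register the virtual address directly.
    return "ibv_reg_mr"


print(registration_path(HmemIface.CUDA, DMABUF_IFACES_BEFORE))
print(registration_path(HmemIface.CUDA, DMABUF_IFACES_AFTER))
```

In this model, CUDA memory moves from the plain `ibv_reg_mr()` branch to the dmabuf branch, while system memory is unaffected.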

Validation

Tested on a GB200 cluster.

@sara4dev sara4dev marked this pull request as ready for review May 5, 2026 06:51
@a-szegel
Contributor

a-szegel commented May 5, 2026

Can you please sign your commit? `git commit --amend -s`

Comment thread prov/efa/src/efa_mr.c
@shijin-aws
Contributor

shijin-aws commented May 7, 2026

@sara4dev These fixes are already part of v2.5.x as part of c9e3c0c. Would you mind trying it, or would you prefer to stay on v2.4.x?

@sara4dev
Author

@shijin-aws - I see AWS EFA 1.48.0 still uses libfabric 2.4.0amzn3.0. Can we update libfabric alone to 2.5.x?

Also when can we expect AWS EFA to bundle libfabric 2.5.x?

Signed-off-by: Saravana Periyasamy <saperiyasamy@nvidia.com>
@sara4dev sara4dev force-pushed the fix-libfabric-bug-12019-in-v2.4.x-branch branch from a98352b to 475e538 Compare May 11, 2026 17:54
@sara4dev
Author

@a-szegel signed my commit now

yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 19, 2026
The `--make-efa` aws stage currently installs only the stock AWS EFA
userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric
(2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when
registering VRAM as an EFA memory region on GB200 hardware, fi_mr_reg
falls through to ibv_reg_mr() with a GPU virtual address and returns
EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend
becomes unusable for VRAM-to-VRAM KV transfers — the primary EFA use
case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR
ofiwg/libfabric#12216). Until that fix lands in the EFA installer's
default libfabric, every `--make-efa` image needs to build and overlay
the patched version.

This PR moves that overlay from a downstream post-process script (the
AWS team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1`
path, which we've been running on top of dynamo images for the v9-efa
fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

  - container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is
    (1.47.0 empirically works on cuda-dl-base; defensive 1.46.0 pin
    not strictly needed because our patched libfabric overwrites the
    stock binary anyway). Add new keys `dynamo.patched_libfabric_repo`
    and `dynamo.patched_libfabric_ref` with defaults `aws/libfabric`
    and `v2.3.1amzn4.0`. Add a comment noting that 1.48.0 is broken
    on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is
    unsatisfiable on Ubuntu's 50.x) — defensive guidance for future
    bumpers.

  - container/templates/args.Dockerfile: declare two new ARGs
    (gated by `{% if make_efa == true %}`).

  - container/templates/aws.Dockerfile: after the existing EFA installer
    RUN, add a second RUN that installs build deps, clones the patched
    libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`,
    installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME
    symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer
    can populate either; whichever ldconfig sees first wins), deletes
    stock `libfabric.so.1.30.*` binaries (defends against hardcoded
    RPATHs), runs `ldconfig`, and VALIDATES `fi_info --version` reports
    the patched libfabric. The build fails if validation fails — turns
    a deployment-time error into a build-time error.
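
The validation step in the last bullet can be sketched as a small check on the text that `fi_info --version` prints. This is an illustration, not the check in aws.Dockerfile: `reports_patched_libfabric()` is a hypothetical helper, and it assumes the output contains a line beginning with `libfabric:` followed by the library version.

```python
def reports_patched_libfabric(fi_info_output, expected_ref):
    """Return True if the `libfabric:` line mentions the expected version.

    `expected_ref` is a git ref such as "v2.3.1amzn4.0"; a leading "v" is
    dropped before comparing. The `libfabric:` line format is an assumption.
    """
    version = expected_ref[1:] if expected_ref.startswith("v") else expected_ref
    for line in fi_info_output.splitlines():
        if line.startswith("libfabric:") and version in line:
            return True
    return False


# Assumed sample of what a patched install might print.
sample = "fi_info: 2.3.1amzn4.0\nlibfabric: 2.3.1amzn4.0\nlibfabric api: 2.3\n"
print(reports_patched_libfabric(sample, "v2.3.1amzn4.0"))
```

A build script could pass in the captured stdout of `fi_info --version` and exit non-zero when the function returns `False`, which is what turns a deployment-time failure into a build-time one.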

Also: while editing aws.Dockerfile, fix the related ofi-nccl rm path
typo that's the subject of ai-dynamo#9703 (rm /opt/amazon/aws-ofi-nccl was a no-op
because the EFA installer puts the plugin at /opt/amazon/ofi-nccl/).
Doing it here in one commit since the surrounding RUN block changes
substantially anyway — happy to drop this and rebase if ai-dynamo#9703 lands first.

Non-EFA images are completely unchanged: the patched-libfabric build is
gated by `{% if make_efa == true %}` in args.Dockerfile and only renders
into the aws stage.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

  - `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2`
    (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
  - `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2`
    (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered
post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed
nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM
disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds
~3-5 min of build time and ~200-300 MB of image size (build deps left
installed for simplicity — separate PR can purge them if size is a
concern). Non-EFA paths unaffected.

The build deps added (autoconf, automake, libtool, make, build-essential,
pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev,
rdma-core, ca-certificates, git) are a superset of what's already pulled
in by the EFA installer's own `apt install`, so the marginal cost is
small. cuda-dl-base provides `/usr/local/cuda`, which the patched
libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the
`install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

  - ai-dynamo#9703 — `fix(container)`: ofi-nccl rm path (this PR includes the
            same fix; can rebase if needed)
  - ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag
  - ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
  - ai-dynamo#9706 — `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil
            dep work as a follow-up)
  - this PR — `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target
runtime --platform linux/amd64 --cuda-version 13.1 --make-efa
--has-trtllm-context` followed by `docker build ...` produces an
EFA-correct image with no post-process script needed. nixlbench (the
validation tool) is intentionally NOT included — it can live as a separate
`--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout
(separate builder stage to keep build deps out of the runtime, different
default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 19, 2026
The `--make-efa` aws stage currently installs only the stock AWS EFA
userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric
(2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when
registering VRAM as an EFA memory region on GB200 hardware, fi_mr_reg
falls through to ibv_reg_mr() with a GPU virtual address and returns
EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend
becomes unusable for VRAM-to-VRAM KV transfers — the primary EFA use
case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR
ofiwg/libfabric#12216). Until that fix lands in the EFA installer's
default libfabric, every `--make-efa` image needs to build and overlay
the patched version.

This PR moves that overlay from a downstream post-process script (the
AWS team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1`
path, which we've been running on top of dynamo images for the v9-efa
fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

  - container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is
    (1.47.0 empirically works on cuda-dl-base; our patched libfabric
    overwrites the stock binary regardless of underlying EFA installer
    version, so a defensive 1.46.0 pin is not strictly needed). Add
    new keys `dynamo.patched_libfabric_repo` and
    `dynamo.patched_libfabric_ref` with defaults `aws/libfabric` and
    `v2.3.1amzn4.0`. Add a defensive comment about 1.48.0 being broken
    on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is
    unsatisfiable on Ubuntu's 50.x) so future bumpers don't trip over it.

  - container/templates/args.Dockerfile: declare two new ARGs
    (gated by `{% if make_efa == true %}`).

  - container/templates/aws.Dockerfile: after the existing EFA installer
    RUN, add a second RUN that installs build deps, clones the patched
    libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`,
    installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME
    symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer
    can populate either; whichever ldconfig sees first wins), deletes
    stock `libfabric.so.1.30.*` binaries (defends against hardcoded
    RPATHs), runs `ldconfig`, and VALIDATES `fi_info --version` reports
    the patched libfabric. The build fails if validation fails — turns
    a deployment-time error into a build-time error.
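
The "force the SONAME symlink in both `lib` and `lib64`" step can be sketched as follows. This is a minimal illustration under stated assumptions, not the RUN block in aws.Dockerfile: `force_soname_symlinks()` is a hypothetical helper, the patched library's file name in the usage note is made up, and the sketch leaves out deleting the stock binaries and running `ldconfig`.

```python
from pathlib import Path


def force_soname_symlinks(prefix, patched_lib, soname="libfabric.so.1"):
    """Point <prefix>/lib/<soname> and <prefix>/lib64/<soname> at patched_lib.

    Directories that do not exist are skipped. Returns the links written.
    """
    updated = []
    for libdir in (Path(prefix) / "lib", Path(prefix) / "lib64"):
        if not libdir.is_dir():
            continue
        link = libdir / soname
        # Remove whatever currently holds the SONAME (stock file or old link).
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(patched_lib)
        updated.append(link)
    return updated
```

For example, `force_soname_symlinks("/opt/amazon/efa", "/opt/amazon/efa/lib/libfabric.so.1.99.0")` would repoint both SONAME entries at that (hypothetical) file, so the loader resolves `libfabric.so.1` to the patched build whichever directory it searches first.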

Non-EFA images are completely unchanged: the patched-libfabric build is
gated by `{% if make_efa == true %}` in args.Dockerfile and only renders
into the aws stage.

The related ofi-nccl rm path fix (Gap B in our investigation; the
existing `rm -rf /opt/amazon/aws-ofi-nccl` targets a path the EFA
installer doesn't create) is intentionally NOT in this PR — it's
owned by ai-dynamo#9703. If ai-dynamo#9703 lands after this PR, rebase ai-dynamo#9703 onto main;
if before, this PR will need a trivial conflict resolution on the
existing rm line.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

  - `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2`
    (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
  - `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2`
    (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered
post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed
nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM
disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds
~3-5 min of build time and ~200-300 MB of image size (build deps left
installed for simplicity — separate PR can purge them if size is a
concern). Non-EFA paths unaffected.

The build deps added (autoconf, automake, libtool, make, build-essential,
pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev,
rdma-core, ca-certificates, git) are a superset of what's already pulled
in by the EFA installer's own `apt install`, so the marginal cost is
small. cuda-dl-base provides `/usr/local/cuda`, which the patched
libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the
`install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

  - ai-dynamo#9703 — `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
  - ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag
  - ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
  - ai-dynamo#9706 — `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil
            dep work as a follow-up)
  - this PR — `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target
runtime --platform linux/amd64 --cuda-version 13.1 --make-efa
--has-trtllm-context` followed by `docker build ...` produces an
EFA-correct image with no post-process script needed. nixlbench (the
validation tool) is intentionally NOT included — it can live as a separate
`--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout
(separate builder stage to keep build deps out of the runtime, different
default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 20, 2026
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 20, 2026
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 21, 2026
The `--make-efa` aws stage currently installs only the stock AWS EFA
userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric
(2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when
registering VRAM as an EFA memory region on GB200 hardware, fi_mr_reg
falls through to ibv_reg_mr() with a GPU virtual address and returns
EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend
becomes unusable for VRAM-to-VRAM KV transfers — the primary EFA use
case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR
ofiwg/libfabric#12216). Until that fix lands in the EFA installer's
default libfabric, every `--make-efa` image needs to build and overlay
the patched version.

This PR moves that overlay from a downstream post-process script (the
AWS team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1`
path, which we've been running on top of dynamo images for the v9-efa
fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

  - container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is
    (1.47.0 empirically works on cuda-dl-base; our patched libfabric
    overwrites the stock binary regardless of underlying EFA installer
    version, so a defensive 1.46.0 pin is not strictly needed). Add
    new keys `dynamo.patched_libfabric_repo` and
    `dynamo.patched_libfabric_ref` with defaults `aws/libfabric` and
    `v2.3.1amzn4.0`. Add a defensive comment about 1.48.0 being broken
    on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is
    unsatisfiable on Ubuntu's 50.x) so future bumpers don't trip over it.

  - container/templates/args.Dockerfile: declare two new ARGs
    (gated by `{% if make_efa == true %}`).

  - container/templates/aws.Dockerfile: after the existing EFA installer
    RUN, add a second RUN that installs build deps, clones the patched
    libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`,
    installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME
    symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer
    can populate either; whichever ldconfig sees first wins), deletes
    stock `libfabric.so.1.30.*` binaries (defends against hardcoded
    RPATHs), runs `ldconfig`, and VALIDATES that `fi_info --version` reports
    the patched libfabric. The build fails if validation fails, which turns
    a deployment-time error into a build-time error.
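
The symlink and validation steps in the last bullet can be sketched in
Python as follows. This only illustrates the logic; it is not the content
of aws.Dockerfile, which performs these steps in a shell RUN. The function
names, the patched file name passed in by the caller, and the assumed
`libfabric: <version>` line in `fi_info --version` output are assumptions
made for the example.

```python
import re
from pathlib import Path
from typing import List, Optional

# Default taken from `dynamo.patched_libfabric_ref` described above.
EXPECTED_REF = "v2.3.1amzn4.0"


def reported_libfabric_version(fi_info_output: str) -> Optional[str]:
    """Return the version on the `libfabric:` line, or None if absent.

    Assumes `fi_info --version` prints a line of the form
    `libfabric: <version>`; adjust the pattern if the format differs.
    """
    match = re.search(r"^libfabric:\s*(\S+)", fi_info_output, re.MULTILINE)
    return match.group(1) if match else None


def is_patched(fi_info_output: str, expected_ref: str = EXPECTED_REF) -> bool:
    """True only if the reported version equals the ref minus its leading 'v'."""
    version = reported_libfabric_version(fi_info_output)
    return version is not None and version == expected_ref.removeprefix("v")


def force_soname_symlinks(prefix: Path, patched_name: str) -> List[Path]:
    """Point libfabric.so.1 at the patched binary in both lib and lib64.

    `patched_name` is a hypothetical file name such as the one produced by
    `make install`; directories that lack the patched binary are skipped.
    """
    updated = []
    for libdir in (prefix / "lib", prefix / "lib64"):
        if not (libdir / patched_name).exists():
            continue
        # Delete stock binaries so a hardcoded RPATH cannot resolve to them.
        for stock in list(libdir.glob("libfabric.so.1.30.*")):
            if stock.name != patched_name:
                stock.unlink()
        link = libdir / "libfabric.so.1"
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(patched_name)  # relative link inside the same dir
        updated.append(link)
    return updated
```

In a build step, a false result from `is_patched` would be turned into a
non-zero exit status so that the image build stops, which is the
build-time failure behaviour described above.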

Non-EFA images are completely unchanged: the patched-libfabric build is
gated by `{% if make_efa == true %}` in args.Dockerfile and only renders
into the aws stage.

The related ofi-nccl rm path fix (Gap B in our investigation; the
existing `rm -rf /opt/amazon/aws-ofi-nccl` targets a path the EFA
installer doesn't create) is intentionally NOT in this PR — it's
owned by ai-dynamo#9703. If ai-dynamo#9703 lands after this PR, rebase ai-dynamo#9703 onto main;
if before, this PR will need a trivial conflict resolution on the
existing rm line.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

  - `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2`
    (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
  - `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2`
    (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered
post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed
nixlbench with the LIBFABRIC backend at the expected single-rail rate of
~47 GB/s, and live TRT-LLM disagg with Qwen3-8B (BF16, TP=4) served 5/5
stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds
~3-5 min of build time and ~200-300 MB of image size (build deps are left
installed for simplicity; a separate PR can purge them if size is a
concern). Non-EFA paths are unaffected.

The build deps added (autoconf, automake, libtool, make, build-essential,
pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev,
rdma-core, ca-certificates, git) are a superset of what's already pulled
in by the EFA installer's own `apt install`, so the marginal cost is
small. cuda-dl-base provides `/usr/local/cuda`, which the patched
libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the
`install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

  - ai-dynamo#9703 — `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
  - ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag
  - ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
  - ai-dynamo#9706 — `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil
            dep work as a follow-up)
  - this PR — `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target
runtime --platform linux/amd64 --cuda-version 13.1 --make-efa
--has-trtllm-context` followed by `docker build ...` produces an
EFA-correct image with no post-process script needed. nixlbench (the
validation tool) is intentionally NOT included — it can live as a separate
`--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout
(separate builder stage to keep build deps out of the runtime, different
default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>