[v2.4.x] prov/efa: use dmabuf for CUDA MR registration #12216
Open
sara4dev wants to merge 1 commit into
Contributor
Can you please sign your commit?
a-szegel
reviewed
May 5, 2026
Contributor
Author
@shijin-aws - I see AWS EFA Also when can we expect AWS EFA to bundle libfabric 2.5.x?
Signed-off-by: Saravana Periyasamy <saperiyasamy@nvidia.com>
a98352b to 475e538
Author
@a-szegel signed my commit now
yifjiang
added a commit
to yifjiang/dynamo
that referenced
this pull request
May 19, 2026
The `--make-efa` aws stage currently installs only the stock AWS EFA userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric (2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when registering VRAM as an EFA memory region on GB200 hardware, fi_mr_reg falls through to ibv_reg_mr() with a GPU virtual address and returns EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend becomes unusable for VRAM-to-VRAM KV transfers, which is the primary EFA use case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR ofiwg/libfabric#12216). Until that lands in the EFA installer's default libfabric, every `--make-efa` image needs to build and overlay the patched version. This PR moves that overlay from a downstream post-process script (the AWS-team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1` path, which we've been running on top of dynamo images for the v9-efa fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

- container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is (1.47.0 empirically works on cuda-dl-base; defensive 1.46.0 pin not strictly needed because our patched libfabric overwrites the stock binary anyway). Add new keys `dynamo.patched_libfabric_repo` and `dynamo.patched_libfabric_ref` with defaults `aws/libfabric` and `v2.3.1amzn4.0`. Add a comment noting that 1.48.0 is broken on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is unsatisfiable on Ubuntu's 50.x) as defensive guidance for future bumpers.
- container/templates/args.Dockerfile: declare two new ARGs (gated by `{% if make_efa == true %}`).
- container/templates/aws.Dockerfile: after the existing EFA installer RUN, add a second RUN that installs build deps, clones the patched libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`, installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer can populate either; whichever ldconfig sees first wins), deletes stock `libfabric.so.1.30.*` binaries (defends against hardcoded RPATHs), runs `ldconfig`, and VALIDATES `fi_info --version` reports the patched libfabric. The build fails if validation fails, which turns a deployment-time error into a build-time error.

Also: while editing aws.Dockerfile, fix the related ofi-nccl rm path typo that's the subject of ai-dynamo#9703 (rm /opt/amazon/aws-ofi-nccl was a no-op because the EFA installer puts the plugin at /opt/amazon/ofi-nccl/). Doing it here in one commit since the surrounding RUN block changes substantially anyway; happy to drop this and rebase if ai-dynamo#9703 lands first.

Non-EFA images are completely unchanged: the patched-libfabric build is gated by `{% if make_efa == true %}` in args.Dockerfile and only renders into the aws stage.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

- `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2` (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
- `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2` (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds ~3-5 min of build time and ~200-300 MB of image size (build deps left installed for simplicity; a separate PR can purge them if size is a concern). Non-EFA paths unaffected. The build deps added (autoconf, automake, libtool, make, build-essential, pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev, rdma-core, ca-certificates, git) are a superset of what's already pulled in by the EFA installer's own `apt install`, so the marginal cost is small. cuda-dl-base provides `/usr/local/cuda`, which the patched libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the `install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

- ai-dynamo#9703 - `fix(container)`: ofi-nccl rm path (this PR includes the same fix; can rebase if needed)
- ai-dynamo#9704 - `feat(container)`: render.py --has-trtllm-context flag
- ai-dynamo#9705 - `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
- ai-dynamo#9706 - `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil dep work as a follow-up)
- this PR - `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target runtime --platform linux/amd64 --cuda-version 13.1 --make-efa --has-trtllm-context` followed by `docker build ...` produces an EFA-correct image with no post-process script needed. nixlbench (the validation tool) is intentionally NOT included; it can live as a separate `--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout (separate builder stage to keep build deps out of the runtime, different default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
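The build-time validation mentioned in the commit message (checking that `fi_info --version` reports the patched libfabric) might look roughly like the Python sketch below. It is only an illustration, not the check the PR actually adds. It assumes that the output contains a line of the form `libfabric: <version>` and that the tag `v2.3.1amzn4.0` reports itself as `2.3.1amzn4.0`. The function names and the constant are hypothetical.

```python
import re
from typing import Optional

# Assumption: the patched tag v2.3.1amzn4.0 reports this version string.
EXPECTED_VERSION = "2.3.1amzn4.0"


def reported_libfabric_version(fi_info_output: str) -> Optional[str]:
    """Return the version from the first 'libfabric: <version>' line, if any."""
    match = re.search(r"^libfabric:\s*(\S+)\s*$", fi_info_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def validate_patched_libfabric(fi_info_output: str,
                               expected: str = EXPECTED_VERSION) -> None:
    """Raise RuntimeError unless the reported version equals the expected one."""
    found = reported_libfabric_version(fi_info_output)
    if found != expected:
        raise RuntimeError(
            f"expected libfabric {expected}, fi_info reported {found!r}"
        )
```

In a build step the text would come from running `fi_info --version` (for example via `subprocess.run(..., capture_output=True, text=True)`), and the raised exception would give a non-zero exit status so that the image build stops.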
yifjiang
added a commit
to yifjiang/dynamo
that referenced
this pull request
May 19, 2026
The `--make-efa` aws stage currently installs only the stock AWS EFA userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric (2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when registering VRAM as an EFA memory region on GB200 hardware, fi_mr_reg falls through to ibv_reg_mr() with a GPU virtual address and returns EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend becomes unusable for VRAM-to-VRAM KV transfers, which is the primary EFA use case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR ofiwg/libfabric#12216). Until that lands in the EFA installer's default libfabric, every `--make-efa` image needs to build and overlay the patched version. This PR moves that overlay from a downstream post-process script (the AWS-team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1` path, which we've been running on top of dynamo images for the v9-efa fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

- container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is (1.47.0 empirically works on cuda-dl-base; our patched libfabric overwrites the stock binary regardless of underlying EFA installer version, so a defensive 1.46.0 pin is not strictly needed). Add new keys `dynamo.patched_libfabric_repo` and `dynamo.patched_libfabric_ref` with defaults `aws/libfabric` and `v2.3.1amzn4.0`. Add a defensive comment about 1.48.0 being broken on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is unsatisfiable on Ubuntu's 50.x) so future bumpers don't trip over it.
- container/templates/args.Dockerfile: declare two new ARGs (gated by `{% if make_efa == true %}`).
- container/templates/aws.Dockerfile: after the existing EFA installer RUN, add a second RUN that installs build deps, clones the patched libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`, installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer can populate either; whichever ldconfig sees first wins), deletes stock `libfabric.so.1.30.*` binaries (defends against hardcoded RPATHs), runs `ldconfig`, and VALIDATES `fi_info --version` reports the patched libfabric. The build fails if validation fails, which turns a deployment-time error into a build-time error.

Non-EFA images are completely unchanged: the patched-libfabric build is gated by `{% if make_efa == true %}` in args.Dockerfile and only renders into the aws stage.

The related ofi-nccl rm path fix (Gap B in our investigation; the existing `rm -rf /opt/amazon/aws-ofi-nccl` targets a path the EFA installer doesn't create) is intentionally NOT in this PR; it's owned by ai-dynamo#9703. If ai-dynamo#9703 lands after this PR, rebase ai-dynamo#9703 onto main; if before, this PR will need a trivial conflict resolution on the existing rm line.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

- `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2` (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
- `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2` (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds ~3-5 min of build time and ~200-300 MB of image size (build deps left installed for simplicity; a separate PR can purge them if size is a concern). Non-EFA paths unaffected. The build deps added (autoconf, automake, libtool, make, build-essential, pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev, rdma-core, ca-certificates, git) are a superset of what's already pulled in by the EFA installer's own `apt install`, so the marginal cost is small. cuda-dl-base provides `/usr/local/cuda`, which the patched libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the `install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

- ai-dynamo#9703 - `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
- ai-dynamo#9704 - `feat(container)`: render.py --has-trtllm-context flag
- ai-dynamo#9705 - `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
- ai-dynamo#9706 - `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil dep work as a follow-up)
- this PR - `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target runtime --platform linux/amd64 --cuda-version 13.1 --make-efa --has-trtllm-context` followed by `docker build ...` produces an EFA-correct image with no post-process script needed. nixlbench (the validation tool) is intentionally NOT included; it can live as a separate `--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout (separate builder stage to keep build deps out of the runtime, different default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
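The commit message describes forcing the `libfabric.so.1` symlink onto the patched binary in both `lib` and `lib64`. That step can be pictured with the small Python sketch below. The real Dockerfile presumably does this with shell commands. The function name, its parameters, and the choice to link both directories to one absolute path are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def force_soname_symlink(prefix: Path, patched_lib: Path,
                         soname: str = "libfabric.so.1") -> List[Path]:
    """Repoint <prefix>/lib/<soname> and <prefix>/lib64/<soname> at patched_lib.

    Directories that do not exist are skipped. Returns the links that were written.
    """
    written = []
    for libdir in (prefix / "lib", prefix / "lib64"):
        if not libdir.is_dir():
            continue
        link = libdir / soname
        # Drop whatever is there now, including a dangling symlink.
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(patched_lib)
        written.append(link)
    return written
```

Rewriting the link in both directories means the result does not depend on which one the dynamic linker cache lists first. The cache still has to be refreshed afterwards, which the commit message does by running `ldconfig`.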
yifjiang
added a commit
to yifjiang/dynamo
that referenced
this pull request
May 20, 2026
yifjiang
added a commit
to yifjiang/dynamo
that referenced
this pull request
May 21, 2026
Summary
- `ofi_hmem_get_dmabuf_fd()` and `ibv_reg_dmabuf_mr()` flow for `FI_HMEM_CUDA`

Root Cause
CUDA HMEM registrations without `FI_MR_DMABUF` were not included in the EFA provider's dmabuf path, unlike Neuron and ROCr. That made CUDA memory fall through to plain `ibv_reg_mr()` with a GPU virtual address, which can fail with `EFAULT`.

References #12019.
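The root cause can be pictured with a toy Python model of how a registration path is selected. It is not the EFA provider's code. The enum, the two sets, and `registration_call` are hypothetical names, and the model ignores details such as obtaining the file descriptor through `ofi_hmem_get_dmabuf_fd()` or any fallback behaviour.

```python
from enum import Enum


class HmemIface(Enum):
    SYSTEM = "FI_HMEM_SYSTEM"
    CUDA = "FI_HMEM_CUDA"
    NEURON = "FI_HMEM_NEURON"
    ROCR = "FI_HMEM_ROCR"


# Interfaces routed through dmabuf even when the caller did not set FI_MR_DMABUF.
DMABUF_IFACES_BEFORE = frozenset({HmemIface.NEURON, HmemIface.ROCR})
DMABUF_IFACES_AFTER = DMABUF_IFACES_BEFORE | {HmemIface.CUDA}


def registration_call(iface: HmemIface, fi_mr_dmabuf: bool,
                      dmabuf_ifaces: frozenset) -> str:
    """Name the verbs call this toy model would pick for one registration."""
    if fi_mr_dmabuf or iface in dmabuf_ifaces:
        return "ibv_reg_dmabuf_mr"
    return "ibv_reg_mr"
```

Under this model, `registration_call(HmemIface.CUDA, False, DMABUF_IFACES_BEFORE)` selects `ibv_reg_mr`, which is the failing path described above, while the same call with `DMABUF_IFACES_AFTER` selects `ibv_reg_dmabuf_mr`.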
Validation
Tested in a GB200 cluster