build(container): bump nixl_gdrcopy_ref to v2.5.2 (kernel >=6.15 fix) #9705
The `--make-efa` aws stage currently installs only the stock AWS EFA userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric (2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when registering VRAM as an EFA memory region on GB200 hardware, `fi_mr_reg` falls through to `ibv_reg_mr()` with a GPU virtual address and returns EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend becomes unusable for VRAM-to-VRAM KV transfers, the primary EFA use case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR ofiwg/libfabric#12216). Until that lands in the EFA installer's default libfabric, every `--make-efa` image needs to build and overlay the patched version.

This PR moves that overlay from a downstream post-process script (the AWS team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1` path, which we've been running on top of dynamo images for the v9-efa fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

- container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is (1.47.0 empirically works on cuda-dl-base; defensive 1.46.0 pin not strictly needed because our patched libfabric overwrites the stock binary anyway). Add new keys `dynamo.patched_libfabric_repo` and `dynamo.patched_libfabric_ref` with defaults `aws/libfabric` and `v2.3.1amzn4.0`. Add a comment noting that 1.48.0 is broken on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is unsatisfiable on Ubuntu's 50.x) as defensive guidance for future bumpers.
- container/templates/args.Dockerfile: declare two new ARGs (gated by `{% if make_efa == true %}`).
- container/templates/aws.Dockerfile: after the existing EFA installer RUN, add a second RUN that installs build deps, clones the patched libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`, installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer can populate either; whichever ldconfig sees first wins), deletes stock `libfabric.so.1.30.*` binaries (defends against hardcoded RPATHs), runs `ldconfig`, and VALIDATES that `fi_info --version` reports the patched libfabric. The build fails if validation fails, which turns a deployment-time error into a build-time error.

Also: while editing aws.Dockerfile, fix the related ofi-nccl rm path typo that's the subject of ai-dynamo#9703 (`rm /opt/amazon/aws-ofi-nccl` was a no-op because the EFA installer puts the plugin at `/opt/amazon/ofi-nccl/`). Doing it here in one commit since the surrounding RUN block changes substantially anyway; happy to drop this and rebase if ai-dynamo#9703 lands first.

Non-EFA images are completely unchanged: the patched-libfabric build is gated by `{% if make_efa == true %}` in args.Dockerfile and only renders into the aws stage.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

- `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2` (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
- `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2` (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds ~3-5 min of build time and ~200-300 MB of image size (build deps left installed for simplicity; a separate PR can purge them if size is a concern). Non-EFA paths unaffected.

The build deps added (autoconf, automake, libtool, make, build-essential, pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev, rdma-core, ca-certificates, git) are a superset of what's already pulled in by the EFA installer's own `apt install`, so the marginal cost is small. cuda-dl-base provides `/usr/local/cuda`, which the patched libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the `install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

- ai-dynamo#9703, `fix(container)`: ofi-nccl rm path (this PR includes the same fix; can rebase if needed)
- ai-dynamo#9704, `feat(container)`: render.py --has-trtllm-context flag
- ai-dynamo#9705, `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
- ai-dynamo#9706, `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil dep work as a follow-up)
- this PR, `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target runtime --platform linux/amd64 --cuda-version 13.1 --make-efa --has-trtllm-context` followed by `docker build ...` produces an EFA-correct image with no post-process script needed. nixlbench (the validation tool) is intentionally NOT included: it can live as a separate `--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout (separate builder stage to keep build deps out of the runtime, different default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
The `--make-efa` aws stage currently installs only the stock AWS EFA userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric (2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when registering VRAM as an EFA memory region on GB200 hardware, `fi_mr_reg` falls through to `ibv_reg_mr()` with a GPU virtual address and returns EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend becomes unusable for VRAM-to-VRAM KV transfers, the primary EFA use case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR ofiwg/libfabric#12216). Until that lands in the EFA installer's default libfabric, every `--make-efa` image needs to build and overlay the patched version.

This PR moves that overlay from a downstream post-process script (the AWS team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1` path, which we've been running on top of dynamo images for the v9-efa fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

- container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is (1.47.0 empirically works on cuda-dl-base; our patched libfabric overwrites the stock binary regardless of underlying EFA installer version, so a defensive 1.46.0 pin is not strictly needed). Add new keys `dynamo.patched_libfabric_repo` and `dynamo.patched_libfabric_ref` with defaults `aws/libfabric` and `v2.3.1amzn4.0`. Add a defensive comment about 1.48.0 being broken on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is unsatisfiable on Ubuntu's 50.x) so future bumpers don't trip over it.
- container/templates/args.Dockerfile: declare two new ARGs (gated by `{% if make_efa == true %}`).
- container/templates/aws.Dockerfile: after the existing EFA installer RUN, add a second RUN that installs build deps, clones the patched libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`, installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer can populate either; whichever ldconfig sees first wins), deletes stock `libfabric.so.1.30.*` binaries (defends against hardcoded RPATHs), runs `ldconfig`, and VALIDATES that `fi_info --version` reports the patched libfabric. The build fails if validation fails, which turns a deployment-time error into a build-time error.

Non-EFA images are completely unchanged: the patched-libfabric build is gated by `{% if make_efa == true %}` in args.Dockerfile and only renders into the aws stage.

The related ofi-nccl rm path fix (Gap B in our investigation; the existing `rm -rf /opt/amazon/aws-ofi-nccl` targets a path the EFA installer doesn't create) is intentionally NOT in this PR; it's owned by ai-dynamo#9703. If ai-dynamo#9703 lands after this PR, rebase ai-dynamo#9703 onto main; if before, this PR will need a trivial conflict resolution on the existing rm line.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

- `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2` (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
- `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2` (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds ~3-5 min of build time and ~200-300 MB of image size (build deps left installed for simplicity; a separate PR can purge them if size is a concern). Non-EFA paths unaffected.

The build deps added (autoconf, automake, libtool, make, build-essential, pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev, rdma-core, ca-certificates, git) are a superset of what's already pulled in by the EFA installer's own `apt install`, so the marginal cost is small. cuda-dl-base provides `/usr/local/cuda`, which the patched libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the `install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

- ai-dynamo#9703, `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
- ai-dynamo#9704, `feat(container)`: render.py --has-trtllm-context flag
- ai-dynamo#9705, `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
- ai-dynamo#9706, `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil dep work as a follow-up)
- this PR, `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target runtime --platform linux/amd64 --cuda-version 13.1 --make-efa --has-trtllm-context` followed by `docker build ...` produces an EFA-correct image with no post-process script needed. nixlbench (the validation tool) is intentionally NOT included: it can live as a separate `--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout (separate builder stage to keep build deps out of the runtime, different default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
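The SONAME overlay step described for aws.Dockerfile can be sketched in Python against a throwaway directory tree. This is only an illustration of the idea (remove the stock binaries, then point `libfabric.so.1` at the patched binary in both `lib` and `lib64`). The function name `overlay_soname` and the patched file name `libfabric.so.1.99.0` are hypothetical, and the real RUN block presumably does this in shell.

```python
import os
import tempfile
from pathlib import Path


def overlay_soname(prefix: Path, patched: Path,
                   soname: str = "libfabric.so.1",
                   stock_glob: str = "libfabric.so.1.30.*") -> list[Path]:
    """Point `soname` at `patched` in each lib dir under `prefix`.

    Returns the symlinks that were (re)created.
    """
    links = []
    for libdir in ("lib", "lib64"):
        directory = prefix / libdir
        if not directory.is_dir():
            continue  # the installer may populate only one of the two
        # Delete stock binaries so nothing can load them by exact file name.
        for stock in directory.glob(stock_glob):
            if stock != patched:
                stock.unlink()
        link = directory / soname
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(patched)
        links.append(link)
    return links


def demo() -> dict[str, str]:
    """Build a fake install tree and return {libdir: resolved target name}."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp)
        for libdir in ("lib", "lib64"):
            (prefix / libdir).mkdir()
            (prefix / libdir / "libfabric.so.1.30.0").write_text("stock")
            (prefix / libdir / "libfabric.so.1").symlink_to("libfabric.so.1.30.0")
        # Hypothetical file name for the patched build output.
        patched = prefix / "lib" / "libfabric.so.1.99.0"
        patched.write_text("patched")
        overlay_soname(prefix, patched)
        return {
            d: os.path.basename(os.path.realpath(prefix / d / "libfabric.so.1"))
            for d in ("lib", "lib64")
        }
```

Calling `demo()` shows both `lib` and `lib64` resolving to the same patched file, which is the property the build-time `fi_info --version` validation is meant to guard.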
Bumps `nixl_ref: 0.10.1 → v1.1.0` in all four occurrences in container/context.yaml (one in `dynamo:` common section, one each in the `vllm:`, `sglang:`, `trtllm:` framework sections).

NIXL v1.1.0 brings four fixes that matter on AWS p6e-gb200 EFA fabric:

- ai-dynamo#1461 NUMA-aware EFA rail selection: without this, the LIBFABRIC backend assigns all initiator GPUs to a single rail and caps aggregate bandwidth at ~1.79 GB/s on p6e-gb200 instead of the expected ~190 GB/s with 4 GPUs + 4 EFA NICs per pod.
- ai-dynamo#1510 Active rail tracking for multi-rail concurrent transfers.
- ai-dynamo#1506 Multi-GPU memory-region fix (relevant for TP > 1 workers registering large VRAM blocks across GPUs).
- ai-dynamo#1433 Transfer handle repost notification fix.

These performance gains were validated by yutwu (NVIDIA teammate) during the GLM-5.1 v7→v7.5 image bump on AWS p6e-gb200. For dynamo-trtllm disagg over EFA, NIXL 0.10.1's rail policy is the difference between "looks correct, caps at 1.79 GB/s" and "190 GB/s aggregate".

### Note on the version-tag format

`ai-dynamo/nixl` tags are mixed:

- Older releases: `0.1.1`, `0.10.0`, `0.10.1` (no `v` prefix)
- Newer releases: `v1.1.0` (with `v` prefix)

This bump uses `v1.1.0` to match the upstream tag. Dynamo's wheel_builder.Dockerfile clones via `git checkout ${NIXL_REF}` so the ref must be exactly the tag name. Verified the `v1.1.0` tag exists at https://github.com/ai-dynamo/nixl/releases/tag/v1.1.0.

### Risk

MEDIUM. This is a major version bump (0.x → 1.x). Specifically:

- The C++ `libnixl.so` ABI is the primary compat concern. NIXL 1.1.0 introduced Abseil >= 20240116 as a build dependency (VLOG/absl_log). Dynamo's wheel_builder uses Ubuntu 24.04 system Abseil 20220623, which lacks these symbols. The EFA patch script in our image-build path (yutwu's `install_efa_libfabric_nixl.sh`) builds a newer Abseil from source as a workaround; dynamo's wheel_builder needs the same. Either:
  - (a) Add `libabsl-dev` >= 20240116 to wheel_builder.Dockerfile's apt install list (Ubuntu 24.04.4 LTS or backports may have it).
  - (b) Build Abseil from source in wheel_builder.Dockerfile similar to how we build libfabric.
  - (c) Use the system Abseil and patch NIXL 1.1.0 to not require absl_log (non-starter, upstream change).

  This PR does NOT include the Abseil bump; it's only the version pin. Maintainers will need to validate the build and add the Abseil dep in a coordinated change (or a follow-up PR I can write once reviewers confirm option a/b).
- dynamo's Python NIXL bindings (`nixl-cu13` wheel) may have API changes between 0.10.1 and 1.1.0. Dynamo's serving code that imports `nixl` needs to be verified compatible. A quick smoke (worker init + KV transfer) is sufficient.
- dynamo's plugin loader behavior may differ. The fix2 image-build cycle hit a related issue where dynamo's NIXL 0.10.1 plugin's Abseil ABI conflicted with NIXL 1.1.0's plugin in the SAME image (the `install_efa_libfabric_nixl.sh` post-process layered NIXL 1.1.0 nixlbench on top of dynamo's bundled 0.10.1). With this bump, dynamo's serving plugin AND nixlbench's plugin both use NIXL 1.1.0, so that conflict goes away.

Marking this PR as **draft** until:

1. Maintainer agrees on the Abseil dep approach (a/b/c above).
2. `trtllm-pipeline` / `vllm-pipeline` / `sglang-pipeline` CI runs pass with the bump.
3. A quick smoke of dynamo serving with the rebuilt NIXL confirms no API break.

I can iterate as needed.

### References

- NIXL v1.1.0 release: https://github.com/ai-dynamo/nixl/releases/tag/v1.1.0
- Companion PRs: ai-dynamo#9703 (template fixes), ai-dynamo#9704 (render --has-trtllm-context), ai-dynamo#9705 (gdrcopy v2.5.2 bump)

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
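The Abseil version constraint in the Risk section can be expressed as a small check. This is a sketch only: it assumes Abseil LTS versions always begin with an eight-digit `YYYYMMDD` date (as in `20220623` and `20240116` above), and the helper names `abseil_lts_date` and `abseil_is_new_enough` are hypothetical, not part of NIXL or dynamo.

```python
MIN_ABSEIL_LTS = 20240116  # minimum Abseil stated for NIXL 1.1.0 in the text above


def abseil_lts_date(version: str) -> int:
    """Return the leading YYYYMMDD date of an Abseil LTS version string.

    Accepts plain dates ("20220623") as well as suffixed forms
    ("20240116.2"), using only the part before the first dot.
    """
    head = version.split(".", 1)[0]
    if len(head) != 8 or not head.isdigit():
        raise ValueError(f"not an Abseil LTS version: {version!r}")
    return int(head)


def abseil_is_new_enough(version: str, minimum: int = MIN_ABSEIL_LTS) -> bool:
    """True if `version` is at or above the required LTS date."""
    return abseil_lts_date(version) >= minimum
```

With this check, the system Abseil `20220623` is rejected and anything from `20240116` onward is accepted, which is why options (a) and (b) both amount to getting a newer Abseil into wheel_builder.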
Bumps `nixl_ref: 0.10.1 → v1.1.0` in all four occurrences in container/context.yaml (one in `dynamo:` common section, one each in the `vllm:`, `sglang:`, `trtllm:` framework sections), and adds the Abseil source-build prereq that NIXL >= 1.0.0 requires.

### Why bump NIXL

NIXL v1.1.0 brings four fixes that matter on AWS p6e-gb200 EFA fabric:

- ai-dynamo#1461 NUMA-aware EFA rail selection: without this, the LIBFABRIC backend assigns all initiator GPUs to a single rail and caps aggregate bandwidth at ~1.79 GB/s on p6e-gb200 instead of the expected ~190 GB/s with 4 GPUs + 4 EFA NICs per pod.
- ai-dynamo#1510 Active rail tracking for multi-rail concurrent transfers.
- ai-dynamo#1506 Multi-GPU memory-region fix (relevant for TP > 1 workers registering large VRAM blocks across GPUs).
- ai-dynamo#1433 Transfer handle repost notification fix.

These gains were validated by yutwu (NVIDIA teammate) during the GLM-5.1 v7→v7.5 image bump on AWS p6e-gb200. For dynamo-trtllm disagg over EFA, NIXL 0.10.1's rail policy is the difference between "looks correct, caps at 1.79 GB/s" and "190 GB/s aggregate".

### The Abseil prereq

NIXL >= 1.0.0 uses VLOG(1)/DVLOG(2) in nixl_log.h, which require Abseil >= 20240116. NIXL's meson.build first searches for system Abseil via pkg-config:

    absl_base_dep = dependency('absl_base', required: false)
    absl_log_dep = dependency('absl_log', required: false)
    ...
    if absl_base_dep.found() and not absl_log_dep.found()
      error('Your Abseil version is too old: found absl_base but missing support for absl_log. Cannot fallback to subproject because that would result in a mix of Abseil versions at runtime.')

If pkg-config finds an old Abseil (absl_base present but no absl_log), NIXL HARD-ERRORS at configure time. Subproject fallback only triggers when NO system Abseil is found.

This trips two ways for dynamo:

1. **wheel_builder (AlmaLinux 8 / manylinux_2_28)** typically lacks system Abseil, so subproject fallback would work, BUT the subproject builds shared libs with SONAMEs like libabsl_*.so.20250814.
2. **runtime images (cuda-dl-base Ubuntu 24.04)** ship stock libabsl-dev 20220623 (libabsl_*.so.20220623 SONAMEs). NIXL's libs built against subproject 20250814 can't dlopen against the runtime image's 20220623 (SONAME mismatch).

The clean fix is to pre-install Abseil consistently at /usr/local in wheel_builder so meson uses it deterministically, then propagate the .so files to runtime stages. This matches yutwu's validated approach in the GLM-5.1 EFA-patch script (`install_abseil_from_source`).

### What changes

1. `container/context.yaml`:
   - Bump `nixl_ref: 0.10.1 → v1.1.0` in all 4 sections.
   - Add `dynamo.abseil_ref: 20240722.0` (the Abseil LTS yutwu validated).
2. `container/templates/args.Dockerfile`:
   - Declare `ARG ABSEIL_REF` (gated by `{% if device == "cuda" %}`, same as the other NIXL-related ARGs).
3. `container/templates/wheel_builder.Dockerfile`:
   - New RUN block BEFORE the NIXL clone+build that source-builds Abseil ${ABSEIL_REF} to /usr/local with `BUILD_SHARED_LIBS=ON` + `ABSL_ENABLE_INSTALL=ON`.
   - Gated by `pkg-config --exists absl_log` as a no-op if a future base image already ships a recent Abseil.
   - Validates `pkg-config --modversion absl_log` succeeds after install (build fails if not).
4. `container/templates/dynamo_runtime.Dockerfile`:
   - New COPY line bringing /usr/local/lib/libabsl_*.so* from wheel_builder so libnixl can resolve its Abseil deps at dlopen.
5. `container/templates/trtllm_runtime.Dockerfile`:
   - Same Abseil .so COPY. trtllm_runtime overrides the dynamo runtime stage, so it needs its own COPY independently.

vllm and sglang frameworks are unaffected at the runtime level: they use upstream image NIXL packages (nixl-cu12 from vllm-openai, sglang's bundled NIXL), not dynamo's wheel_builder NIXL. Their wheel_builder stage still builds Abseil (gated by `device == "cuda"`), which is consistent with the existing pattern of building UCX/NIXL/etc. in wheel_builder regardless of whether the framework runtime uses them.

### Why source-build (and not apt-install a newer libabsl-dev)

I evaluated `apt install libabsl-dev` as an option. Ubuntu 24.04's main archive ships `libabsl-dev 20220623.1-1build1`, which is exactly the version NIXL hard-errors on. Backports / -updates / -proposed don't have a newer version. yutwu ran the same investigation and arrived at source-build as the only working path.

If a maintainer knows of a repo (NVIDIA apt, PPA, etc.) with a newer libabsl-dev that works on Ubuntu 24.04 / cuda-dl-base, I'm happy to swap source-build for apt-install; the source-build adds ~2-3 min of build time per arch.

### Risk

MEDIUM. This is a major version bump (0.x → 1.x) plus a new build dependency. Specifically:

- The C++ libnixl.so ABI is the primary compat concern. Subproject fallback would have produced a 20250814 Abseil SONAME mismatch at runtime; this PR avoids that by pinning to 20240722.0 consistently across build + runtime stages.
- Python NIXL bindings may have changed across 0.10.1 → v1.1.0. Dynamo's serving code that imports `nixl` needs verification. Worker init + a basic KV transfer smoke is sufficient.
- Image size: +~30-40 MB for the Abseil shared libs in runtime stages (libabsl_log, libabsl_base, libabsl_strings, libabsl_status, libabsl_synchronization, libabsl_time, libabsl_flat_hash_map, etc. plus their dependency closure).
- Build time: +~2-3 min in wheel_builder for the Abseil compile.

### Tag-format note

`ai-dynamo/nixl` mixes tag conventions (older releases lack `v` prefix, newer ones use it):

- Older releases: `0.1.1`, `0.10.0`, `0.10.1` (no `v` prefix)
- Newer releases: `v1.1.0` (with `v` prefix)

This bump uses `v1.1.0` to match the upstream tag. wheel_builder uses `git checkout ${NIXL_REF}` so the value must be exactly the tag name.
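Because `git checkout ${NIXL_REF}` needs the exact tag name, a version can be matched against the mixed tag list before pinning it. The sketch below is a hypothetical helper (`resolve_tag` is not part of dynamo or NIXL) that takes a list of tag names as input rather than querying the repository.

```python
def resolve_tag(version: str, tags: list[str]) -> str:
    """Return the tag in `tags` that names `version`.

    Tries the bare form first ("1.1.0"), then the prefixed form ("v1.1.0"),
    so callers do not need to know which convention a release used.
    """
    bare = version.removeprefix("v")
    for candidate in (bare, f"v{bare}"):
        if candidate in tags:
            return candidate
    raise LookupError(f"no tag for version {version!r}")


# Tag names taken from the note above.
NIXL_TAGS = ["0.1.1", "0.10.0", "0.10.1", "v1.1.0"]
```

For example, `resolve_tag("1.1.0", NIXL_TAGS)` yields `v1.1.0`, while `resolve_tag("v0.10.1", NIXL_TAGS)` yields `0.10.1`, the forms that a checkout by tag would accept.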
### What's needed before this can merge

- [ ] `trtllm-pipeline` / `vllm-pipeline` / `sglang-pipeline` / `dynamo-pipeline` CI pass with both Abseil source-build and NIXL v1.1.0.
- [ ] Manual smoke: rebuild a trtllm-runtime image, deploy a 1P1D Qwen3-8B disagg, confirm KV transfer works end-to-end.
- [ ] (Stretch) `nixlbench` from inside the rebuilt image hits the expected ~190 GB/s on full-node p6e-gb200 allocation.

### References

- NIXL v1.1.0 release: https://github.com/ai-dynamo/nixl/releases/tag/v1.1.0
- NIXL meson Abseil logic: https://github.com/ai-dynamo/nixl/blob/v1.1.0/meson.build (the hard-error block we're working around)

Companion PRs:

- ai-dynamo#9703, `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
- ai-dynamo#9704, `feat(container)`: render.py --has-trtllm-context flag
- ai-dynamo#9705, `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
- this PR, `build(container)`: nixl_ref v1.1.0 + Abseil prereq
- ai-dynamo#9727, `feat(container)`: patched libfabric in aws stage

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
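The configure-time Abseil behavior referenced above (the hard-error block in NIXL's meson.build) can be modeled as a small decision function. This is a sketch of the logic as described in this PR, not a copy of NIXL's build code. The names `AbseilSource`, `AbseilTooOld`, and `choose_abseil` are hypothetical, and the case where `absl_log` is found without `absl_base` is not covered by the quoted excerpt, so treating it as "use system" is an assumption.

```python
from enum import Enum


class AbseilSource(Enum):
    SYSTEM = "system"          # pkg-config found a usable Abseil
    SUBPROJECT = "subproject"  # meson falls back to its bundled copy


class AbseilTooOld(RuntimeError):
    """Mirrors the configure-time hard error for an outdated system Abseil."""


def choose_abseil(absl_base_found: bool, absl_log_found: bool) -> AbseilSource:
    """Model which Abseil the build would use, given pkg-config results."""
    if absl_base_found and not absl_log_found:
        # Old system Abseil: configure aborts instead of mixing versions.
        raise AbseilTooOld("found absl_base but missing support for absl_log")
    if not absl_base_found and not absl_log_found:
        # No system Abseil at all: the only case that reaches the subproject.
        return AbseilSource.SUBPROJECT
    # Assumption: any remaining case has absl_log available from the system.
    return AbseilSource.SYSTEM
```

Under this model, a base image with no Abseil lands on `SUBPROJECT`, an image with stock 20220623 raises `AbseilTooOld`, and pre-installing a recent Abseil at /usr/local gives `SYSTEM`, which is the deterministic path this PR aims for.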
Bumps `nixl_ref: 0.10.1 → v1.1.0` in all four occurrences in container/context.yaml (one in `dynamo:` common section, one each in the `vllm:`, `sglang:`, `trtllm:` framework sections), and adds the Abseil source-build prereq that NIXL >= 1.0.0 requires. ### Why bump NIXL NIXL v1.1.0 brings four fixes that matter on AWS p6e-gb200 EFA fabric: ai-dynamo#1461 NUMA-aware EFA rail selection — without this, the LIBFABRIC backend assigns all initiator GPUs to a single rail and caps aggregate bandwidth at ~1.79 GB/s on p6e-gb200 instead of the expected ~190 GB/s with 4 GPUs + 4 EFA NICs per pod. ai-dynamo#1510 Active rail tracking for multi-rail concurrent transfers. ai-dynamo#1506 Multi-GPU memory-region fix (relevant for TP > 1 workers registering large VRAM blocks across GPUs). ai-dynamo#1433 Transfer handle repost notification fix. These gains were validated by yutwu (NVIDIA teammate) during the GLM-5.1 v7→v7.5 image bump on AWS p6e-gb200. For dynamo-trtllm disagg over EFA, NIXL 0.10.1's rail policy is the difference between "looks correct, caps at 1.79 GB/s" and "190 GB/s aggregate". ### The Abseil prereq NIXL >= 1.0.0 uses VLOG(1)/DVLOG(2) in nixl_log.h, which require Abseil >= 20240116. NIXL's meson.build first searches for system Abseil via pkg-config: absl_base_dep = dependency('absl_base', required: false) absl_log_dep = dependency('absl_log', required: false) ... if absl_base_dep.found() and not absl_log_dep.found() error('Your Abseil version is too old: found absl_base but missing support for absl_log. Cannot fallback to subproject because that would result in a mix of Abseil versions at runtime.') If pkg-config finds an old Abseil (absl_base present but no absl_log), NIXL HARD-ERRORS at configure time. Subproject fallback only triggers when NO system Abseil is found. This trips two ways for dynamo: 1. 
**wheel_builder (AlmaLinux 8 / manylinux_2_28)** typically lacks system Abseil, so subproject fallback would work — BUT the subproject builds shared libs with SONAMEs like libabsl_*.so.20250814. 2. **runtime images (cuda-dl-base Ubuntu 24.04)** ship stock libabsl-dev 20220623 (libabsl_*.so.20220623 SONAMEs). NIXL's libs built against subproject 20250814 can't dlopen against the runtime image's 20220623 — SONAME mismatch. The clean fix is to pre-install Abseil consistently at /usr/local in wheel_builder so meson uses it deterministically, then propagate the .so files to runtime stages. This matches yutwu's validated approach in the GLM-5.1 EFA-patch script (`install_abseil_from_source`). ### What changes 1. `container/context.yaml`: - Bump `nixl_ref: 0.10.1 → v1.1.0` in all 4 sections. - Add `dynamo.abseil_ref: 20240722.0` (the Abseil LTS yutwu validated). 2. `container/templates/args.Dockerfile`: - Declare `ARG ABSEIL_REF` (gated by `{% if device == "cuda" %}`, same as the other NIXL-related ARGs). 3. `container/templates/wheel_builder.Dockerfile`: - New RUN block BEFORE the NIXL clone+build that source-builds Abseil ${ABSEIL_REF} to /usr/local with `BUILD_SHARED_LIBS=ON` + `ABSL_ENABLE_INSTALL=ON`. - Gated by `pkg-config --exists absl_log` as a no-op if a future base image already ships a recent Abseil. - Validates `pkg-config --modversion absl_log` succeeds after install (build fails if not). 4. `container/templates/dynamo_runtime.Dockerfile`: - New COPY line bringing /usr/local/lib/libabsl_*.so* from wheel_builder so libnixl can resolve its Abseil deps at dlopen. 5. `container/templates/trtllm_runtime.Dockerfile`: - Same Abseil .so COPY. trtllm_runtime overrides the dynamo runtime stage, so it needs its own COPY independently. vllm and sglang frameworks are unaffected at the runtime level — they use upstream image NIXL packages (nixl-cu12 from vllm-openai, sglang's bundled NIXL), not dynamo's wheel_builder NIXL. 
Their wheel_builder stage still builds Abseil (gated by `device == "cuda"`), which is consistent with the existing pattern of building UCX/NIXL/etc. in wheel_builder regardless of whether the framework runtime uses them. ### Why source-build (and not apt-install a newer libabsl-dev) I evaluated `apt-install libabsl-dev` as an option. Ubuntu 24.04's main archive ships `libabsl-dev 20220623.1-1build1`, which is exactly the version NIXL hard-errors on. Backports / -updates / -proposed don't have a newer version. yutwu hit the same investigation and arrived at source-build as the only working path. If a maintainer knows of a repo (NVIDIA apt, PPA, etc.) with a newer libabsl-dev that works on Ubuntu 24.04 / cuda-dl-base, I'm happy to swap source-build for apt-install — the source-build adds ~2-3 min of build time per arch. ### Risk MEDIUM. This is a major version bump (0.x → 1.x) plus a new build dependency. Specifically: - The C++ libnixl.so ABI is the primary compat concern. Subproject fallback would have produced a 20250814 Abseil SONAME mismatch at runtime; this PR avoids that by pinning to 20240722.0 consistently across build + runtime stages. - Python NIXL bindings may have changed across 0.10.1 → v1.1.0. Dynamo's serving code that imports `nixl` needs verification. Worker init + a basic KV transfer smoke is sufficient. - Image size: +~30-40 MB for the Abseil shared libs in runtime stages (libabsl_log, libabsl_base, libabsl_strings, libabsl_status, libabsl_synchronization, libabsl_time, libabsl_flat_hash_map, etc. plus their dependency closure). - Build time: +~2-3 min in wheel_builder for the Abseil compile. ### Tag-format note `ai-dynamo/nixl` mixes tag conventions (older releases lack `v` prefix, newer ones use it): - Older releases: `0.1.1`, `0.10.0`, `0.10.1` (no `v` prefix) - Newer releases: `v1.1.0` (with `v` prefix) This bump uses `v1.1.0` to match the upstream tag. wheel_builder uses `git checkout ${NIXL_REF}` so the value must be exactly the tag name. 
### What's needed before this can merge - [ ] `trtllm-pipeline` / `vllm-pipeline` / `sglang-pipeline` / `dynamo-pipeline` CI pass with both Abseil source-build and NIXL v1.1.0. - [ ] Manual smoke: rebuild a trtllm-runtime image, deploy a 1P1D Qwen3-8B disagg, confirm KV transfer works end-to-end. - [ ] (Stretch) `nixlbench` from inside the rebuilt image hits the expected ~190 GB/s on full-node p6e-gb200 allocation. ### References - NIXL v1.1.0 release: https://github.com/ai-dynamo/nixl/releases/tag/v1.1.0 - NIXL meson Abseil logic: https://github.com/ai-dynamo/nixl/blob/v1.1.0/meson.build (the hard-error block we're working around) Companion PRs: - ai-dynamo#9703 — `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error - ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag - ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2 - this PR — `build(container)`: nixl_ref v1.1.0 + Abseil prereq - ai-dynamo#9727 — `feat(container)`: patched libfabric in aws stage Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
Bumps `nixl_ref: 0.10.1 → v1.1.0` in all four occurrences in container/context.yaml (one in `dynamo:` common section, one each in the `vllm:`, `sglang:`, `trtllm:` framework sections), and adds the Abseil source-build prereq that NIXL >= 1.0.0 requires.

### Why bump NIXL

NIXL v1.1.0 brings four fixes that matter on AWS p6e-gb200 EFA fabric:

- ai-dynamo#1461 NUMA-aware EFA rail selection — without this, the LIBFABRIC backend assigns all initiator GPUs to a single rail and caps aggregate bandwidth at ~1.79 GB/s on p6e-gb200 instead of the expected ~190 GB/s with 4 GPUs + 4 EFA NICs per pod.
- ai-dynamo#1510 Active rail tracking for multi-rail concurrent transfers.
- ai-dynamo#1506 Multi-GPU memory-region fix (relevant for TP > 1 workers registering large VRAM blocks across GPUs).
- ai-dynamo#1433 Transfer handle repost notification fix.

These gains were validated by yutwu (NVIDIA teammate) during the GLM-5.1 v7→v7.5 image bump on AWS p6e-gb200. For dynamo-trtllm disagg over EFA, NIXL 0.10.1's rail policy is the difference between "looks correct, caps at 1.79 GB/s" and "190 GB/s aggregate".

### The Abseil prereq

NIXL >= 1.0.0 uses VLOG(1)/DVLOG(2) in nixl_log.h, which require Abseil >= 20240116. NIXL's meson.build first searches for system Abseil via pkg-config:

    absl_base_dep = dependency('absl_base', required: false)
    absl_log_dep = dependency('absl_log', required: false)
    ...
    if absl_base_dep.found() and not absl_log_dep.found()
      error('Your Abseil version is too old: found absl_base but missing support for absl_log. Cannot fallback to subproject because that would result in a mix of Abseil versions at runtime.')

If pkg-config finds an old Abseil (absl_base present but no absl_log), NIXL HARD-ERRORS at configure time. Subproject fallback only triggers when NO system Abseil is found.

This trips two ways for dynamo:

1. **wheel_builder (AlmaLinux 8 / manylinux_2_28)** typically lacks system Abseil, so subproject fallback would work — BUT the subproject builds shared libs with SONAMEs like libabsl_*.so.20250814.
2. **runtime images (cuda-dl-base Ubuntu 24.04)** ship stock libabsl-dev 20220623 (libabsl_*.so.20220623 SONAMEs). NIXL's libs built against subproject 20250814 can't dlopen against the runtime image's 20220623 — SONAME mismatch.

The clean fix is to pre-install Abseil consistently at /usr/local in wheel_builder so meson uses it deterministically, then propagate the .so files to runtime stages. This matches yutwu's validated approach in the GLM-5.1 EFA-patch script (`install_abseil_from_source`).

### What changes

1. `container/context.yaml`:
   - Bump `nixl_ref: 0.10.1 → v1.1.0` in all 4 sections.
   - Add `dynamo.abseil_ref: 20240722.0` (the Abseil LTS yutwu validated).
2. `container/templates/args.Dockerfile`:
   - Declare `ARG ABSEIL_REF` (gated by `{% if device == "cuda" %}`, same as the other NIXL-related ARGs).
3. `container/templates/wheel_builder.Dockerfile`:
   - New RUN block BEFORE the NIXL clone+build that source-builds Abseil ${ABSEIL_REF} to /usr/local with `BUILD_SHARED_LIBS=ON` + `ABSL_ENABLE_INSTALL=ON`.
   - Gated by `pkg-config --exists absl_log` as a no-op if a future base image already ships a recent Abseil.
   - Validates `pkg-config --modversion absl_log` succeeds after install (build fails if not).
4. `container/templates/dynamo_runtime.Dockerfile`:
   - New COPY line bringing /usr/local/lib/libabsl_*.so* from wheel_builder so libnixl can resolve its Abseil deps at dlopen.
5. `container/templates/trtllm_runtime.Dockerfile`:
   - Same Abseil .so COPY. trtllm_runtime overrides the dynamo runtime stage, so it needs its own COPY independently.

vllm and sglang frameworks are unaffected at the runtime level — they use upstream image NIXL packages (nixl-cu12 from vllm-openai, sglang's bundled NIXL), not dynamo's wheel_builder NIXL. Their wheel_builder stage still builds Abseil (gated by `device == "cuda"`), which is consistent with the existing pattern of building UCX/NIXL/etc. in wheel_builder regardless of whether the framework runtime uses them.

### Why source-build (and not apt-install a newer libabsl-dev)

I evaluated `apt-install libabsl-dev` as an option. Ubuntu 24.04's main archive ships `libabsl-dev 20220623.1-1build1`, which is exactly the version NIXL hard-errors on. Backports / -updates / -proposed don't have a newer version. yutwu hit the same investigation and arrived at source-build as the only working path.

If a maintainer knows of a repo (NVIDIA apt, PPA, etc.) with a newer libabsl-dev that works on Ubuntu 24.04 / cuda-dl-base, I'm happy to swap source-build for apt-install — the source-build adds ~2-3 min of build time per arch.

### Risk

MEDIUM. This is a major version bump (0.x → 1.x) plus a new build dependency. Specifically:

- The C++ libnixl.so ABI is the primary compat concern. Subproject fallback would have produced a 20250814 Abseil SONAME mismatch at runtime; this PR avoids that by pinning to 20240722.0 consistently across build + runtime stages.
- Python NIXL bindings may have changed across 0.10.1 → v1.1.0. Dynamo's serving code that imports `nixl` needs verification. Worker init + a basic KV transfer smoke is sufficient.
- Image size: +~30-40 MB for the Abseil shared libs in runtime stages (libabsl_log, libabsl_base, libabsl_strings, libabsl_status, libabsl_synchronization, libabsl_time, libabsl_flat_hash_map, etc. plus their dependency closure).
- Build time: +~2-3 min in wheel_builder for the Abseil compile.

### Tag-format note

`ai-dynamo/nixl` mixes tag conventions (older releases lack `v` prefix, newer ones use it):

- Older releases: `0.1.1`, `0.10.0`, `0.10.1` (no `v` prefix)
- Newer releases: `v1.1.0` (with `v` prefix)

This bump uses `v1.1.0` to match the upstream tag. wheel_builder uses `git checkout ${NIXL_REF}` so the value must be exactly the tag name.

### What's needed before this can merge

- [ ] `trtllm-pipeline` / `vllm-pipeline` / `sglang-pipeline` / `dynamo-pipeline` CI pass with both Abseil source-build and NIXL v1.1.0.
- [ ] Manual smoke: rebuild a trtllm-runtime image, deploy a 1P1D Qwen3-8B disagg, confirm KV transfer works end-to-end.
- [ ] (Stretch) `nixlbench` from inside the rebuilt image hits the expected ~190 GB/s on full-node p6e-gb200 allocation.

### References

- NIXL v1.1.0 release: https://github.com/ai-dynamo/nixl/releases/tag/v1.1.0
- NIXL meson Abseil logic: https://github.com/ai-dynamo/nixl/blob/v1.1.0/meson.build (the hard-error block we're working around)

Companion PRs:

- ai-dynamo#9703 — `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
- ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag
- ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
- this PR — `build(container)`: nixl_ref v1.1.0 + Abseil prereq
- ai-dynamo#9727 — `feat(container)`: patched libfabric in aws stage

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
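The Abseil configure behaviour described in this commit message can be summarised as a small decision table. The following is only a sketch under stated assumptions: `abseil_outcome` is a hypothetical helper (it is not part of NIXL or dynamo), and its two arguments stand in for what `pkg-config --exists absl_base` and `pkg-config --exists absl_log` would report.

```shell
#!/bin/sh
# Hypothetical helper modelling the three cases described in the commit message.
#   $1 = "yes"/"no": would pkg-config find absl_base?
#   $2 = "yes"/"no": would pkg-config find absl_log?
abseil_outcome() {
    base_found=$1
    log_found=$2
    if [ "$base_found" = "yes" ] && [ "$log_found" = "no" ]; then
        echo "hard-error"     # old system Abseil: configure aborts
    elif [ "$base_found" = "yes" ] && [ "$log_found" = "yes" ]; then
        echo "system"         # recent system Abseil is used as-is
    elif [ "$base_found" = "no" ] && [ "$log_found" = "no" ]; then
        echo "subproject"     # no system Abseil at all: subproject fallback
    else
        echo "unspecified"    # combination not covered by the description
    fi
}
```

With the stock Ubuntu 24.04 `libabsl-dev` described above, the inputs would be `yes no`, which is the hard-error row; pre-installing Abseil at /usr/local is meant to move both build and runtime onto the `yes yes` row.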
The `--make-efa` aws stage currently installs only the stock AWS EFA userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric (2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when registering VRAM as an EFA memory region on GB200 hardware, fi_mr_reg falls through to ibv_reg_mr() with a GPU virtual address and returns EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend becomes unusable for VRAM-to-VRAM KV transfers — the primary EFA use case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR ofiwg/libfabric#12216). Until that lands in the EFA installer's default libfabric, every `--make-efa` image needs to build and overlay the patched version. This PR moves that overlay from a downstream post-process script (the AWS-team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1` path, which we've been running on top of dynamo images for the v9-efa fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

- container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is (1.47.0 empirically works on cuda-dl-base; our patched libfabric overwrites the stock binary regardless of underlying EFA installer version, so a defensive 1.46.0 pin is not strictly needed). Add new keys `dynamo.patched_libfabric_repo` and `dynamo.patched_libfabric_ref` with defaults `aws/libfabric` and `v2.3.1amzn4.0`. Add a defensive comment about 1.48.0 being broken on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is unsatisfiable on Ubuntu's 50.x) so future bumpers don't trip over it.
- container/templates/args.Dockerfile: declare two new ARGs (gated by `{% if make_efa == true %}`).
- container/templates/aws.Dockerfile: after the existing EFA installer RUN, add a second RUN that installs build deps, clones the patched libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`, installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer can populate either; whichever ldconfig sees first wins), deletes stock `libfabric.so.1.30.*` binaries (defends against hardcoded RPATHs), runs `ldconfig`, and VALIDATES `fi_info --version` reports the patched libfabric. The build fails if validation fails — turns a deployment-time error into a build-time error.

Non-EFA images are completely unchanged: the patched-libfabric build is gated by `{% if make_efa == true %}` in args.Dockerfile and only renders into the aws stage.

The related ofi-nccl rm path fix (Gap B in our investigation; the existing `rm -rf /opt/amazon/aws-ofi-nccl` targets a path the EFA installer doesn't create) is intentionally NOT in this PR — it's owned by ai-dynamo#9703. If ai-dynamo#9703 lands after this PR, rebase ai-dynamo#9703 onto main; if before, this PR will need a trivial conflict resolution on the existing rm line.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

- `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2` (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
- `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2` (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds ~3-5 min of build time and ~200-300 MB of image size (build deps left installed for simplicity — separate PR can purge them if size is a concern). Non-EFA paths unaffected.

The build deps added (autoconf, automake, libtool, make, build-essential, pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev, rdma-core, ca-certificates, git) are a superset of what's already pulled in by the EFA installer's own `apt install`, so the marginal cost is small. cuda-dl-base provides `/usr/local/cuda`, which the patched libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the `install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

- ai-dynamo#9703 — `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
- ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag
- ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
- ai-dynamo#9706 — `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil dep work as a follow-up)
- this PR — `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target runtime --platform linux/amd64 --cuda-version 13.1 --make-efa --has-trtllm-context` followed by `docker build ...` produces an EFA-correct image with no post-process script needed.

nixlbench (the validation tool) is intentionally NOT included — it can live as a separate `--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout (separate builder stage to keep build deps out of the runtime, different default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
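The two defensive steps this commit message stresses, forcing the `libfabric.so.1` symlink and failing the build when the version check does not match, could be sketched roughly as below. This is an illustration, not the contents of aws.Dockerfile: `force_soname_link` and `version_matches` are hypothetical helper names, and the idea that the `fi_info --version` output contains the ref without its leading `v` is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: point DIR/libfabric.so.1 at TARGET.
# A missing DIR is skipped, since the installer may populate only lib or lib64.
force_soname_link() {
    dir=$1
    target=$2
    [ -d "$dir" ] || return 0
    ln -sfn "$target" "$dir/libfabric.so.1"
}

# Hypothetical helper: read version text on stdin and succeed only if it
# contains the expected string (fixed-string match, no regex).
version_matches() {
    grep -qF -- "$1"
}
```

In a build step this might be used as `fi_info --version | version_matches "${PATCHED_LIBFABRIC_REF#v}" || exit 1`, where `PATCHED_LIBFABRIC_REF` is an assumed variable name for the new ARG. A non-zero exit is what turns the mismatch into a build-time failure.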
… GDRCopy known issues

Three additions to docs/kubernetes/cloud-providers/eks/efa.md, all inside the existing Known Issues block:

1. Issue 1 (libfabric CUDA dmabuf): the existing workaround was missing three defensive steps that cause it to silently produce a broken image on GB200 — SONAME symlink force (make install does NOT overwrite the EFA installer's libfabric.so.1 → stock 1.30.x lookup wins), stock binary cleanup (defends against hardcoded RPATHs), and build-time fi_info --version validation (fails the build instead of failing in production). Added all three inline, with a Tracking PR pointer to ai-dynamo#9727 which upstreams the same fix.
2. Issue 2 (NEW): TRT-LLM rc14's libtensorrt_llm_nixl_wrapper.so is compiled against NIXL 0.9.x and references types dropped in NIXL ≥ 1.0 (nixlDescList<nixlBlobDesc>, <nixlBasicDesc>). Bumping dynamo's nixl_ref to v1.1.0 CrashLoopBackOffs every TRT-LLM disagg pod at executor init. Workaround: keep nixl_ref at 0.10.1; don't merge ai-dynamo#9706 until TRT-LLM upgrades its wrapper.
3. Issue 3 (NEW): GDRCopy v2.5.1 source RPM fails to compile against host kernel ≥ 6.15 (vm_flags_set redefinition). /dev/gdrdrv missing → GPU-Direct RDMA falls back to slower paths. Workaround: bump nixl_gdrcopy_ref to v2.5.2 (tracked in ai-dynamo#9705).

Also added a summary table at the top of Known Issues for at-a-glance triage, and two new rows to the Common Failure Modes table mapping the new symptom signatures to Issues 2 and 3.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…el-6.15 known issue

Two additions to docs/kubernetes/cloud-providers/eks/efa.md, both inside the existing Known Issues block:

1. Issue 1 (libfabric CUDA dmabuf): the existing workaround was missing three defensive steps that cause it to silently produce a broken image on GB200 — SONAME symlink force (make install does NOT overwrite the EFA installer's libfabric.so.1 → stock 1.30.x lookup wins), stock binary cleanup (defends against hardcoded RPATHs), and build-time fi_info --version validation (fails the build instead of failing in production). Added all three inline, with a Tracking PR pointer to ai-dynamo#9727 which upstreams the same fix.
2. Issue 2 (NEW): GDRCopy v2.5.1 source RPM fails to compile against host kernel ≥ 6.15 (vm_flags_set redefinition). /dev/gdrdrv missing → GPU-Direct RDMA falls back to slower paths. Workaround: bump nixl_gdrcopy_ref to v2.5.2 (tracked in ai-dynamo#9705).

Also added a summary table at the top of Known Issues for at-a-glance triage, and one new row to the Common Failure Modes table mapping the new symptom signature to Issue 2.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
GDRCopy v2.5.1 fails to build on host kernel ≥ 6.15 with a `vm_flags_set` redefinition error (Linux 6.15 changed the symbol's declaration). v2.5.2 fixes this.

NVIDIA/gdrcopy v2.5.2 release notes: https://github.com/NVIDIA/gdrcopy/releases/tag/v2.5.2

This matters for EKS deployments on AMIs that ship kernels ≥ 6.15 (currently most non-Amazon-Linux Ubuntu 24.04 nodes). With v2.5.1, the wheel_builder stage builds the GDRCopy *userspace* library fine but the resulting kmod can't compile against a 6.15+ host kernel — relevant when the host needs to load the kmod for GPU Direct RDMA.

The userspace library v2.5.2 is fully backward-compatible with v2.5.1 APIs, so the wheel_builder stage and NIXL linkage are unaffected.

Tested:

- render trtllm runtime with `--make-efa`, build the wheel_builder stage, confirm `gdrcopy/CHANGELOG.md` inside the image lists 2.5.2

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
44da767 to
1970b32
Compare
|
/ok to test 1970b32 |
…es are merged

Removes the "Known Issues" section from docs/kubernetes/cloud-providers/eks/efa.md and prunes the two now-obsolete rows from "Common Failure Modes".

Assumes ai-dynamo#9703, ai-dynamo#9704, ai-dynamo#9705, and ai-dynamo#9727 are all merged — after those land, the issues this section documented (GB200 fi_mr_reg(VRAM) failure on the EFA installer's stock libfabric, and GDRCopy v2.5.1 kmod build failure on kernel >= 6.15) no longer affect default --make-efa builds, so the inline workarounds the section provided would mislead readers.

Also removes the ofiwg/libfabric#12019 reference from the bottom links list since it points at the same now-resolved upstream issue.

Net diff: -34 / +1.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
Summary
Bumps `container/context.yaml: dynamo.nixl_gdrcopy_ref` from `v2.5.1` to `v2.5.2`. One-line change.

Why
GDRCopy v2.5.1 fails to build on host kernel ≥ 6.15 with a `vm_flags_set` redefinition error. Linux 6.15 changed the `vm_flags_set` declaration; GDRCopy v2.5.2 adapts.

The dynamo wheel_builder stage builds the GDRCopy userspace library fine on the build host (which is typically a 6.14 kernel slurm node), but the resulting kmod source RPM can't compile against a 6.15+ kernel when the GPU Operator on the deploy host eventually tries to build it. This blocks GPU Direct RDMA on any deploy AMI that ships a kernel ≥ 6.15, which is increasingly the default for Ubuntu 24.04 EKS nodes.
Surfaced during EFA validation on AWS p6e-gb200 (2026-05-09 v9-efa image build cycle); flagged as kernel constraint in the v9-efa recipe docs.
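To check whether a given host falls under this kernel constraint, something like the following minimal sketch could be used. `kernel_at_least_6_15` is a hypothetical helper, not part of dynamo or GDRCopy; it only parses the major and minor numbers out of a `uname -r` style release string and assumes that string starts with `MAJOR.MINOR`.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel release string is 6.15 or newer.
#   $1 = release string in `uname -r` form, e.g. "6.15.0-1008-aws"
kernel_at_least_6_15() {
    release=$1
    major=${release%%.*}        # text before the first dot
    rest=${release#*.}          # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of that remainder
    [ "$major" -gt 6 ] || { [ "$major" -eq 6 ] && [ "$minor" -ge 15 ]; }
}
```

A node could then run `kernel_at_least_6_15 "$(uname -r)" && echo "needs GDRCopy v2.5.2 or newer"` before attempting the kmod build.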
Companion PRs
This is one of four small PRs internalizing fixes that were previously layered via the external `install_efa_libfabric_nixl_fix2.sh` script:

- fix(container): ofi-nccl rm path
- feat(container): render.py `--has-trtllm-context` flag
- build(container): nixl_gdrcopy_ref v2.5.1 → v2.5.2
- feat(container): build upstream libfabric (v2.5.1) into the aws stage

Together, these four merged make `python3 container/render.py --framework trtllm --target runtime --cuda-version 13.1 --make-efa --has-trtllm-context && docker build ... --target aws ...` produce an EFA-correct image with no post-process. Validated end-to-end via the v4 internalized image (see `prs-internalized-v4-validation-2026-05-21.md`): Qwen3-Coder-480B-A35B-Instruct-FP4 on GB200 + Qwen3-30B-A3B-FP8 on H100, both READY 3/3 with 0 restarts.

Risk
LOW. v2.5.2 is fully API-backward-compatible with v2.5.1. The wheel_builder stage and NIXL's linkage against GDRCopy are unaffected. The only behavioral change is that the resulting kmod source RPM can build against newer kernels at deploy time.
Test plan
- `python3 container/render.py --framework trtllm --target wheel_builder --cuda-version 13.1 --platform linux/amd64` renders cleanly with `NIXL_GDRCOPY_REF=v2.5.2`.
- `dmesg | grep gdrcopy` shows no `vm_flags_set` redef error.

🤖 Generated with Claude Code
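The render check in the test plan can be spot-checked with a fixed-string grep over the rendered Dockerfile. The sketch below assumes the rendered file carries the value in `NAME=VALUE` form; `rendered_has_arg` is a hypothetical helper, and the rendered file's path is deliberately left as a parameter because it is not stated here.

```shell
#!/bin/sh
# Hypothetical helper: succeed if FILE contains the literal text NAME=VALUE.
#   $1 = path to the rendered Dockerfile
#   $2 = ARG name, e.g. NIXL_GDRCOPY_REF
#   $3 = expected value, e.g. v2.5.2
# Fixed-string match, so a longer value with the same prefix would also match.
rendered_has_arg() {
    grep -qF -- "$2=$3" "$1"
}
```

For example, `rendered_has_arg "$RENDERED_DOCKERFILE" NIXL_GDRCOPY_REF v2.5.2 || echo "bump not rendered"`, where `RENDERED_DOCKERFILE` is a placeholder for wherever render.py writes its output.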