
build(container): bump nixl_gdrcopy_ref to v2.5.2 (kernel >=6.15 fix) #9705

Merged
yifjiang merged 1 commit into ai-dynamo:main from yifjiang:yifjiang/bump-nixl-gdrcopy-2.5.2 on May 22, 2026

Conversation

@yifjiang
Contributor

@yifjiang yifjiang commented May 18, 2026

Summary

Bumps container/context.yaml: dynamo.nixl_gdrcopy_ref from v2.5.1 to v2.5.2. One-line change.

Why

GDRCopy v2.5.1 fails to build on host kernel ≥ 6.15:

error: redefinition of 'vm_flags_set'

Linux 6.15 changed the declaration of vm_flags_set; GDRCopy v2.5.2 adapts to the new declaration.

The dynamo wheel_builder stage builds the GDRCopy userspace library fine on the build host (which is typically a 6.14 kernel slurm node), but the resulting kmod source RPM can't compile against a 6.15+ kernel when the GPU Operator on the deploy host eventually tries to build it. This blocks GPU Direct RDMA on any deploy AMI that ships a kernel ≥ 6.15 — which is increasingly the default for Ubuntu 24.04 EKS nodes.
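
The kernel constraint above can be expressed as a small pre-flight check. This is a minimal sketch, not part of the PR: the function names are hypothetical, the 6.15 threshold comes from the description above, and the release strings used in testing are made-up examples of `uname -r` output.

```python
import platform
import re

# First kernel series on which GDRCopy v2.5.1 fails to build, per the text above.
FIRST_BROKEN_KERNEL = (6, 15)


def kernel_major_minor(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def needs_gdrcopy_2_5_2(release: str) -> bool:
    """True when the kernel is new enough to hit the vm_flags_set redefinition."""
    return kernel_major_minor(release) >= FIRST_BROKEN_KERNEL


if __name__ == "__main__":
    # Checks the machine this script runs on; on a deploy host this would be
    # the kernel the GPU Operator builds the kmod against.
    print(needs_gdrcopy_2_5_2(platform.release()))
```

Comparing `(major, minor)` tuples avoids the string-comparison trap where "6.9" sorts after "6.15".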

Surfaced during EFA validation on AWS p6e-gb200 (2026-05-09 v9-efa image build cycle); flagged as kernel constraint in the v9-efa recipe docs.

Companion PRs

This is one of four small PRs internalizing fixes that were previously layered via the external install_efa_libfabric_nixl_fix2.sh script:

| PR | What |
|---|---|
| #9703 | fix(container): ofi-nccl rm path |
| #9704 | feat(container): render.py --has-trtllm-context flag |
| this PR | build(container): nixl_gdrcopy_ref v2.5.1 → v2.5.2 |
| #9727 | feat(container): build upstream libfabric (v2.5.1) into the aws stage |

With all four merged, running python3 container/render.py --framework trtllm --target runtime --cuda-version 13.1 --make-efa --has-trtllm-context && docker build ... --target aws ... produces an EFA-correct image with no post-processing step. Validated end-to-end via the v4 internalized image (see prs-internalized-v4-validation-2026-05-21.md): Qwen3-Coder-480B-A35B-Instruct-FP4 on GB200 and Qwen3-30B-A3B-FP8 on H100, both READY 3/3 with 0 restarts.
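
For scripting the render step, the command line quoted above can be assembled programmatically. A minimal sketch: `render_argv` is a hypothetical helper, and it only uses the flags that appear in this PR thread (`--framework`, `--target`, `--platform`, `--cuda-version`, `--make-efa`, `--has-trtllm-context`). It builds the argument list without running anything.

```python
import shlex
from typing import Optional


def render_argv(
    framework: str,
    target: str,
    cuda_version: str,
    platform: Optional[str] = None,
    make_efa: bool = False,
    has_trtllm_context: bool = False,
) -> list[str]:
    """Build the render.py command line as an argv list (nothing is executed)."""
    argv = ["python3", "container/render.py", "--framework", framework, "--target", target]
    if platform is not None:
        argv += ["--platform", platform]
    argv += ["--cuda-version", cuda_version]
    if make_efa:
        argv.append("--make-efa")
    if has_trtllm_context:
        argv.append("--has-trtllm-context")
    return argv


if __name__ == "__main__":
    argv = render_argv("trtllm", "runtime", "13.1", make_efa=True, has_trtllm_context=True)
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(argv))
```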

Risk

LOW. v2.5.2 is fully API-backward-compatible with v2.5.1. The wheel_builder stage and NIXL's linkage against GDRCopy are unaffected. The only behavioral change is that the resulting kmod source RPM can build against newer kernels at deploy time.

Test plan

  • Sanity: python3 container/render.py --framework trtllm --target wheel_builder --cuda-version 13.1 --platform linux/amd64 renders cleanly with NIXL_GDRCOPY_REF=v2.5.2.
  • Container builds, libgdrapi present in ldconfig (v4 validation).
  • CI: trtllm-pipeline build (wheel_builder stage clones gdrcopy at the new ref).
  • Manual on a kernel ≥ 6.15 host: load the kmod and confirm dmesg | grep gdrcopy shows no vm_flags_set redef error.
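
The "libgdrapi present in ldconfig" check from the test plan could be automated roughly as follows. This is a sketch under assumptions: `has_library` is a hypothetical helper, it expects the text of `ldconfig -p` to be passed in, and it assumes the usual one-entry-per-line layout where each entry starts with the library file name.

```python
def has_library(ldconfig_output: str, name: str = "libgdrapi") -> bool:
    """Return True if any `ldconfig -p` entry is for `<name>.so*`."""
    for line in ldconfig_output.splitlines():
        entry = line.strip()
        if entry.startswith(name + ".so"):
            return True
    return False


if __name__ == "__main__":
    import subprocess

    # Requires ldconfig on PATH; run inside the built container.
    listing = subprocess.run(
        ["ldconfig", "-p"], capture_output=True, text=True, check=True
    ).stdout
    print(has_library(listing))
```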

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Updated CUDA/NIXL configuration to latest version.


@github-actions
Contributor

👋 Hi yifjiang! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions Bot added the external-contribution (Pull request is from an external contributor) and build labels May 18, 2026
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 19, 2026
The `--make-efa` aws stage currently installs only the stock AWS EFA
userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric
(2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when
registering VRAM as an EFA memory region on GB200 hardware, fi_mr_reg
falls through to ibv_reg_mr() with a GPU virtual address and returns
EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend
becomes unusable for VRAM-to-VRAM KV transfers — the primary EFA use
case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR
ofiwg/libfabric#12216). Until that lands in the
EFA installer's default libfabric, every `--make-efa` image needs to
build and overlay the patched version.

This PR moves that overlay from a downstream post-process script (the
AWS-team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1`
path, which we've been running on top of dynamo images for the v9-efa
fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

  - container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is
    (1.47.0 empirically works on cuda-dl-base; defensive 1.46.0 pin
    not strictly needed because our patched libfabric overwrites the
    stock binary anyway). Add new keys `dynamo.patched_libfabric_repo`
    and `dynamo.patched_libfabric_ref` with defaults `aws/libfabric`
    and `v2.3.1amzn4.0`. Add a comment noting that 1.48.0 is broken
    on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is
    unsatisfiable on Ubuntu's 50.x) — defensive guidance for future
    bumpers.

  - container/templates/args.Dockerfile: declare two new ARGs
    (gated by `{% if make_efa == true %}`).

  - container/templates/aws.Dockerfile: after the existing EFA installer
    RUN, add a second RUN that installs build deps, clones the patched
    libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`,
    installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME
    symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer
    can populate either; whichever ldconfig sees first wins), deletes
    stock `libfabric.so.1.30.*` binaries (defends against hardcoded
    RPATHs), runs `ldconfig`, and VALIDATES `fi_info --version` reports
    the patched libfabric. The build fails if validation fails — turns
    a deployment-time error into a build-time error.
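
The build-time validation described in the last bullet can be sketched as below. This is not the check in aws.Dockerfile: `reports_patched_libfabric` is a hypothetical helper, and it assumes that `fi_info --version` prints a `libfabric: <version>` line and that the version string contains the ref without its leading `v`. Both assumptions should be confirmed against real output before relying on it.

```python
def reports_patched_libfabric(fi_info_output: str, expected_ref: str = "v2.3.1amzn4.0") -> bool:
    """Check that the `libfabric:` line of `fi_info --version` matches the ref."""
    # Assumption: the reported version omits the tag's leading "v".
    expected = expected_ref[1:] if expected_ref.startswith("v") else expected_ref
    for line in fi_info_output.splitlines():
        key, _, value = line.partition(":")
        # Match only the "libfabric" key, not "libfabric api" or "fi_info".
        if key.strip() == "libfabric" and expected in value:
            return True
    return False


if __name__ == "__main__":
    import subprocess
    import sys

    out = subprocess.run(
        ["fi_info", "--version"], capture_output=True, text=True, check=True
    ).stdout
    # A non-zero exit fails the enclosing RUN step, and with it the image build.
    sys.exit(0 if reports_patched_libfabric(out) else 1)
```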

Also: while editing aws.Dockerfile, fix the related ofi-nccl rm path
typo that's the subject of ai-dynamo#9703 (rm /opt/amazon/aws-ofi-nccl was a no-op
because the EFA installer puts the plugin at /opt/amazon/ofi-nccl/).
Doing it here in one commit since the surrounding RUN block changes
substantially anyway — happy to drop this and rebase if ai-dynamo#9703 lands first.

Non-EFA images are completely unchanged: the patched-libfabric build is
gated by `{% if make_efa == true %}` in args.Dockerfile and only renders
into the aws stage.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

  - `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2`
    (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
  - `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2`
    (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered
post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed
nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM
disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds
~3-5 min of build time and ~200-300 MB of image size (build deps left
installed for simplicity — separate PR can purge them if size is a
concern). Non-EFA paths unaffected.

The build deps added (autoconf, automake, libtool, make, build-essential,
pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev,
rdma-core, ca-certificates, git) are a superset of what's already pulled
in by the EFA installer's own `apt install`, so the marginal cost is
small. cuda-dl-base provides `/usr/local/cuda`, which the patched
libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the
`install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

  - ai-dynamo#9703 — `fix(container)`: ofi-nccl rm path (this PR includes the
            same fix; can rebase if needed)
  - ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag
  - ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
  - ai-dynamo#9706 — `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil
            dep work as a follow-up)
  - this PR — `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target
runtime --platform linux/amd64 --cuda-version 13.1 --make-efa
--has-trtllm-context` followed by `docker build ...` produces an
EFA-correct image with no post-process script needed. nixlbench (the
validation tool) is intentionally NOT included — it can live as a separate
`--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout
(separate builder stage to keep build deps out of the runtime, different
default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 19, 2026
The `--make-efa` aws stage currently installs only the stock AWS EFA
userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric
(2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when
registering VRAM as an EFA memory region on GB200 hardware, fi_mr_reg
falls through to ibv_reg_mr() with a GPU virtual address and returns
EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend
becomes unusable for VRAM-to-VRAM KV transfers — the primary EFA use
case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR
ofiwg/libfabric#12216). Until that lands in the
EFA installer's default libfabric, every `--make-efa` image needs to
build and overlay the patched version.

This PR moves that overlay from a downstream post-process script (the
AWS-team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1`
path, which we've been running on top of dynamo images for the v9-efa
fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

  - container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is
    (1.47.0 empirically works on cuda-dl-base; our patched libfabric
    overwrites the stock binary regardless of underlying EFA installer
    version, so a defensive 1.46.0 pin is not strictly needed). Add
    new keys `dynamo.patched_libfabric_repo` and
    `dynamo.patched_libfabric_ref` with defaults `aws/libfabric` and
    `v2.3.1amzn4.0`. Add a defensive comment about 1.48.0 being broken
    on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is
    unsatisfiable on Ubuntu's 50.x) so future bumpers don't trip over it.

  - container/templates/args.Dockerfile: declare two new ARGs
    (gated by `{% if make_efa == true %}`).

  - container/templates/aws.Dockerfile: after the existing EFA installer
    RUN, add a second RUN that installs build deps, clones the patched
    libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`,
    installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME
    symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer
    can populate either; whichever ldconfig sees first wins), deletes
    stock `libfabric.so.1.30.*` binaries (defends against hardcoded
    RPATHs), runs `ldconfig`, and VALIDATES `fi_info --version` reports
    the patched libfabric. The build fails if validation fails — turns
    a deployment-time error into a build-time error.

Non-EFA images are completely unchanged: the patched-libfabric build is
gated by `{% if make_efa == true %}` in args.Dockerfile and only renders
into the aws stage.

The related ofi-nccl rm path fix (Gap B in our investigation; the
existing `rm -rf /opt/amazon/aws-ofi-nccl` targets a path the EFA
installer doesn't create) is intentionally NOT in this PR — it's
owned by ai-dynamo#9703. If ai-dynamo#9703 lands after this PR, rebase ai-dynamo#9703 onto main;
if before, this PR will need a trivial conflict resolution on the
existing rm line.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

  - `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2`
    (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
  - `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2`
    (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered
post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed
nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM
disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds
~3-5 min of build time and ~200-300 MB of image size (build deps left
installed for simplicity — separate PR can purge them if size is a
concern). Non-EFA paths unaffected.

The build deps added (autoconf, automake, libtool, make, build-essential,
pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev,
rdma-core, ca-certificates, git) are a superset of what's already pulled
in by the EFA installer's own `apt install`, so the marginal cost is
small. cuda-dl-base provides `/usr/local/cuda`, which the patched
libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the
`install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

  - ai-dynamo#9703 — `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
  - ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag
  - ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
  - ai-dynamo#9706 — `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil
            dep work as a follow-up)
  - this PR — `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target
runtime --platform linux/amd64 --cuda-version 13.1 --make-efa
--has-trtllm-context` followed by `docker build ...` produces an
EFA-correct image with no post-process script needed. nixlbench (the
validation tool) is intentionally NOT included — it can live as a separate
`--make-nixlbench` flag if desired, or stay as an external layer.

Marked draft as RFC. Maintainers may prefer a different code layout
(separate builder stage to keep build deps out of the runtime, different
default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 19, 2026
Bumps `nixl_ref: 0.10.1 → v1.1.0` in all four occurrences in
container/context.yaml (one in `dynamo:` common section, one each in
the `vllm:`, `sglang:`, `trtllm:` framework sections).

NIXL v1.1.0 brings four fixes that matter on AWS p6e-gb200 EFA fabric:

  ai-dynamo#1461 NUMA-aware EFA rail selection — without this, the LIBFABRIC
        backend assigns all initiator GPUs to a single rail and caps
        aggregate bandwidth at ~1.79 GB/s on p6e-gb200 instead of
        the expected ~190 GB/s with 4 GPUs + 4 EFA NICs per pod.
  ai-dynamo#1510 Active rail tracking for multi-rail concurrent transfers.
  ai-dynamo#1506 Multi-GPU memory-region fix (relevant for TP > 1 workers
        registering large VRAM blocks across GPUs).
  ai-dynamo#1433 Transfer handle repost notification fix.

These performance gains were validated by yutwu (NVIDIA teammate)
during the GLM-5.1 v7→v7.5 image bump on AWS p6e-gb200. For
dynamo-trtllm disagg over EFA, NIXL 0.10.1's rail policy is the
difference between "looks correct, caps at 1.79 GB/s" and "190 GB/s
aggregate".

### Note on the version-tag format

`ai-dynamo/nixl` tags are mixed:
  - Older releases: `0.1.1`, `0.10.0`, `0.10.1` (no `v` prefix)
  - Newer releases: `v1.1.0` (with `v` prefix)

This bump uses `v1.1.0` to match the upstream tag. Dynamo's
wheel_builder.Dockerfile clones via `git checkout ${NIXL_REF}` so the
ref must be exactly the tag name. Verified the `v1.1.0` tag exists at
https://github.com/ai-dynamo/nixl/releases/tag/v1.1.0.
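
Because the ref must match the tag name exactly, a bump script could resolve a version against the known tags before writing it into context.yaml. A minimal sketch, assuming the tag list is supplied by the caller (for example from `git ls-remote --tags`); `resolve_tag` is a hypothetical helper.

```python
def resolve_tag(version: str, tags: list[str]) -> str:
    """Return the exact tag for `version`, trying with and without a `v` prefix."""
    bare = version[1:] if version.startswith("v") else version
    # Prefer the spelling the caller gave, then the two alternatives.
    for candidate in (version, bare, "v" + bare):
        if candidate in tags:
            return candidate
    raise LookupError(f"no tag matches {version!r}")


if __name__ == "__main__":
    # Tag names taken from the note above.
    nixl_tags = ["0.1.1", "0.10.0", "0.10.1", "v1.1.0"]
    print(resolve_tag("1.1.0", nixl_tags))
    print(resolve_tag("v0.10.1", nixl_tags))
```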

### Risk

MEDIUM. This is a major version bump (0.x → 1.x). Specifically:

  - The C++ `libnixl.so` ABI is the primary compat concern. NIXL 1.1.0
    introduced Abseil >= 20240116 as a build dependency (VLOG/absl_log).
    Dynamo's wheel_builder uses Ubuntu 24.04 system Abseil 20220623,
    which lacks these symbols. The EFA patch script in our image-build
    path (yutwu's `install_efa_libfabric_nixl.sh`) builds a newer Abseil
    from source as a workaround; dynamo's wheel_builder needs the same.
    Either:
      (a) Add `libabsl-dev` >= 20240116 to wheel_builder.Dockerfile's
          apt install list (Ubuntu 24.04.4 LTS or backports may have it).
      (b) Build Abseil from source in wheel_builder.Dockerfile similar
          to how we build libfabric.
      (c) Use the system Abseil and patch NIXL 1.1.0 to not require
          absl_log — non-starter, upstream change.
    This PR does NOT include the Abseil bump — it's only the version
    pin. Maintainers will need to validate the build and add the Abseil
    dep in a coordinated change (or a follow-up PR I can write once
    reviewers confirm option a/b).

  - dynamo's Python NIXL bindings (`nixl-cu13` wheel) may have API
    changes between 0.10.1 and 1.1.0. Dynamo's serving code that
    imports `nixl` needs to be verified compatible. A quick smoke
    (worker init + KV transfer) is sufficient.

  - dynamo's plugin loader behavior may differ. The fix2 image-build
    cycle hit a related issue where dynamo's NIXL 0.10.1 plugin's
    Abseil ABI conflicted with NIXL 1.1.0's plugin in the SAME image
    (the `install_efa_libfabric_nixl.sh` post-process layered NIXL
    1.1.0 nixlbench on top of dynamo's bundled 0.10.1). With this bump,
    dynamo's serving plugin AND nixlbench's plugin both use NIXL 1.1.0
    → that conflict goes away.

Marking this PR as **draft** until:

  1. Maintainer agrees on the Abseil dep approach (a/b/c above).
  2. `trtllm-pipeline` / `vllm-pipeline` / `sglang-pipeline` CI runs
     pass with the bump.
  3. A quick smoke of dynamo serving with the rebuilt NIXL confirms
     no API break.

I can iterate as needed.

### References

- NIXL v1.1.0 release: https://github.com/ai-dynamo/nixl/releases/tag/v1.1.0
- Companion PRs: ai-dynamo#9703 (template fixes), ai-dynamo#9704 (render --has-trtllm-context),
  ai-dynamo#9705 (gdrcopy v2.5.2 bump)

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 19, 2026
Bumps `nixl_ref: 0.10.1 → v1.1.0` in all four occurrences in
container/context.yaml (one in `dynamo:` common section, one each in
the `vllm:`, `sglang:`, `trtllm:` framework sections), and adds the
Abseil source-build prereq that NIXL >= 1.0.0 requires.

### Why bump NIXL

NIXL v1.1.0 brings four fixes that matter on AWS p6e-gb200 EFA fabric:

  ai-dynamo#1461 NUMA-aware EFA rail selection — without this, the LIBFABRIC
        backend assigns all initiator GPUs to a single rail and caps
        aggregate bandwidth at ~1.79 GB/s on p6e-gb200 instead of
        the expected ~190 GB/s with 4 GPUs + 4 EFA NICs per pod.
  ai-dynamo#1510 Active rail tracking for multi-rail concurrent transfers.
  ai-dynamo#1506 Multi-GPU memory-region fix (relevant for TP > 1 workers
        registering large VRAM blocks across GPUs).
  ai-dynamo#1433 Transfer handle repost notification fix.

These gains were validated by yutwu (NVIDIA teammate) during the
GLM-5.1 v7→v7.5 image bump on AWS p6e-gb200. For dynamo-trtllm
disagg over EFA, NIXL 0.10.1's rail policy is the difference between
"looks correct, caps at 1.79 GB/s" and "190 GB/s aggregate".

### The Abseil prereq

NIXL >= 1.0.0 uses VLOG(1)/DVLOG(2) in nixl_log.h, which require
Abseil >= 20240116. NIXL's meson.build first searches for system
Abseil via pkg-config:

  absl_base_dep = dependency('absl_base', required: false)
  absl_log_dep = dependency('absl_log', required: false)
  ...
  if absl_base_dep.found() and not absl_log_dep.found()
    error('Your Abseil version is too old: found absl_base but missing
           support for absl_log. Cannot fallback to subproject because
           that would result in a mix of Abseil versions at runtime.')

If pkg-config finds an old Abseil (absl_base present but no absl_log),
NIXL HARD-ERRORS at configure time. Subproject fallback only triggers
when NO system Abseil is found.
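
The three outcomes can be summarized as a small decision function. This is a Python paraphrase of the behavior described above, not NIXL's meson code; `abseil_source` is a hypothetical name, and treating "absl_log found without absl_base" as the subproject case is an assumption, since the excerpt does not cover it.

```python
def abseil_source(absl_base_found: bool, absl_log_found: bool) -> str:
    """Mirror the configure-time choice: 'system', 'subproject', or an error."""
    if absl_base_found and not absl_log_found:
        # Old system Abseil: meson refuses to mix it with a subproject build.
        raise RuntimeError("Abseil too old: absl_base found but absl_log missing")
    if absl_base_found and absl_log_found:
        return "system"
    # No system Abseil visible to pkg-config: fall back to the bundled subproject.
    return "subproject"
```

On stock Ubuntu 24.04 (absl_base present, absl_log absent) this raises, which is the configure-time hard error described above.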

This trips two ways for dynamo:

  1. **wheel_builder (AlmaLinux 8 / manylinux_2_28)** typically lacks
     system Abseil, so subproject fallback would work — BUT the
     subproject builds shared libs with SONAMEs like
     libabsl_*.so.20250814.
  2. **runtime images (cuda-dl-base Ubuntu 24.04)** ship stock
     libabsl-dev 20220623 (libabsl_*.so.20220623 SONAMEs). NIXL's libs
     built against subproject 20250814 can't dlopen against the runtime
     image's 20220623 — SONAME mismatch.

The clean fix is to pre-install Abseil consistently at /usr/local in
wheel_builder so meson uses it deterministically, then propagate the
.so files to runtime stages. This matches yutwu's validated approach
in the GLM-5.1 EFA-patch script (`install_abseil_from_source`).
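
The SONAME mismatch can be detected by comparing the version suffixes of the Abseil libraries in the build and runtime stages. A minimal sketch: the helper names are hypothetical, and the file names follow the `libabsl_*.so.<version>` pattern quoted above.

```python
import re


def soname_versions(filenames: list[str]) -> set[str]:
    """Collect the version suffixes of libabsl_*.so.<version> file names."""
    versions = set()
    for name in filenames:
        match = re.fullmatch(r"libabsl_\w+\.so\.(\S+)", name)
        if match:
            versions.add(match.group(1))
    return versions


def abseil_compatible(build_libs: list[str], runtime_libs: list[str]) -> bool:
    """True when every Abseil version linked at build time exists at runtime."""
    build = soname_versions(build_libs)
    runtime = soname_versions(runtime_libs)
    return bool(build) and build <= runtime
```

With a subproject build (`20250814`) and the stock runtime libraries (`20220623`) this returns False, which is the failure mode the /usr/local pre-install avoids.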

### What changes

1. `container/context.yaml`:
   - Bump `nixl_ref: 0.10.1 → v1.1.0` in all 4 sections.
   - Add `dynamo.abseil_ref: 20240722.0` (the Abseil LTS yutwu validated).

2. `container/templates/args.Dockerfile`:
   - Declare `ARG ABSEIL_REF` (gated by `{% if device == "cuda" %}`,
     same as the other NIXL-related ARGs).

3. `container/templates/wheel_builder.Dockerfile`:
   - New RUN block BEFORE the NIXL clone+build that source-builds
     Abseil ${ABSEIL_REF} to /usr/local with `BUILD_SHARED_LIBS=ON`
     + `ABSL_ENABLE_INSTALL=ON`.
   - Gated on `pkg-config --exists absl_log`, so the block is a no-op
     if a future base image already ships a recent Abseil.
   - Validates `pkg-config --modversion absl_log` succeeds after
     install (build fails if not).

4. `container/templates/dynamo_runtime.Dockerfile`:
   - New COPY line bringing /usr/local/lib/libabsl_*.so* from
     wheel_builder so libnixl can resolve its Abseil deps at dlopen.

5. `container/templates/trtllm_runtime.Dockerfile`:
   - Same Abseil .so COPY. trtllm_runtime overrides the dynamo runtime
     stage, so it needs its own COPY independently.
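
The gate-build-validate flow in item 3 can be sketched as follows. This is an illustration in Python, not the RUN block itself; `ensure_abseil` and `pkg_config_has_absl_log` are hypothetical names, and the actual source build is passed in as a callable so the control flow stays visible.

```python
import subprocess
from typing import Callable


def pkg_config_has_absl_log() -> bool:
    """Probe equivalent to `pkg-config --exists absl_log`."""
    try:
        result = subprocess.run(["pkg-config", "--exists", "absl_log"], check=False)
    except FileNotFoundError:
        # pkg-config itself is missing, so nothing can be found through it.
        return False
    return result.returncode == 0


def ensure_abseil(
    has_absl_log: Callable[[], bool],
    build_from_source: Callable[[], None],
) -> str:
    """Skip if a recent Abseil is visible, else build it and re-check."""
    if has_absl_log():
        return "skipped"
    build_from_source()
    if not has_absl_log():
        # Fail loudly so the image build stops here rather than at NIXL configure.
        raise RuntimeError("absl_log still not visible to pkg-config after install")
    return "built"
```

Re-running the same probe after the build is what turns a silent mis-install into a build failure.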

vllm and sglang frameworks are unaffected at the runtime level —
they use upstream image NIXL packages (nixl-cu12 from vllm-openai,
sglang's bundled NIXL), not dynamo's wheel_builder NIXL. Their
wheel_builder stage still builds Abseil (gated by `device == "cuda"`),
which is consistent with the existing pattern of building UCX/NIXL/etc.
in wheel_builder regardless of whether the framework runtime uses them.

### Why source-build (and not apt-install a newer libabsl-dev)

I evaluated `apt-install libabsl-dev` as an option. Ubuntu 24.04's
main archive ships `libabsl-dev 20220623.1-1build1`, which is exactly
the version NIXL hard-errors on. Backports / -updates / -proposed
don't have a newer version. yutwu ran the same investigation and
arrived at source-build as the only working path.

If a maintainer knows of a repo (NVIDIA apt, PPA, etc.) with a newer
libabsl-dev that works on Ubuntu 24.04 / cuda-dl-base, I'm happy to
swap source-build for apt-install — the source-build adds ~2-3 min
of build time per arch.

### Risk

MEDIUM. This is a major version bump (0.x → 1.x) plus a new build
dependency. Specifically:

  - The C++ libnixl.so ABI is the primary compat concern. Subproject
    fallback would have produced a 20250814 Abseil SONAME mismatch at
    runtime; this PR avoids that by pinning to 20240722.0 consistently
    across build + runtime stages.
  - Python NIXL bindings may have changed across 0.10.1 → v1.1.0.
    Dynamo's serving code that imports `nixl` needs verification.
    Worker init + a basic KV transfer smoke is sufficient.
  - Image size: +~30-40 MB for the Abseil shared libs in runtime
    stages (libabsl_log, libabsl_base, libabsl_strings, libabsl_status,
    libabsl_synchronization, libabsl_time, libabsl_flat_hash_map, etc.
    plus their dependency closure).
  - Build time: +~2-3 min in wheel_builder for the Abseil compile.

### Tag-format note

`ai-dynamo/nixl` mixes tag conventions (older releases lack `v` prefix,
newer ones use it):
  - Older releases: `0.1.1`, `0.10.0`, `0.10.1` (no `v` prefix)
  - Newer releases: `v1.1.0` (with `v` prefix)

This bump uses `v1.1.0` to match the upstream tag. wheel_builder uses
`git checkout ${NIXL_REF}` so the value must be exactly the tag name.

### What's needed before this can merge

- [ ] `trtllm-pipeline` / `vllm-pipeline` / `sglang-pipeline` /
      `dynamo-pipeline` CI pass with both Abseil source-build and
      NIXL v1.1.0.
- [ ] Manual smoke: rebuild a trtllm-runtime image, deploy a 1P1D
      Qwen3-8B disagg, confirm KV transfer works end-to-end.
- [ ] (Stretch) `nixlbench` from inside the rebuilt image hits the
      expected ~190 GB/s on full-node p6e-gb200 allocation.

### References

- NIXL v1.1.0 release: https://github.com/ai-dynamo/nixl/releases/tag/v1.1.0
- NIXL meson Abseil logic:
  https://github.com/ai-dynamo/nixl/blob/v1.1.0/meson.build (the
  hard-error block we're working around)

Companion PRs:
  - ai-dynamo#9703 — `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
  - ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag
  - ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
  - this PR — `build(container)`: nixl_ref v1.1.0 + Abseil prereq
  - ai-dynamo#9727 — `feat(container)`: patched libfabric in aws stage

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 19, 2026
Bumps `nixl_ref: 0.10.1 → v1.1.0` in all four occurrences in
container/context.yaml (one in `dynamo:` common section, one each in
the `vllm:`, `sglang:`, `trtllm:` framework sections), and adds the
Abseil source-build prereq that NIXL >= 1.0.0 requires.

### Why bump NIXL

NIXL v1.1.0 brings four fixes that matter on AWS p6e-gb200 EFA fabric:

  ai-dynamo#1461 NUMA-aware EFA rail selection — without this, the LIBFABRIC
        backend assigns all initiator GPUs to a single rail and caps
        aggregate bandwidth at ~1.79 GB/s on p6e-gb200 instead of
        the expected ~190 GB/s with 4 GPUs + 4 EFA NICs per pod.
  ai-dynamo#1510 Active rail tracking for multi-rail concurrent transfers.
  ai-dynamo#1506 Multi-GPU memory-region fix (relevant for TP > 1 workers
        registering large VRAM blocks across GPUs).
  ai-dynamo#1433 Transfer handle repost notification fix.

These gains were validated by yutwu (NVIDIA teammate) during the
GLM-5.1 v7→v7.5 image bump on AWS p6e-gb200. For dynamo-trtllm
disagg over EFA, NIXL 0.10.1's rail policy is the difference between
"looks correct, caps at 1.79 GB/s" and "190 GB/s aggregate".

### The Abseil prereq

NIXL >= 1.0.0 uses VLOG(1)/DVLOG(2) in nixl_log.h, which require
Abseil >= 20240116. NIXL's meson.build first searches for system
Abseil via pkg-config:

  absl_base_dep = dependency('absl_base', required: false)
  absl_log_dep = dependency('absl_log', required: false)
  ...
  if absl_base_dep.found() and not absl_log_dep.found()
    error('Your Abseil version is too old: found absl_base but missing
           support for absl_log. Cannot fallback to subproject because
           that would result in a mix of Abseil versions at runtime.')

If pkg-config finds an old Abseil (absl_base present but no absl_log),
NIXL HARD-ERRORS at configure time. Subproject fallback only triggers
when NO system Abseil is found.

This trips two ways for dynamo:

  1. **wheel_builder (AlmaLinux 8 / manylinux_2_28)** typically lacks
     system Abseil, so subproject fallback would work — BUT the
     subproject builds shared libs with SONAMEs like
     libabsl_*.so.20250814.
  2. **runtime images (cuda-dl-base Ubuntu 24.04)** ship stock
     libabsl-dev 20220623 (libabsl_*.so.20220623 SONAMEs). NIXL's libs
     built against subproject 20250814 can't dlopen against the runtime
     image's 20220623 — SONAME mismatch.

The clean fix is to pre-install Abseil consistently at /usr/local in
wheel_builder so meson uses it deterministically, then propagate the
.so files to runtime stages. This matches yutwu's validated approach
in the GLM-5.1 EFA-patch script (`install_abseil_from_source`).

### What changes

1. `container/context.yaml`:
   - Bump `nixl_ref: 0.10.1 → v1.1.0` in all 4 sections.
   - Add `dynamo.abseil_ref: 20240722.0` (the Abseil LTS yutwu validated).

2. `container/templates/args.Dockerfile`:
   - Declare `ARG ABSEIL_REF` (gated by `{% if device == "cuda" %}`,
     same as the other NIXL-related ARGs).

3. `container/templates/wheel_builder.Dockerfile`:
   - New RUN block BEFORE the NIXL clone+build that source-builds
     Abseil ${ABSEIL_REF} to /usr/local with `BUILD_SHARED_LIBS=ON`
     + `ABSL_ENABLE_INSTALL=ON`.
   - Gated by `pkg-config --exists absl_log` as a no-op if a future
     base image already ships a recent Abseil.
   - Validates `pkg-config --modversion absl_log` succeeds after
     install (build fails if not).

4. `container/templates/dynamo_runtime.Dockerfile`:
   - New COPY line bringing /usr/local/lib/libabsl_*.so* from
     wheel_builder so libnixl can resolve its Abseil deps at dlopen.

5. `container/templates/trtllm_runtime.Dockerfile`:
   - Same Abseil .so COPY. trtllm_runtime overrides the dynamo runtime
     stage, so it needs its own COPY independently.

vllm and sglang frameworks are unaffected at the runtime level —
they use upstream image NIXL packages (nixl-cu12 from vllm-openai,
sglang's bundled NIXL), not dynamo's wheel_builder NIXL. Their
wheel_builder stage still builds Abseil (gated by `device == "cuda"`),
which is consistent with the existing pattern of building UCX/NIXL/etc.
in wheel_builder regardless of whether the framework runtime uses them.

### Why source-build (and not apt-install a newer libabsl-dev)

I evaluated `apt-install libabsl-dev` as an option. Ubuntu 24.04's
main archive ships `libabsl-dev 20220623.1-1build1`, which is exactly
the version NIXL hard-errors on. Backports / -updates / -proposed
don't have a newer version. yutwu hit the same investigation and
arrived at source-build as the only working path.

If a maintainer knows of a repo (NVIDIA apt, PPA, etc.) with a newer
libabsl-dev that works on Ubuntu 24.04 / cuda-dl-base, I'm happy to
swap source-build for apt-install — the source-build adds ~2-3 min
of build time per arch.

### Risk

MEDIUM. This is a major version bump (0.x → 1.x) plus a new build
dependency. Specifically:

  - The C++ libnixl.so ABI is the primary compat concern. Subproject
    fallback would have produced a 20250814 Abseil SONAME mismatch at
    runtime; this PR avoids that by pinning to 20240722.0 consistently
    across build + runtime stages.
  - Python NIXL bindings may have changed across 0.10.1 → v1.1.0.
    Dynamo's serving code that imports `nixl` needs verification.
    Worker init + a basic KV transfer smoke is sufficient.
  - Image size: +~30-40 MB for the Abseil shared libs in runtime
    stages (libabsl_log, libabsl_base, libabsl_strings, libabsl_status,
    libabsl_synchronization, libabsl_time, libabsl_flat_hash_map, etc.
    plus their dependency closure).
  - Build time: +~2-3 min in wheel_builder for the Abseil compile.

### Tag-format note

`ai-dynamo/nixl` mixes tag conventions (older releases lack `v` prefix,
newer ones use it):
  - Older releases: `0.1.1`, `0.10.0`, `0.10.1` (no `v` prefix)
  - Newer releases: `v1.1.0` (with `v` prefix)

This bump uses `v1.1.0` to match the upstream tag. wheel_builder uses
`git checkout ${NIXL_REF}` so the value must be exactly the tag name.
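
Because the checkout needs an exact tag name, a pre-flight check along the following lines could catch a missing or extra `v` before the build starts. This is only a sketch: `ref_matches_tag` is a hypothetical helper, and the tag list is passed in explicitly (in a real clone it could come from `git tag --list`).

```shell
# Hypothetical helper: succeed only if REF is exactly one of the given tags.
# Usage: ref_matches_tag REF TAG [TAG...]
ref_matches_tag() {
    ref="$1"
    shift
    for tag in "$@"; do
        # Exact string comparison: "1.1.0" must not match "v1.1.0".
        if [ "$tag" = "$ref" ]; then
            return 0
        fi
    done
    return 1
}
```

With the tags named above, `ref_matches_tag v1.1.0 0.10.0 0.10.1 v1.1.0` succeeds, whereas `ref_matches_tag 1.1.0 0.10.0 0.10.1 v1.1.0` fails.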

### What's needed before this can merge

- [ ] `trtllm-pipeline` / `vllm-pipeline` / `sglang-pipeline` /
      `dynamo-pipeline` CI pass with both Abseil source-build and
      NIXL v1.1.0.
- [ ] Manual smoke: rebuild a trtllm-runtime image, deploy a 1P1D
      Qwen3-8B disagg, confirm KV transfer works end-to-end.
- [ ] (Stretch) `nixlbench` from inside the rebuilt image hits the
      expected ~190 GB/s on full-node p6e-gb200 allocation.

### References

- NIXL v1.1.0 release: https://github.com/ai-dynamo/nixl/releases/tag/v1.1.0
- NIXL meson Abseil logic:
  https://github.com/ai-dynamo/nixl/blob/v1.1.0/meson.build (the
  hard-error block we're working around)

Companion PRs:
  - ai-dynamo#9703 — `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
  - ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag
  - ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
  - this PR — `build(container)`: nixl_ref v1.1.0 + Abseil prereq
  - ai-dynamo#9727 — `feat(container)`: patched libfabric in aws stage

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 19, 2026
The `--make-efa` aws stage currently installs only the stock AWS EFA
userspace via `efa_installer.sh -y --skip-kmod ...`. That stock libfabric
(2.4.0amzn1.0 in EFA installer 1.46.0+) has a CUDA HMEM bug: when
registering VRAM as an EFA memory region on GB200 hardware, fi_mr_reg
falls through to ibv_reg_mr() with a GPU virtual address and returns
EFAULT (`fi_mr_reg ... Bad address`). The entire LIBFABRIC NIXL backend
becomes unusable for VRAM-to-VRAM KV transfers — the primary EFA use
case for dynamo-trtllm disagg.

The fix is in aws/libfabric v2.3.1amzn4.0 (corresponds to upstream PR
ofiwg/libfabric#12216). Until that lands in the
EFA installer's default libfabric, every `--make-efa` image needs to
build and overlay the patched version.

This PR moves that overlay from a downstream post-process script (the
AWS team's install_efa_libfabric_nixl.sh `PATCH_LIBFABRIC_EFA_CUDA_DMABUF=1`
path, which we've been running on top of dynamo images for the v9-efa
fix1/fix2 cycle) into the aws.Dockerfile template itself.

What changes:

  - container/context.yaml: leave `dynamo.efa_version: 1.47.0` as-is
    (1.47.0 empirically works on cuda-dl-base; our patched libfabric
    overwrites the stock binary regardless of underlying EFA installer
    version, so a defensive 1.46.0 pin is not strictly needed). Add
    new keys `dynamo.patched_libfabric_repo` and
    `dynamo.patched_libfabric_ref` with defaults `aws/libfabric` and
    `v2.3.1amzn4.0`. Add a defensive comment about 1.48.0 being broken
    on Ubuntu 24.04 (libfabric1-aws's ibverbs-providers >= 59 dep is
    unsatisfiable on Ubuntu's 50.x) so future bumpers don't trip over it.

  - container/templates/args.Dockerfile: declare two new ARGs
    (gated by `{% if make_efa == true %}`).

  - container/templates/aws.Dockerfile: after the existing EFA installer
    RUN, add a second RUN that installs build deps, clones the patched
    libfabric, builds with `--enable-efa --with-cuda --enable-cuda-dlopen`,
    installs over `/opt/amazon/efa/`, forces the `libfabric.so.1` SONAME
    symlink to the patched binary in BOTH `lib` and `lib64` (EFA installer
    can populate either; whichever ldconfig sees first wins), deletes
    stock `libfabric.so.1.30.*` binaries (defends against hardcoded
    RPATHs), runs `ldconfig`, and VALIDATES `fi_info --version` reports
    the patched libfabric. The build fails if validation fails — turns
    a deployment-time error into a build-time error.
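
The symlink-forcing and cleanup steps in the aws.Dockerfile bullet can be sketched in shell roughly as follows. This is an illustration, not the RUN block from the PR: `force_libfabric_soname` and `version_text_has` are hypothetical helper names, and the file name of the patched shared object is an assumption that the caller supplies. The `ldconfig` call and the build itself are left out.

```shell
# Hypothetical helper, not the actual aws.Dockerfile RUN block.
# PREFIX:  install prefix (the commit message uses /opt/amazon/efa)
# PATCHED: absolute path of the freshly built shared object; its file
#          name is an assumption and is passed in by the caller.
force_libfabric_soname() {
    prefix="$1"
    patched="$2"
    for dir in "$prefix/lib" "$prefix/lib64"; do
        [ -d "$dir" ] || continue
        # Drop stock 1.30.x binaries so a hardcoded RPATH cannot find them.
        for stale in "$dir"/libfabric.so.1.30.*; do
            [ -e "$stale" ] || continue
            [ "$stale" = "$patched" ] && continue
            rm -f "$stale"
        done
        # Point the SONAME link at the patched binary in this directory.
        ln -sfn "$patched" "$dir/libfabric.so.1"
    done
}

# Succeeds only if the captured version text mentions the expected string.
version_text_has() {
    printf '%s\n' "$1" | grep -qF -- "$2"
}
```

In a RUN block the first argument to `version_text_has` would be the captured output of `fi_info --version`. The exact text that command prints is not shown here, so the expected string is left to the caller. A failing check then aborts the image build.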

Non-EFA images are completely unchanged: the patched-libfabric build is
gated by `{% if make_efa == true %}` in args.Dockerfile and only renders
into the aws stage.

The related ofi-nccl rm path fix (Gap B in our investigation; the
existing `rm -rf /opt/amazon/aws-ofi-nccl` targets a path the EFA
installer doesn't create) is intentionally NOT in this PR — it's
owned by ai-dynamo#9703. If ai-dynamo#9703 lands after this PR, rebase ai-dynamo#9703 onto main;
if before, this PR will need a trivial conflict resolution on the
existing rm line.

### Validation evidence

Built and validated as the v9-efa-fix2 images:

  - `head-pr13713-dynamomain-v9-efa-arm64-gb200-fix2`
    (sha256:5934bc9b8809fe0e900ea79cfe0648e88ff5474fc2fcc2ffa5668ed11bb82337)
  - `head-pr13713-dynamomain-v9-efa-x86-h100-b200-fix2`
    (sha256:2a0857c8e4fbabafa6e2e86c33d7e194b1bcbb740b681906a8067a0018129c22)

Both images shipped this exact patched libfabric stack via a layered
post-process. On AWS dev-01 (p6e-gb200), the patched libfabric passed
nixlbench LIBFABRIC backend at expected single-rail ~47 GB/s; live TRT-LLM
disagg with Qwen3-8B BF16, TP=4 served 5/5 stable requests with HTTP 200.

### Risk

MEDIUM. The aws stage now compiles libfabric from source, which adds
~3-5 min of build time and ~200-300 MB of image size (build deps left
installed for simplicity — separate PR can purge them if size is a
concern). Non-EFA paths unaffected.

The build deps added (autoconf, automake, libtool, make, build-essential,
pkg-config, libnl-3-dev, libnl-route-3-dev, libnuma-dev, libibverbs-dev,
rdma-core, ca-certificates, git) largely overlap with what the EFA
installer's own `apt install` already pulls in, so the marginal cost is
small. cuda-dl-base provides `/usr/local/cuda`, which the patched
libfabric needs for `--with-cuda`.

### Companion PRs

This PR is the final missing piece to internalize what the
`install_efa_libfabric_nixl_fix2.sh` post-process does. The other pieces:

  - ai-dynamo#9703 — `fix(container)`: ofi-nccl rm path + clearer HAS_TRTLLM_CONTEXT error
  - ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag
  - ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
  - ai-dynamo#9706 — `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil
            dep work as a follow-up)
  - this PR — `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target
runtime --platform linux/amd64 --cuda-version 13.1 --make-efa
--has-trtllm-context` followed by `docker build ...` produces an
EFA-correct image with no post-process script needed. nixlbench (the
validation tool) is intentionally NOT included — it can live as a separate
`--make-nixlbench` flag if desired, or stay as an external layer.

Marked as draft (RFC). Maintainers may prefer a different code layout
(separate builder stage to keep build deps out of the runtime, different
default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 21, 2026
… GDRCopy known issues

Three additions to docs/kubernetes/cloud-providers/eks/efa.md, all inside
the existing Known Issues block:

1. Issue 1 (libfabric CUDA dmabuf): the existing workaround was missing
   three defensive steps, and without them it silently produces a broken
   image on GB200: SONAME symlink force (make install does NOT overwrite
   the EFA installer's libfabric.so.1, so the stock 1.30.x wins at
   lookup), stock binary cleanup (defends against hardcoded RPATHs), and
   build-time fi_info --version validation (fails the build instead of
   failing in production). Added all three inline, with a Tracking PR
   pointer to ai-dynamo#9727, which upstreams the same fix.

2. Issue 2 (NEW): TRT-LLM rc14's libtensorrt_llm_nixl_wrapper.so is
   compiled against NIXL 0.9.x and references types dropped in NIXL ≥ 1.0
   (nixlDescList<nixlBlobDesc>, <nixlBasicDesc>). Bumping dynamo's
   nixl_ref to v1.1.0 CrashLoopBackOffs every TRT-LLM disagg pod at
   executor init. Workaround: keep nixl_ref at 0.10.1; don't merge ai-dynamo#9706
   until TRT-LLM upgrades its wrapper.

3. Issue 3 (NEW): GDRCopy v2.5.1 source RPM fails to compile against host
   kernel ≥ 6.15 (vm_flags_set redefinition). /dev/gdrdrv missing →
   GPU-Direct RDMA falls back to slower paths. Workaround: bump
   nixl_gdrcopy_ref to v2.5.2 (tracked in ai-dynamo#9705).

Also added a summary table at the top of Known Issues for at-a-glance
triage, and two new rows to the Common Failure Modes table mapping the
new symptom signatures to Issues 2 and 3.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 21, 2026
…el-6.15 known issue

Two additions to docs/kubernetes/cloud-providers/eks/efa.md, both inside
the existing Known Issues block:

1. Issue 1 (libfabric CUDA dmabuf): the existing workaround was missing
   three defensive steps, and without them it silently produces a broken
   image on GB200: SONAME symlink force (make install does NOT overwrite
   the EFA installer's libfabric.so.1, so the stock 1.30.x wins at
   lookup), stock binary cleanup (defends against hardcoded RPATHs), and
   build-time fi_info --version validation (fails the build instead of
   failing in production). Added all three inline, with a Tracking PR
   pointer to ai-dynamo#9727, which upstreams the same fix.

2. Issue 2 (NEW): GDRCopy v2.5.1 source RPM fails to compile against host
   kernel ≥ 6.15 (vm_flags_set redefinition). /dev/gdrdrv missing →
   GPU-Direct RDMA falls back to slower paths. Workaround: bump
   nixl_gdrcopy_ref to v2.5.2 (tracked in ai-dynamo#9705).

Also added a summary table at the top of Known Issues for at-a-glance
triage, and one new row to the Common Failure Modes table mapping the
new symptom signature to Issue 2.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
GDRCopy v2.5.1 fails to build on host kernel ≥ 6.15 with a
`vm_flags_set` redefinition error (Linux 6.15 changed the symbol's
declaration). v2.5.2 fixes this. NVIDIA/gdrcopy v2.5.2 release notes:
https://github.com/NVIDIA/gdrcopy/releases/tag/v2.5.2

This matters for EKS deployments on AMIs that ship kernels ≥ 6.15
(currently most non-Amazon-Linux Ubuntu 24.04 nodes). With v2.5.1,
the wheel_builder stage builds the GDRCopy *userspace* library fine
but the resulting kmod can't compile against a 6.15+ host kernel —
relevant when the host needs to load the kmod for GPU Direct RDMA.
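
To check whether a given deploy host falls into the affected range, a comparison on the kernel release string is enough. The sketch below is illustrative only: `kernel_at_least` is a hypothetical helper, and it assumes the release string starts with `MAJOR.MINOR` as `uname -r` output normally does.

```shell
# Hypothetical helper: succeed if RELEASE (as printed by `uname -r`)
# is at least WANT_MAJOR.WANT_MINOR.
kernel_at_least() {
    release="$1"
    want_major="$2"
    want_minor="$3"
    major=${release%%.*}        # text before the first dot
    rest=${release#*.}          # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    [ "$major" -gt "$want_major" ] || {
        [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]
    }
}
```

On a node, `kernel_at_least "$(uname -r)" 6 15` succeeding would indicate that the GDRCopy v2.5.1 kmod is expected to hit the `vm_flags_set` build failure described above.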

The userspace library v2.5.2 is fully backward-compatible with v2.5.1
APIs, so the wheel_builder stage and NIXL linkage are unaffected.

Tested:
  - render trtllm runtime with `--make-efa`, build the wheel_builder
    stage, confirm `gdrcopy/CHANGELOG.md` inside the image lists 2.5.2

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
@yifjiang yifjiang force-pushed the yifjiang/bump-nixl-gdrcopy-2.5.2 branch from 44da767 to 1970b32 Compare May 21, 2026 17:58
@yifjiang
Contributor Author

/ok to test 1970b32

  - ai-dynamo#9704 — `feat(container)`: render.py --has-trtllm-context flag
  - ai-dynamo#9705 — `build(container)`: nixl_gdrcopy_ref v2.5.1 → v2.5.2
  - ai-dynamo#9706 — `build(container)`: nixl_ref 0.10.1 → v1.1.0 (with Abseil
            dep work as a follow-up)
  - this PR — `feat(container)`: patched libfabric in aws stage

Together: a single `python3 container/render.py --framework trtllm --target
runtime --platform linux/amd64 --cuda-version 13.1 --make-efa
--has-trtllm-context` followed by `docker build ...` produces an
EFA-correct image with no post-process script needed. nixlbench (the
validation tool) is intentionally NOT included — it can live as a separate
`--make-nixlbench` flag if desired, or stay as an external layer.

Marked as a draft RFC. Maintainers may prefer a different code layout
(separate builder stage to keep build deps out of the runtime, different
default repo/ref, or different validation approach). Happy to iterate.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 22, 2026
…el-6.15 known issue

Two additions to docs/kubernetes/cloud-providers/eks/efa.md, both inside
the existing Known Issues block:

1. Issue 1 (libfabric CUDA dmabuf): the existing workaround was missing
   three defensive steps, and without them it silently produces a broken
   image on GB200: forcing the SONAME symlink (make install does NOT
   overwrite the EFA installer's libfabric.so.1, so lookup still resolves
   to the stock 1.30.x binary), cleaning up the stock binaries (defends
   against hardcoded RPATHs), and validating `fi_info --version` at build
   time (fails the build instead of failing in production). Added all
   three inline, with a Tracking PR pointer to ai-dynamo#9727, which
   upstreams the same fix.

2. Issue 2 (NEW): GDRCopy v2.5.1 source RPM fails to compile against host
   kernel ≥ 6.15 (vm_flags_set redefinition). As a result /dev/gdrdrv is
   missing and GPU-Direct RDMA falls back to slower paths. Workaround: bump
   nixl_gdrcopy_ref to v2.5.2 (tracked in ai-dynamo#9705).
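
As a rough illustration of the kernel constraint in Issue 2, the sketch below checks whether a kernel release string falls in the affected range. It is not part of the docs or the container build; the helper names `kernel_version` and `needs_gdrcopy_bump` are hypothetical, and the only fact taken from the text is the 6.15 threshold.

```python
import re

# From the text: GDRCopy v2.5.1 fails to build against kernel >= 6.15.
FIRST_AFFECTED = (6, 15)


def kernel_version(release):
    """Extract (major, minor) from a `uname -r` style release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def needs_gdrcopy_bump(release):
    """True when the kernel is new enough to hit the v2.5.1 build failure."""
    return kernel_version(release) >= FIRST_AFFECTED
```

On a deploy host this could be called as `needs_gdrcopy_bump(platform.release())`, where `platform.release()` from the standard library returns the running kernel's release string.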

Also added a summary table at the top of Known Issues for at-a-glance
triage, and one new row to the Common Failure Modes table mapping the
new symptom signature to Issue 2.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
@yifjiang yifjiang marked this pull request as ready for review May 22, 2026 17:29
@yifjiang yifjiang requested review from a team as code owners May 22, 2026 17:29
@coderabbitai
Contributor

coderabbitai Bot commented May 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cb427b78-c093-4170-a856-62fc4b462f34

📥 Commits

Reviewing files that changed from the base of the PR and between 89f6190 and 1970b32.

📒 Files selected for processing (1)
  • container/context.yaml

Walkthrough

The container/context.yaml file updates the Dynamo CUDA/NIXL build configuration by bumping the dynamo.nixl_gdrcopy_ref version reference from v2.5.1 to v2.5.2.

Changes

Configuration version update

| Layer / File(s) | Summary |
|---|---|
| NIXL GDRCopy version reference (`container/context.yaml`) | The dynamo.nixl_gdrcopy_ref ARG default is bumped from v2.5.1 to v2.5.2 in the Dynamo container build configuration. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and specifically describes the main change: bumping nixl_gdrcopy_ref to v2.5.2 and notes the kernel ≥6.15 fix rationale. |
| Description check | ✅ Passed | The description provides comprehensive context including summary, detailed rationale, companion PRs, risk assessment, and test plan. However, it does not follow the required template structure with explicit Overview, Details, Where should the reviewer start, and Related Issues sections. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@yifjiang yifjiang merged commit fffea8c into ai-dynamo:main May 22, 2026
96 checks passed
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 23, 2026
…es are merged

Removes the "Known Issues" section from docs/kubernetes/cloud-providers/eks/efa.md
and prunes the two now-obsolete rows from "Common Failure Modes". Assumes
ai-dynamo#9703, ai-dynamo#9704, ai-dynamo#9705, and ai-dynamo#9727 are all merged — after those land, the issues
this section documented (GB200 fi_mr_reg(VRAM) failure on the EFA installer's
stock libfabric, and GDRCopy v2.5.1 kmod build failure on kernel >= 6.15) no
longer affect default --make-efa builds, so the inline workarounds the section
provided would mislead readers.

Also removes the ofiwg/libfabric#12019 reference from the bottom links list
since it points at the same now-resolved upstream issue.

Net diff: -34 / +1.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>

Labels

build, container, external-contribution, size/XS
