[Feature] Eager RDMA for lower latency, fenceless token delivery #437
TL;DR: This feature branch replaces the standard “write-then-atomic ACK” path with an Eager RDMA layout that bundles data and per-token signals into a single IBV_WR_RDMA_WRITE. It removes the extra RTT of latency introduced by atomics on multi-path NICs and enables fine-grained overlap at the receiver. In our tests (Hopper + BlueField-3, multi-path enabled), Eager RDMA reduces latency by up to 20%.

Background
DeepEP’s current implementation follows the InfiniBand Architecture Specification and uses a standard acknowledgement sequence on the same QP:

1. IBV_WR_RDMA_WRITE (the data “tokens”).
2. IBV_WR_ATOMIC_FETCH_AND_ADD as the confirmation.

This guarantees that all tokens posted before the atomic are visible when the atomic is observed.
On modern multi-path capable adapters (e.g., NVIDIA ConnectX-7, BlueField-3) and multi-path fabrics (InfiniBand Adaptive Routing, multi-path Ethernet), the network may reorder the atomic and write packets. To preserve ordering semantics, the stack will refrain from issuing the atomic until all in-flight writes on that QP have completed (i.e., the fence)—effectively injecting one RTT of latency for each atomic that follows writes.
Eager RDMA removes atomics and fences from the critical path by interleaving data and a small “signal” field inside every write WQE. The sender emits a single IBV_WR_RDMA_WRITE that carries both payload and signals; the receiver polls those signals directly in device memory to determine readiness. This also enables per-token acknowledgement: the receiver can process tokens as soon as they’re ready (fine-grained overlap between intra-node D2D copies and inter-node sending/receiving).

Requirements
Ordering Guarantees
On Hopper or newer GPUs, the DMA writes arriving from the RNIC to GPU memory and the memory update order visible to GPU kernels are ordered when all of the following hold:
- CU_POINTER_ATTRIBUTE_SYNC_MEMOPS is set via cuPointerSetAttribute (NVSHMEM sets this by default).
- Relaxed ordering is not used, i.e., memory regions are registered without IBV_ACCESS_RELAXED_ORDERING (by disabling NVSHMEM_IB_ENABLE_RELAXED_ORDERING).

Under these conditions, when a full-MTU RDMA write completes, the GPU’s memory view is ordered: once the last 16-byte signal is updated, the preceding 4080 bytes of data in that MTU tile are also visible.
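The practical consequence of this guarantee can be sketched as a readiness check: observing the trailing signal of a tile is sufficient to conclude the payload in front of it is visible. The sizes follow the text (4096-byte tile, 16-byte signal); the function name and signal encoding are illustrative, not DeepEP’s actual API.

```c
#include <stdint.h>
#include <string.h>

/* Sizes from the text: a 4096B MTU tile = 4080B payload + 16B signal. */
#define TILE_SIZE  4096
#define DATA_BYTES 4080
#define SIG_BYTES  16

/* Returns 1 when the tile's trailing 16-byte signal matches the expected
 * value. Under the ordering guarantee above, a matching signal implies the
 * preceding 4080 payload bytes of the tile are also visible.
 * On the GPU this read would be a volatile/acquire load from device
 * memory; plain memcmp is enough for this host-side sketch. */
static int tile_ready(const uint8_t *tile, const uint8_t expected_sig[SIG_BYTES])
{
    return memcmp(tile + DATA_BYTES, expected_sig, SIG_BYTES) == 0;
}
```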
Layout & Data Path
Sender layout
Each 4096-byte MTU tile is laid out as:
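Based on the sizes stated in the ordering section (4080 bytes of payload followed by a 16-byte signal per 4096-byte tile), the layout and the mapping from the sender’s contiguous view into the interleaved view can be sketched as follows; the struct and helper names are illustrative assumptions, not the actual DeepEP definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative tile layout (field order per the text: payload first,
 * signal in the last 16 bytes of each 4096B MTU tile). */
struct eager_tile {
    uint8_t data[4080];   /* token payload */
    uint8_t signal[16];   /* per-tile readiness signal */
};

/* Map a byte offset in the sender's contiguous ("normal") view to its
 * offset in the interleaved Eager RDMA view: every 4080 payload bytes
 * are followed by a 16-byte signal slot. */
static size_t eager_offset(size_t dense_off)
{
    size_t tile   = dense_off / 4080;  /* which MTU tile */
    size_t within = dense_off % 4080;  /* offset inside that tile's payload */
    return tile * sizeof(struct eager_tile) + within;
}
```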
The sender maps its normal memory view to the Eager RDMA view and issues IBV_WR_RDMA_WRITE WQEs with this layout. The original atomics are replaced by RDMA writes.

Receiver progress
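A minimal sketch of the receiver side, under the same assumed layout: the receiver scans tile signals in order and consumes tokens as soon as each tile’s signal lands, which is what enables the fine-grained overlap described above. On the GPU this loop would poll device memory directly; the host-side version below only illustrates the progress logic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_SIZE 4096
#define DATA_OFF  4080  /* the 16-byte signal occupies the tile's last bytes */

/* Hypothetical progress sketch: return how many consecutive tiles (from
 * the front of the buffer) have their signal set, so the caller can
 * process those tokens immediately without waiting for a fence or an
 * atomic covering the whole batch. */
static size_t poll_ready_tiles(const uint8_t *buf, size_t ntiles,
                               const uint8_t sig[16])
{
    size_t ready = 0;
    while (ready < ntiles &&
           memcmp(buf + ready * TILE_SIZE + DATA_OFF, sig, 16) == 0)
        ready++;
    return ready;
}
```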
Programming Model & APIs
Eager RDMA provides two ways to integrate with existing kernels:
Wrapped load/store APIs (global memory):
If your code uses explicit loads/stores, call the Eager RDMA wrappers to access the interleaved view safely.
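The shape of such a wrapper can be sketched as an address translation on load: the caller keeps using logical payload offsets, and the wrapper redirects the access through the interleaved view. The name `eager_load_u32` and its signature are hypothetical, for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper sketch (not DeepEP's actual API): load a 32-bit
 * value at a logical payload offset from a buffer in the Eager RDMA
 * layout (4080B payload + 16B signal per 4096B tile), skipping over the
 * interleaved signal slots. */
static uint32_t eager_load_u32(const uint8_t *eager_buf, size_t dense_off)
{
    size_t tile   = dense_off / 4080;
    size_t within = dense_off % 4080;
    uint32_t v;
    /* Assumes the 4-byte value does not straddle a tile boundary. */
    memcpy(&v, eager_buf + tile * 4096 + within, sizeof v);
    return v;
}
```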
In-place TMA view transforms (shared memory):
If your code uses TMA (Tensor Memory Accelerator), use the provided in-place transform to construct/deconstruct the Eager RDMA layout when staging through shared memory.
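To make the “deconstruct” direction of such a transform concrete, here is an out-of-place host-side sketch that strips the signal slots back out of a staged Eager RDMA buffer, leaving dense token payload. The real transform operates in place on shared memory around TMA copies; only the layout change is illustrated here, and the function name is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy the payload bytes out of an Eager RDMA
 * staged buffer (4080B data + 16B signal per 4096B tile) into a dense
 * token buffer, dropping the signal slots. */
static void eager_deconstruct(uint8_t *dense, const uint8_t *eager,
                              size_t ntiles)
{
    for (size_t t = 0; t < ntiles; t++)
        memcpy(dense + t * 4080, eager + t * 4096, 4080);
}
```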
Performance