Add Elasticity Support via NIXL Integration #465
Add Elasticity Support to DeepEP via NIXL Integration
Summary
This PR adds elastic scaling capabilities to DeepEP, enabling dynamic addition and removal of processes (ranks) at runtime, on demand, without affecting existing connections. This is achieved by integrating the NVIDIA Inference Xfer Library (NIXL), a high-performance communication library that uses RDMA and NVLink transports and supports dynamic rank management.
Note: Currently, this PR replaces NVSHMEM calls with NIXL calls. However, we would like to discuss the best way to enable support for multiple communication libraries.
Included in this PR:
✅ Integrated NIXL with DeepEP (low-latency & internode modes), tested with DeepEP's original benchmark
✅ Tested with TRT-LLM, vLLM, and SGLang
✅ Introduced new buffer APIs for elastic addition/removal of ranks
✅ Extended DeepEP's benchmark to add/remove ranks at runtime
Next Steps
⬜ Support multiple calls to `update_memory_buffers()` for elastic allocation/deallocation of GPU memory
⬜ Integrate DeepEP's failure detection with the `remove_ranks` API
⬜ Support elasticity in intranode kernels
New Buffer Initialization Pattern:
New Buffer APIs:
- `nixl_buffer(rank_id, low_latency_mode, low_latency_nvlink_backend, explicitly_destroy)`: Initialize the NIXL communication buffer
- `update_memory_buffers(num_ranks, num_experts_per_rank, num_nvl_bytes, num_rdma_bytes)`: Prepare buffers for up to `num_ranks` ranks and `num_experts_per_rank` experts
- `connect_ranks(remote_ranks)`: Establish NIXL connections to new peers (can be called multiple times)
- `remove_ranks(remote_ranks)`: Clean up connections to departing peers
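A minimal usage sketch of the initialization pattern built from the APIs above, assuming `nixl_buffer` constructs a buffer object and the remaining calls are methods on it; the `deep_ep` import path, the argument values, and the 16-rank / 1 GiB sizing are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch -- import path, call style, and all argument values are
# assumptions based on the API names listed above.
from deep_ep import Buffer  # assumed import path

rank = 0           # this process's rank id
max_ranks = 16     # upper bound on the number of ranks we may scale up to

# Initialize the NIXL communication buffer for this rank.
buffer = Buffer.nixl_buffer(
    rank_id=rank,
    low_latency_mode=True,
    low_latency_nvlink_backend=False,
    explicitly_destroy=True,
)

# Pre-allocate GPU memory sized for the maximum expected world size.
buffer.update_memory_buffers(
    num_ranks=max_ranks,
    num_experts_per_rank=16,
    num_nvl_bytes=0,
    num_rdma_bytes=1 << 30,
)

# Connect to the peers that are currently present; connect_ranks() can be
# called again later whenever new ranks join.
buffer.connect_ranks(remote_ranks=[r for r in range(8) if r != rank])

# ... run dispatch / combine as usual ...

# When peers depart (scale-down), tear down only those connections.
buffer.remove_ranks(remote_ranks=[6, 7])
```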
Testing
New elastic test suite in `tests/elastic/`.
Example Plan (`expansion_contraction.json`): this plan defines three phases.
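The plan file itself is not reproduced here; as a purely illustrative assumption (the field names and schema below are hypothetical, not the actual test format), a three-phase expansion/contraction plan could look roughly like this:

```json
{
  "phases": [
    { "name": "baseline",    "ranks": [0, 1, 2, 3, 4, 5, 6, 7] },
    { "name": "expansion",   "add_ranks": [8, 9, 10, 11, 12, 13, 14, 15] },
    { "name": "contraction", "remove_ranks": [8, 9, 10, 11, 12, 13, 14, 15] }
  ]
}
```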
Performance Testing
All benchmarks were conducted on 2 NVIDIA EOS cluster nodes (8× H100 GPUs and 8× CX7 NICs per node, InfiniBand interconnect), totaling 16 ranks.
Low-Latency Kernels (128 tokens, hidden = 7168, top-k = 8, 16 ranks, 32 experts):
Internode Kernels (4096 tokens, hidden = 7168, top-k = 8, 16 ranks, 256 experts):
Example Launch
Refer to NIXL_README.md for detailed instructions on how to run the elastic test suite.