Allow NVSHMEM PE to NIC mapping to be initialized by DeepEP rank #466
The default `nvshmemi_get_devices_by_distance` initialization method in NVSHMEM does not work optimally for GPU configurations where two GPUs and two RDMA NICs share a PCIe bus, such as the x86-based GCP A3 Ultra (H200) and A4 (B200) instance types: https://cloud.google.com/compute/docs/gpus/gpu-network-bandwidth#h200-gpus. GPU0 and GPU1 (on two independent processes) observe NIC0 and NIC1 on the same PCIe switch as equidistant, and the default DeepEP + NVSHMEM configuration results in both GPUs being assigned NIC0 by `nvshmemi_get_devices_by_distance`, halving the observed RDMA bandwidth in test_internode.py and in vLLM wide-EP (because only 4 of the 8 NICs are enabled).

The alternative is a static mapping between GPU host index (PE) and NIC index (HCA), but the `NVSHMEMX_INIT_WITH_UNIQUEID` initialization method bypasses setting `mype_node` and `npes_node`. With this initialization method, `nvshmemi_boot_handle.pg_rank` is always 0 and `nvshmemi_boot_handle.pg_size` is always 2, so `mype_node` and `npes_node` are set to 0 and 2 respectively for every initializing PE. This prevents `NVSHMEM_ENABLE_NIC_PE_MAPPING=1` from selecting a device from a static list by `mype_node` / local rank in `transport.cpp#nvshmemi_setup_connections`.

To allow static assignment, we introduce a new `DEEP_EP_DEVICE_TO_HCA_MAPPING` environment variable, read during `deep_ep.Buffer` initialization, that accepts `<cuda_device_id>:<HCA_name>:<HCA_port>` entries and uses `torch.cuda.current_device()` to set `NVSHMEM_HCA_LIST` to the appropriate `<HCA_name>:<HCA_port>`, or errors if no such device was listed. We are proposing the change in DeepEP because the choice of initialization method determines how HCA-to-GPU binding can be achieved across multiple existing NVSHMEM versions. Because vLLM 0.11.0 uses CUDA_VISIBLE_DEVICES to associate GPU devices with each rank, we reverse-map the visible devices to the current CUDA device.
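A minimal sketch of the intended lookup logic is shown below. The helper name `_apply_device_to_hca_mapping` and the comma-separated entry format are illustrative assumptions, not necessarily the exact code in this PR:

```python
import os
import torch


def _apply_device_to_hca_mapping() -> None:
    """Pin NVSHMEM to one HCA per rank via DEEP_EP_DEVICE_TO_HCA_MAPPING.

    Assumed entry format: <cuda_device_id>:<HCA_name>:<HCA_port>, with
    entries separated by commas, e.g. "0:mlx5_0:1,1:mlx5_1:1".
    """
    mapping = os.environ.get('DEEP_EP_DEVICE_TO_HCA_MAPPING')
    if mapping is None:
        return

    # vLLM 0.11.0 sets CUDA_VISIBLE_DEVICES per rank, so the local torch
    # device index must be reverse-mapped to the physical CUDA device id
    # that the mapping refers to.
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    local_index = torch.cuda.current_device()
    if visible:
        physical_id = int(visible.split(',')[local_index])
    else:
        physical_id = local_index

    for entry in mapping.split(','):
        device_id, hca_name, hca_port = entry.split(':')
        if int(device_id) == physical_id:
            # Restrict NVSHMEM to exactly this HCA for the current PE.
            os.environ['NVSHMEM_HCA_LIST'] = f'{hca_name}:{hca_port}'
            return

    raise RuntimeError(
        f'CUDA device {physical_id} not found in DEEP_EP_DEVICE_TO_HCA_MAPPING')
```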
On GCP we would propose this configuration for all full-host workloads on A3 Ultra and A4 instance types (H200/B200 + x86, with a shared PCIe hub per 2 NICs / 2 GPUs), for example:
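On a full 8-GPU / 8-NIC host the mapping could look like the following (the HCA names, ports, and comma-separated entry format are illustrative; actual device names depend on the host image):

```
DEEP_EP_DEVICE_TO_HCA_MAPPING=0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1
```

Each rank then exports only its own `<HCA_name>:<HCA_port>` into `NVSHMEM_HCA_LIST`, so two GPUs on the same PCIe switch no longer collapse onto the same NIC.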
Tested with NVSHMEM 3.3.20, vLLM 0.11.0, and vLLM main @ 2025/10/23.
Co-Authored-By: Keon Jang <keonjang@google.com>