Conversation


@smarterclayton commented on Oct 24, 2025

The nvshmemi_get_devices_by_distance default initialization method in NVSHMEM does not work optimally for GPU configurations where 2 GPUs and 2 RDMA NICs share a PCIe bus, such as the x86-based GCP A3 Ultra H200 and A4 B200 instance types: https://cloud.google.com/compute/docs/gpus/gpu-network-bandwidth#h200-gpus. GPU0 and GPU1 (on two independent processes) observe NIC0 and NIC1 on the same PCIe switch as equidistant, and the default DeepEP + NVSHMEM configuration assigns both GPUs NIC0 via nvshmemi_get_devices_by_distance. This halves the observed RDMA bandwidth in test_internode.py and in vLLM wide-EP (because only 4 of 8 NICs are enabled).

The alternative is a static mapping between GPU host index (PE) and NIC index (HCA), but the NVSHMEMX_INIT_WITH_UNIQUEID initialization method bypasses setting mype_node and npes_node.

The nvshmemi_boot_handle.pg_rank for this initialization method is always 0 and the nvshmemi_boot_handle.pg_size is always 2; as a result, mype_node and npes_node are set to 0 and 2 respectively for every initializing PE. This prevents NVSHMEM_ENABLE_NIC_PE_MAPPING=1 from selecting from a static list of devices by mype_node / local rank in transport.cpp#nvshmemi_setup_connections:

selected_devices[0] =
  nvshmemi_state->mype_node % (tcurr->n_devices > 0
    ? tcurr->n_devices : 1);

To allow static assignment, we introduce a new DEEP_EP_DEVICE_TO_HCA_MAPPING environment variable, read during deep_ep.Buffer initialization, that accepts a comma-separated list of <cuda_device_id>:<HCA_name>:<HCA_port> entries and uses torch.cuda.current_device() to set NVSHMEM_HCA_LIST to the matching <HCA_name>:<HCA_port>, or raises an error if the current device is not listed.
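
A minimal sketch of the intended parsing logic (the helper name _apply_device_to_hca_mapping and the exact error text are illustrative, not the code in this PR):

    import os
    import torch

    def _apply_device_to_hca_mapping() -> None:
        """Parse DEEP_EP_DEVICE_TO_HCA_MAPPING and export NVSHMEM_HCA_LIST.

        Entries look like "<cuda_device_id>:<HCA_name>:<HCA_port>", separated
        by commas, e.g. "0:mlx5_0:1,1:mlx5_1:1".
        """
        mapping = os.environ.get("DEEP_EP_DEVICE_TO_HCA_MAPPING")
        if not mapping:
            return  # fall back to NVSHMEM's distance-based NIC selection

        device = torch.cuda.current_device()
        for entry in mapping.split(","):
            device_id, hca_name, hca_port = entry.split(":")
            if int(device_id) == device:
                # Pin this PE to a single HCA so two GPUs behind the same
                # PCIe switch do not both pick the first NIC.
                os.environ["NVSHMEM_HCA_LIST"] = f"{hca_name}:{hca_port}"
                return

        raise ValueError(
            f"DEEP_EP_DEVICE_TO_HCA_MAPPING has no entry for CUDA device {device}")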

We are proposing the change to DeepEP because the choice of initialization method determines how HCA-to-GPU binding can be achieved across multiple existing NVSHMEM versions. Because vLLM 0.11.0 uses CUDA_VISIBLE_DEVICES to associate GPU devices with each rank, we reverse-map the visible devices to the current CUDA device, as sketched below.
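
A sketch of that reverse mapping, assuming CUDA_VISIBLE_DEVICES contains integer device indices (the helper name _host_device_index is illustrative):

    import os
    import torch

    def _host_device_index() -> int:
        # torch.cuda.current_device() returns an index into the *visible*
        # devices; translate it back to the host-level GPU index used in
        # DEEP_EP_DEVICE_TO_HCA_MAPPING.
        visible = os.environ.get("CUDA_VISIBLE_DEVICES")
        current = torch.cuda.current_device()
        if not visible:
            return current  # all host devices visible; indices already match
        # vLLM 0.11.0 sets e.g. CUDA_VISIBLE_DEVICES=3 for a rank, so the
        # process-local device 0 corresponds to host device 3.
        return int(visible.split(",")[current])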

On GCP we would propose this configuration for all full-host workloads on A3U and A4 instance types (H200/B200 + x86, with a shared PCIe switch per 2 NICs / 2 GPUs):

DEEP_EP_DEVICE_TO_HCA_MAPPING=0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1
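
With this mapping, for example, a rank whose host CUDA device resolves to 3 would end up with NVSHMEM_HCA_LIST=mlx5_3:1 before NVSHMEM initialization, so each GPU is pinned to a distinct NIC.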

Tested with NVSHMEM 3.3.20, vLLM 0.11.0, and vLLM main @ 2025/10/23.

Co-Authored-By: Keon Jang <keonjang@google.com>

@smarterclayton changed the title from "Allow NVSHMEM PE to NIC to be initialized by rank" to "Allow NVSHMEM PE to NIC mapping to be initialized by DeepEP rank" on Oct 24, 2025
smarterclayton and others added 2 commits October 29, 2025 14:52
The `nvshmemi_get_devices_by_distance` default initialization
method in NVSHMEM does not work optimally for GPU configurations
where 2 GPUs and 2 RDMA NICs share a PCIe bus, such as the
x86-based GCP A3 Ultra H200 and A4 B200 instance types:
https://cloud.google.com/compute/docs/gpus/gpu-network-bandwidth#h200-gpus.
GPU0 and GPU1 (on two independent processes) observe NIC0 and NIC1
on the same PCIe switch as equidistant, so both GPUs end up
using NIC0, halving the observed RDMA bandwidth in
test_internode.py and in vLLM wide-EP.

The alternative is a static mapping between GPU host index (PE) and
NIC index (HCA), but the NVSHMEMX_INIT_WITH_UNIQUEID
initialization method bypasses setting `mype_node` and `npes_node`.
The `nvshmemi_boot_handle.pg_rank` for this initialization method
is always 0 and the `nvshmemi_boot_handle.pg_size` is always 2,
preventing NVSHMEM_ENABLE_NIC_PE_MAPPING from leveraging a static
list of devices in transport.cpp#nvshmemi_setup_connections:

    selected_devices[0] =
      nvshmemi_state->mype_node % (tcurr->n_devices > 0
        ? tcurr->n_devices : 1);

which evaluates with mype_node = 0 on every PE, so all PEs select
the same first device.

To allow static assignment, introduce a DEEP_EP_DEVICE_TO_HCA_MAPPING
environment variable during Buffer Python initialization that accepts
`<cuda_device_id>:<HCA_name>:<HCA_port>` entries and resolves
`torch.cuda.current_device()` to set NVSHMEM_HCA_LIST to the
appropriate value, or errors if the current device is not listed.

Co-Authored-By: Keon Jang <keonjang@google.com>
Signed-off-by: Clayton Coleman <smarterclayton@gmail.com>
Map integer CUDA_VISIBLE_DEVICES values back to the host device
index when resolving DEEP_EP_DEVICE_TO_HCA_MAPPING.

Signed-off-by: Clayton Coleman <smarterclayton@gmail.com>