Accelerator aware (OFI) NIC selection #11696

Open
@wenduwan

Is your feature request related to a problem? Please describe.
The main and v5.0.x branches do not have a built-in mechanism to pair an accelerator (for example, a GPU) with a "nearby" NIC. This has two implications for NVIDIA GPUDirect RDMA (GDR):

  1. Sub-optimal performance when Open MPI selects a NIC that sits on, for example, a different PCIe root complex from the user-chosen GPU. See https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#supported-systems.

  2. Undefined behavior due to data ordering. According to the same page, GDR requires that "...the two devices must share the same upstream PCI Express root complex. Some of the limitations depend on the platform used and could be lifted in current/future products". Naively choosing a NIC on a different PCIe root complex from the GPU could in theory cause data ordering issues. Imagine this workflow:

    1. The NIC (on PCIe rc0) writes data to the GPU on PCIe rc1
    2. The NIC reports completion to the CPU (PCIe rc0)

    This results in undefined behavior: the CPU could process the completion before the data arrives at the GPU across the PCIe root complexes.

Describe the solution you'd like
At a minimum, we should select the NIC with the shortest distance (measured in some way) from the user-selected GPU. In Open MPI 5, we can take advantage of the accelerator framework (we just exposed the PCI attributes through the get_device_pci_attr API):

  • If the accelerator is not initialized, i.e. no GPU is in use, we select the NIC closest to the socket.
  • If the accelerator IS initialized, i.e. the user asked for a GPU, we select the NIC closest to that GPU.
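
To make the two cases concrete, here is a rough sketch of the selection logic. It is not Open MPI code: `struct pci_addr`, `struct nic_info`, `pci_distance` and `select_nic` are all hypothetical names, and the distance metric is only a placeholder for "measured in some way" (a real implementation would more likely walk the hardware topology). The GPU address is assumed to come from the PCI attributes mentioned above.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical PCI address; not an Open MPI type. */
struct pci_addr {
    unsigned domain;
    unsigned bus;
    unsigned device;
};

/* Hypothetical NIC descriptor: PCI address plus the NUMA node it hangs off. */
struct nic_info {
    struct pci_addr addr;
    int numa_node;
};

/* Placeholder metric (an assumption): smaller means "closer".
 * Comparing domain/bus/device numbers is a crude stand-in for a
 * topology-based distance. */
int pci_distance(const struct pci_addr *a, const struct pci_addr *b)
{
    if (a->domain != b->domain) return 3;
    if (a->bus != b->bus)       return 2;
    if (a->device != b->device) return 1;
    return 0;
}

/* Return the index of the preferred NIC, or -1 if the list is empty.
 * gpu == NULL models "accelerator not initialized": fall back to the
 * NIC on the same NUMA node (socket) as the calling process. */
int select_nic(const struct nic_info *nics, size_t n_nics,
               const struct pci_addr *gpu, int cpu_numa)
{
    int best = -1;
    int best_dist = INT_MAX;

    for (size_t i = 0; i < n_nics; i++) {
        int dist = gpu != NULL
                       ? pci_distance(&nics[i].addr, gpu)
                       : (nics[i].numa_node == cpu_numa ? 0 : 1);
        if (dist < best_dist) { /* ties keep the first candidate */
            best_dist = dist;
            best = (int)i;
        }
    }
    return best;
}
```

With two NICs on buses 0x10 and 0x80 and a GPU on bus 0x80, `select_nic` picks the second NIC; with `gpu == NULL` it picks whichever NIC matches `cpu_numa`.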

Optionally, for GDR we should double-check that the selected NIC and GPU comply with the above requirement, and raise an error or warning otherwise.
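
A minimal sketch of what such a check could look like, assuming each device has already been tagged with an identifier for its upstream root complex (how that identifier is obtained is left open here). `struct dev_location`, `check_gdr_pairing` and the `strict` flag are hypothetical, not existing Open MPI symbols:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical device descriptor; root_complex is an opaque id that
 * is assumed to be equal for devices under the same root complex. */
struct dev_location {
    const char *name;
    int root_complex;
};

/* Return 0 if the pairing is acceptable, -1 if it must be rejected.
 * With strict == false a mismatch only produces a warning. */
int check_gdr_pairing(const struct dev_location *nic,
                      const struct dev_location *gpu, bool strict)
{
    if (nic->root_complex == gpu->root_complex) {
        return 0;
    }
    fprintf(stderr,
            "%s: NIC %s (rc%d) and GPU %s (rc%d) do not share a PCIe "
            "root complex; GPUDirect RDMA may be slow or unsupported\n",
            strict ? "error" : "warning",
            nic->name, nic->root_complex, gpu->name, gpu->root_complex);
    return strict ? -1 : 0;
}
```

Whether a mismatch should be fatal or only a warning is a policy choice; the `strict` parameter just keeps both options visible in the sketch.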

Describe alternatives you've considered
Currently the application must pin the GPU for each rank, e.g. by setting the visible CUDA device, according to the PCIe configuration.
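
For illustration, the workaround usually looks something like the sketch below. It reads `OMPI_COMM_WORLD_LOCAL_RANK`, which Open MPI's launcher sets for each process, and writes `CUDA_VISIBLE_DEVICES` before any CUDA call is made. The helper name `pin_gpu_for_local_rank` and the `rank_to_gpu` table are hypothetical; the table has to be written by hand for each node's PCIe layout, which is exactly the burden this request aims to remove.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#endif
#include <stdio.h>
#include <stdlib.h>

/* Map the node-local rank to a GPU index and expose only that GPU.
 * rank_to_gpu is a hand-written table (an assumption of this sketch).
 * Must run before the CUDA runtime is initialized.
 * Returns the chosen GPU index, or -1 on any failure. */
int pin_gpu_for_local_rank(const int *rank_to_gpu, int n_ranks)
{
    const char *s = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    if (s == NULL) {
        return -1; /* not launched by Open MPI's launcher */
    }

    char *end = NULL;
    long rank = strtol(s, &end, 10);
    if (end == s || *end != '\0' || rank < 0 || rank >= n_ranks) {
        return -1; /* malformed or out-of-range rank */
    }

    char buf[16];
    snprintf(buf, sizeof buf, "%d", rank_to_gpu[rank]);
    if (setenv("CUDA_VISIBLE_DEVICES", buf, 1) != 0) {
        return -1;
    }
    return rank_to_gpu[rank];
}
```

An application would call this at the very start of `main`, before `MPI_Init` and before touching CUDA.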

Additional context
Related: #11687
