CUDA + MPI_TYPE_INDEXED + MPI_SEND/RECV super slow #12202

Open
@chhu

Description

Background information

We developed a CFD code for arbitrary 3D shapes some years ago and parallelized it via domain decomposition. To exchange halo/ghost-cell data we used gather/scatter loops to pack the data into a linear buffer before sending. That worked very well, surprisingly so even for GPU-to-GPU, despite the detour of copying that buffer into host memory first, which freed MPI from having to know or care about device memory. Now that multi-GPU environments are becoming standard, we wanted to see whether we can improve parallelism by transferring device-to-device with CUDA-aware Open MPI.

To simplify the code further, we switched to MPI_Type_indexed and sent/received directly with that type instead of packing into a linear buffer first. This works very well in a classic HPC environment (about 5% faster than the previous method), but applying the same logic to device memory has a tremendous impact on speed (roughly a factor of 500 slower). The detour of linearizing in device memory first, via cusparseGather() and cusparseScatter(), is at least as fast as before (with or without the additional detour to host memory).
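
For reference, the direct path looks roughly like the sketch below. All names (d_field, d_recv, nb_rank, the blocklens/displs arrays) are placeholders for illustration, not our actual code; both buffers are cudaMalloc'ed device pointers.

    /* Sketch of the direct-indexed exchange (the path that is ~500x slower
       when d_field / d_recv are device pointers). */
    #include <mpi.h>

    void exchange_halo_indexed(double *d_field, double *d_recv,   /* device pointers */
                               int n_blocks, const int *blocklens,
                               const int *displs, int nb_rank)
    {
        MPI_Datatype halo_t;
        MPI_Type_indexed(n_blocks, blocklens, displs, MPI_DOUBLE, &halo_t);
        MPI_Type_commit(&halo_t);

        MPI_Request reqs[2];
        MPI_Irecv(d_recv,  1, halo_t, nb_rank, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(d_field, 1, halo_t, nb_rank, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        MPI_Type_free(&halo_t);
    }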

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Tested with Open MPI 4.1 and the latest 5.1 branch; same outcome.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Compiled from source on the system with the latest UCX and CUDA 12.3.1.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

+088c88499514c3c55aee0c433d6f4fb46b0c3587 3rd-party/openpmix (v1.1.3-3973-g088c8849)
bc5de986e13c14db575141714a8b5e0089f4d0bf 3rd-party/prrte (psrvr-v2.0.0rc1-4741-gbc5de986e1)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: Red Hat 8
  • Computer hardware: Dual 3rd-gen AMD EPYC with 8x NVIDIA RTX A5000
  • Network type: MLX5 (not used)

Details of the problem

  • Exchanging ghost data with MPI_Type_indexed and MPI_Isend/MPI_Irecv on device memory creates a massive slow-down (~500x) compared to linearizing the buffer first (a sketch of the linearize-first path follows this list)
  • To my surprise, exchanging buffers GPU-to-GPU is almost identical in speed to copying to the host first and exchanging there. Am I doing something wrong, or am I expecting too much? (Not using GDRCopy yet!)
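
A minimal sketch of the linearize-first workaround that restores the old speed, assuming a cuSPARSE handle created by the caller. The names (d_field, d_halo_idx, d_packed, n_halo, nb_rank) are again made up for illustration, and error checking is omitted:

    #include <cuda_runtime.h>
    #include <cusparse.h>
    #include <mpi.h>

    void exchange_halo_packed(cusparseHandle_t cs,
                              double *d_field, int64_t n_cells,   /* full field (device)   */
                              int *d_halo_idx, int64_t n_halo,    /* halo indices (device) */
                              double *d_packed,                   /* scratch >= n_halo     */
                              int nb_rank)
    {
        cusparseDnVecDescr_t full;
        cusparseSpVecDescr_t halo;
        cusparseCreateDnVec(&full, n_cells, d_field, CUDA_R_64F);
        cusparseCreateSpVec(&halo, n_cells, n_halo, d_halo_idx, d_packed,
                            CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);

        /* pack on the device: d_packed[i] = d_field[d_halo_idx[i]] */
        cusparseGather(cs, full, halo);

        /* make sure the gather has finished before MPI touches the buffer */
        cudaDeviceSynchronize();

        /* send the now-contiguous device buffer */
        MPI_Request req;
        MPI_Isend(d_packed, (int)n_halo, MPI_DOUBLE, nb_rank, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        /* the receiver unpacks the incoming buffer with cusparseScatter() */

        cusparseDestroySpVec(halo);
        cusparseDestroyDnVec(full);
    }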

Thanks a lot for your help!
