
UVM buffers failing in cuIpcGetMemHandle? #6799

Closed
@paboyle

Description


Background information

I'm running Open MPI 4.0.1, self-compiled, over Omni-Path with IFS 10.8 as distributed by Intel.

The boards are

  • HPE XA with
  • 4 x NVIDIA Volta V100 GPUs and
  • 4 OPA 100 Gb ports on two PCIe dual-port HFI cards.

The good news is that MPI appears to work between nodes, where these buffers are sent from explicit device memory.

However, when I run four MPI ranks per node so that communications between ranks use buffers in unified virtual memory (UVM), allocated with cudaMallocManaged(), I get a cuIpcGetMemHandle() failure. The full error and details are given under "Details of the problem" below.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

./configure CC=gcc CXX=g++ --prefix=/home/dp008/dp008/paboyle/Modules/openmpi/install/ --with-psm2-libdir=/lib64/ --with-cuda=/tessfs1/sw/cuda/9.2/ --enable-orterun-prefix-by-default

Compiled with gcc 7.3.0.

Please describe the system on which you are running

  • Operating system/version:

Red Hat / CentOS 7.4

  • Computer hardware:

HPE XA780i
Dual Skylake 4116, 12+12 cores.
Two OPA dual-port HFIs.
Four V100 SXM2 GPUs.
96 GB RAM.

  • Network type:

Two OPA dual-port HFIs.


Details of the problem

When I run four MPI ranks per node and ensure that communications between ranks use buffers in unified virtual memory (UVM), allocated with cudaMallocManaged(), I get a failure:

r6i6n7.218497 Benchmark_dwf: CUDA failure: cuIpcGetMemHandle() (at /nfs/site/home/phcvs2/gitrepo/ifs-all/Ofed_Delta/rpmbuild/BUILD/libpsm2-11.2.23/ptl_am/am_reqrep_shmem.c:1977)returned 1 
r6i6n7.218497 Error returned from CUDA function.

When I patch the code to use explicit host memory instead, it succeeds.
However, I want to keep these buffers in UVM and have loops with either host or device execution policy fill them, as that is how the code was designed to operate.
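
For context, the usage pattern the code relies on is roughly the following. This is a stripped-down sketch, not the actual Benchmark_dwf code; the kernel, buffer sizes, and rank pairing are illustrative only.

#include <mpi.h>
#include <cuda_runtime.h>

// Illustrative device-side fill (stands in for a loop run with a device execution policy).
__global__ void fill_device(double *buf, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = 1.0;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const size_t n = 1 << 20;
  double *sendbuf = nullptr, *recvbuf = nullptr;
  cudaMallocManaged(&sendbuf, n * sizeof(double));   // UVM: same pointer valid on host and device
  cudaMallocManaged(&recvbuf, n * sizeof(double));

  if (rank % 2 == 0) {
    fill_device<<<(n + 255) / 256, 256>>>(sendbuf, n);   // device execution policy
    cudaDeviceSynchronize();
  } else {
    for (size_t i = 0; i < n; ++i) sendbuf[i] = 1.0;     // host execution policy, same buffer
  }

  // Exchange directly from the UVM pointers; with four ranks per node the peer is
  // intra-node, and this is where the cuIpcGetMemHandle() failure appears.
  int peer = rank ^ 1;                                   // assumes an even number of ranks
  MPI_Sendrecv(sendbuf, (int)n, MPI_DOUBLE, peer, 0,
               recvbuf, (int)n, MPI_DOUBLE, peer, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  cudaFree(sendbuf);
  cudaFree(recvbuf);
  MPI_Finalize();
  return 0;
}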

Running the unmodified code with one rank per node works, so UVM does work as a source for inter-node network traffic, just not as a source for intra-node traffic between GPUs.
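
If I read the PSM2 source correctly, the shared-memory path is taking a CUDA IPC handle on the user buffer, and my understanding is that CUDA IPC does not cover cudaMallocManaged() allocations. The following standalone sketch (using cudaIpcGetMemHandle(), the runtime-API equivalent of the cuIpcGetMemHandle() call in the log) should show the difference; I would expect the managed case to return error 1 (invalid value), matching the log, while the cudaMalloc() case succeeds.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  void *managed = nullptr, *device = nullptr;
  cudaIpcMemHandle_t handle;

  // UVM buffer, as allocated in the benchmark
  cudaMallocManaged(&managed, 1 << 20);
  cudaError_t e1 = cudaIpcGetMemHandle(&handle, managed);
  printf("managed buffer: %s (%d)\n", cudaGetErrorString(e1), (int)e1);

  // Explicit device memory, for comparison
  cudaMalloc(&device, 1 << 20);
  cudaError_t e2 = cudaIpcGetMemHandle(&handle, device);
  printf("device buffer:  %s (%d)\n", cudaGetErrorString(e2), (int)e2);

  cudaFree(device);
  cudaFree(managed);
  return 0;
}

Compiled with nvcc and run on one of these nodes, this would isolate whether the IPC handle itself is the problem, independent of MPI and PSM2.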

Is there something I need to configure differently? (I admit this is a complex environment, so I could be missing something!)
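
For reference, the patch mentioned above boils down to staging through explicit host memory rather than handing the UVM pointer to MPI. A rough sketch of that workaround (the function and variable names here are made up, not the real patch):

#include <mpi.h>
#include <cuda_runtime.h>

// Exchange a UVM buffer with a peer by staging through pinned host memory,
// which keeps the UVM pointer away from the cuIpcGetMemHandle() path.
void exchange_via_host(double *uvm_send, double *uvm_recv, int n, int peer) {
  double *stage = nullptr;
  cudaMallocHost(&stage, (size_t)n * sizeof(double));                          // pinned host staging buffer

  cudaMemcpy(stage, uvm_send, (size_t)n * sizeof(double), cudaMemcpyDefault);  // UVM -> host
  MPI_Sendrecv_replace(stage, n, MPI_DOUBLE, peer, 0, peer, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  cudaMemcpy(uvm_recv, stage, (size_t)n * sizeof(double), cudaMemcpyDefault);  // host -> UVM

  cudaFreeHost(stage);
}

This works, but it defeats the purpose of keeping the buffers in UVM, which is why I'd like the direct path to work.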
