Description
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.2, v4.1.1, v4.1.0, and v4.0.7 tested
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
tarball
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
n/a
Please describe the system on which you are running
- Operating system/version: Ubuntu 18.04
- Computer hardware: AWS p2.xlarge (Nvidia K80 GPU)
- Network type: n/a (single node)
Details of the problem
When calling either Ireduce or Iallreduce on PyTorch GPU tensors, a segfault occurs. I haven't exhaustively tested all of the ops, but Reduce, Allreduce, Isend / Irecv, and Ibcast all work when tested the same way. Numba GPU arrays are affected as well; I haven't tested CuPy arrays, but it might be worthwhile (an untested CuPy version of the check is sketched after the script below). This behavior was discovered by @leofang in mpi4py/mpi4py#164 (comment) while testing mpi4py.
Here is a minimal script that demonstrates the behavior. The error only occurs when running on the GPU:
# mpirun -np 2 python repro.py gpu Ireduce
from mpi4py import MPI
import torch
import sys

if len(sys.argv) < 3:
    print('Usage: python repro.py [cpu|gpu] [MPI function to test]')
    sys.exit(1)

use_gpu = sys.argv[1] == 'gpu'
func_name = sys.argv[2]

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if use_gpu:
    device = torch.device('cuda:' + str(rank % torch.cuda.device_count()))
else:
    device = torch.device('cpu')

def test_Iallreduce():
    sendbuf = torch.ones(1, device=device)
    recvbuf = torch.empty_like(sendbuf)
    torch.cuda.synchronize()
    req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    assert recvbuf[0] == size

def test_Ireduce():
    buf = torch.ones(1, device=device)
    if rank == 0:
        sendbuf = MPI.IN_PLACE
        recvbuf = buf
    else:
        sendbuf = buf
        recvbuf = None
    torch.cuda.synchronize()
    req = comm.Ireduce(sendbuf, recvbuf, root=0, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    if rank == 0:
        assert buf[0] == size

eval('test_' + func_name + '()')
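As mentioned above, I haven't actually tried CuPy. Below is an untested sketch of the equivalent check, assuming mpi4py picks up the CuPy buffers through __cuda_array_interface__ the same way it does for torch tensors; the file name cupy_repro.py is just illustrative:

# mpirun -np 2 python cupy_repro.py   (hypothetical, untested)
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Pin each rank to a GPU, mirroring the torch script above
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

sendbuf = cp.ones(1, dtype=cp.float32)
recvbuf = cp.empty_like(sendbuf)
cp.cuda.get_current_stream().synchronize()  # make sure the device buffers are ready

req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)  # expected to exercise the same code path
req.wait()
assert recvbuf[0] == size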
Software/Hardware Versions:
- Open MPI 4.1.2, 4.1.1, 4.1.0, and 4.0.7 (built with the --with-cuda flag)
- mpi4py 3.1.3 (built against above MPI version)
- CUDA 11.0
- Python 3.6 (also tested under 3.8)
- Nvidia K80 GPU (also tested with V100)
- OS Ubuntu 18.04 (also tested in containerized environment)
- torch 1.10.1 (w/ GPU support)
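If it helps to double-check any of these on a given machine, the versions can be queried with commands along these lines:

mpirun --version
python --version
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import mpi4py; print(mpi4py.__version__)"
nvidia-smi --query-gpu=name,driver_version --format=csv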
You can reproduce my environment setup with the following commands:
wget https://www.open-mpi.org/software/ompi/v4.1/downloads/openmpi-4.1.2.tar.gz
tar xvf openmpi-4.1.2.tar.gz
cd openmpi-4.1.2
./configure --with-cuda --prefix=/opt/openmpi-4.1.2
sudo make -j4 all install
export PATH=/opt/openmpi-4.1.2/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-4.1.2/lib:$LD_LIBRARY_PATH
env MPICC=/opt/openmpi-4.1.2/bin/mpicc pip install mpi4py
pip install torch numpy
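Before running the repro, it may be worth confirming that this build really does report CUDA support, and which MPI library mpi4py ended up linked against:

# should print ...:mpi_built_with_cuda_support:value:true if CUDA support is compiled in
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# shows the MPI build mpi4py was compiled against and the linked library version
python -c "import mpi4py; print(mpi4py.get_config())"
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"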
Here is the error message from running Ireduce:
[<host>:25864] *** Process received signal ***
[<host>:25864] Signal: Segmentation fault (11)
[<host>:25864] Signal code: Invalid permissions (2)
[<host>:25864] Failing at address: 0x1201220000
[<host>:25864] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f00efcf3040]
[<host>:25864] [ 1] /opt/openmpi-4.1.2/lib/openmpi/mca_op_avx.so(+0xc079)[0x7f00e41c0079]
[<host>:25864] [ 2] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(+0x7385)[0x7f00d3330385]
[<host>:25864] [ 3] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(NBC_Progress+0x1f3)[0x7f00d3330033]
[<host>:25864] [ 4] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_progress+0x8e)[0x7f00d332e84e]
[<host>:25864] [ 5] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f00edefba3c]
[<host>:25864] [ 6] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5)[0x7f00edf025a5]
[<host>:25864] [ 7] /opt/openmpi-4.1.2/lib/libmpi.so.40(ompi_request_default_wait+0x1f9)[0x7f00ee4eafa9]
[<host>:25864] [ 8] /opt/openmpi-4.1.2/lib/libmpi.so.40(PMPI_Wait+0x52)[0x7f00ee532e02]
[<host>:25864] [ 9] /home/ubuntu/venv/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so(+0xa81e2)[0x7f00ee8911e2]
[<host>:25864] [10] python[0x50a865]
[<host>:25864] [11] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [12] python[0x509989]
[<host>:25864] [13] python[0x50a6bd]
[<host>:25864] [14] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [15] python[0x507f94]
[<host>:25864] [16] python(PyRun_StringFlags+0xaf)[0x63500f]
[<host>:25864] [17] python[0x600911]
[<host>:25864] [18] python[0x50a4ef]
[<host>:25864] [19] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [20] python[0x507f94]
[<host>:25864] [21] python(PyEval_EvalCode+0x23)[0x50b0d3]
[<host>:25864] [22] python[0x634dc2]
[<host>:25864] [23] python(PyRun_FileExFlags+0x97)[0x634e77]
[<host>:25864] [24] python(PyRun_SimpleFileExFlags+0x17f)[0x63862f]
[<host>:25864] [25] python(Py_Main+0x591)[0x6391d1]
[<host>:25864] [26] python(main+0xe0)[0x4b0d30]
[<host>:25864] [27] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f00efcd5bf7]
[<host>:25864] [28] python(_start+0x2a)[0x5b2a5a]
[<host>:25864] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node <host> exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
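One possibly useful data point for triage: the backtrace shows mca_op_avx.so being invoked from mca_coll_libnbc.so, presumably applying the reduction operator on the host to what is actually a device pointer. Purely as a diagnostic (not something I have verified as a workaround), the AVX op component can be excluded to see whether the failure signature changes:

# diagnostic only; the crash may simply move into the base op implementation
mpirun -np 2 --mca op ^avx python repro.py gpu Ireduce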
I appreciate any guidance!