Description
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.2, v4.1.1, v4.1.0, and v4.0.7 tested
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
tarball
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
n/a
Please describe the system on which you are running
- Operating system/version: Ubuntu 18.04
- Computer hardware: AWS p2.xlarge (Nvidia K80 GPU)
- Network type: n/a (single node)
Details of the problem
When calling either Ireduce or Iallreduce on PyTorch GPU tensors, a segfault occurs. I haven't exhaustively tested all of the ops, but Reduce, Allreduce, Isend / Irecv, and Ibcast all work when tested the same way. Numba GPU arrays are affected as well; I haven't tested CuPy arrays, but it might be worthwhile (an untested CuPy version of the check is sketched after the script below). This behavior was discovered by @leofang in mpi4py/mpi4py#164 (comment) while testing mpi4py.
Here is a minimal script that demonstrates the behavior. The error only occurs when running on the GPU:
# mpirun -np 2 python repro.py gpu Ireduce
from mpi4py import MPI
import torch
import sys

if len(sys.argv) < 3:
    print('Usage: python repro.py [cpu|gpu] [MPI function to test]')
    sys.exit(1)

use_gpu = sys.argv[1] == 'gpu'
func_name = sys.argv[2]

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if use_gpu:
    device = torch.device('cuda:' + str(rank % torch.cuda.device_count()))
else:
    device = torch.device('cpu')

def test_Iallreduce():
    sendbuf = torch.ones(1, device=device)
    recvbuf = torch.empty_like(sendbuf)
    torch.cuda.synchronize()
    req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    assert recvbuf[0] == size

def test_Ireduce():
    buf = torch.ones(1, device=device)
    if rank == 0:
        sendbuf = MPI.IN_PLACE
        recvbuf = buf
    else:
        sendbuf = buf
        recvbuf = None
    torch.cuda.synchronize()
    req = comm.Ireduce(sendbuf, recvbuf, root=0, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    if rank == 0:
        assert buf[0] == size

eval('test_' + func_name + '()')
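As mentioned above, I haven't actually tried CuPy. Below is an untested sketch of the equivalent check, assuming mpi4py picks up the CuPy buffers through __cuda_array_interface__ the same way it does for torch tensors; the file name cupy_repro.py is just illustrative:

# mpirun -np 2 python cupy_repro.py   (hypothetical, untested)
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Pin each rank to a GPU, mirroring the torch script above
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

sendbuf = cp.ones(1, dtype=cp.float32)
recvbuf = cp.empty_like(sendbuf)
cp.cuda.get_current_stream().synchronize()  # make sure the device buffers are ready

req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)  # expected to exercise the same code path
req.wait()
assert recvbuf[0] == size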
Software/Hardware Versions:
- Open MPI 4.1.2, 4.1.1, 4.1.0, and 4.0.7 (built with the --with-cuda flag)
- mpi4py 3.1.3 (built against above MPI version)
- CUDA 11.0
- Python 3.6 (also tested under 3.8)
- Nvidia K80 GPU (also tested with V100)
- OS Ubuntu 18.04 (also tested in containerized environment)
- torch 1.10.1 (w/ GPU support)
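If it helps to double-check any of these on a given machine, the versions can be queried with commands along these lines:

mpirun --version
python --version
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import mpi4py; print(mpi4py.__version__)"
nvidia-smi --query-gpu=name,driver_version --format=csv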
You can reproduce my environment setup with the following commands:
wget https://www.open-mpi.org/software/ompi/v4.1/downloads/openmpi-4.1.2.tar.gz
tar xvf openmpi-4.1.2.tar.gz
cd openmpi-4.1.2
./configure --with-cuda --prefix=/opt/openmpi-4.1.2
sudo make -j4 all install
export PATH=/opt/openmpi-4.1.2/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-4.1.2/lib:$LD_LIBRARY_PATH
env MPICC=/opt/openmpi-4.1.2/bin/mpicc pip install mpi4py
pip install torch numpy
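Before running the repro, it may be worth confirming that this build really does report CUDA support, and which MPI library mpi4py ended up linked against:

# should print ...:mpi_built_with_cuda_support:value:true if CUDA support is compiled in
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# shows the MPI build mpi4py was compiled against and the linked library version
python -c "import mpi4py; print(mpi4py.get_config())"
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"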
Here is the error message from running Ireduce:
[<host>:25864] *** Process received signal ***
[<host>:25864] Signal: Segmentation fault (11)
[<host>:25864] Signal code: Invalid permissions (2)
[<host>:25864] Failing at address: 0x1201220000
[<host>:25864] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f00efcf3040]
[<host>:25864] [ 1] /opt/openmpi-4.1.2/lib/openmpi/mca_op_avx.so(+0xc079)[0x7f00e41c0079]
[<host>:25864] [ 2] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(+0x7385)[0x7f00d3330385]
[<host>:25864] [ 3] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(NBC_Progress+0x1f3)[0x7f00d3330033]
[<host>:25864] [ 4] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_progress+0x8e)[0x7f00d332e84e]
[<host>:25864] [ 5] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f00edefba3c]
[<host>:25864] [ 6] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5)[0x7f00edf025a5]
[<host>:25864] [ 7] /opt/openmpi-4.1.2/lib/libmpi.so.40(ompi_request_default_wait+0x1f9)[0x7f00ee4eafa9]
[<host>:25864] [ 8] /opt/openmpi-4.1.2/lib/libmpi.so.40(PMPI_Wait+0x52)[0x7f00ee532e02]
[<host>:25864] [ 9] /home/ubuntu/venv/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so(+0xa81e2)[0x7f00ee8911e2]
[<host>:25864] [10] python[0x50a865]
[<host>:25864] [11] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [12] python[0x509989]
[<host>:25864] [13] python[0x50a6bd]
[<host>:25864] [14] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [15] python[0x507f94]
[<host>:25864] [16] python(PyRun_StringFlags+0xaf)[0x63500f]
[<host>:25864] [17] python[0x600911]
[<host>:25864] [18] python[0x50a4ef]
[<host>:25864] [19] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [20] python[0x507f94]
[<host>:25864] [21] python(PyEval_EvalCode+0x23)[0x50b0d3]
[<host>:25864] [22] python[0x634dc2]
[<host>:25864] [23] python(PyRun_FileExFlags+0x97)[0x634e77]
[<host>:25864] [24] python(PyRun_SimpleFileExFlags+0x17f)[0x63862f]
[<host>:25864] [25] python(Py_Main+0x591)[0x6391d1]
[<host>:25864] [26] python(main+0xe0)[0x4b0d30]
[<host>:25864] [27] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f00efcd5bf7]
[<host>:25864] [28] python(_start+0x2a)[0x5b2a5a]
[<host>:25864] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node <host> exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
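One possibly useful data point for triage: the backtrace shows mca_op_avx.so being invoked from mca_coll_libnbc.so, presumably applying the reduction operator on the host to what is actually a device pointer. Purely as a diagnostic (not something I have verified as a workaround), the AVX op component can be excluded to see whether the failure signature changes:

# diagnostic only; the crash may simply move into the base op implementation
mpirun -np 2 --mca op ^avx python repro.py gpu Ireduce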
I appreciate any guidance!