Description
Background information
We were using Open MPI 4.0.5 with libfabric 1.11.0 for MPI one-sided communication. After upgrading to Open MPI 4.1, MPI_Win_lock_all crashes.
Using libfabric 1.12.x works. However, since Open MPI 4.0 works with libfabric 1.11, this looks more like an Open MPI bug, and upgrading libfabric may not be easily possible.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.0.x (specifically 4.0.5) works; 4.1.0 and 4.1.1 crash.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Source
Please describe the system on which you are running
- Operating system/version: RHEL 7.9
- Computer hardware: Intel x86
- Network type: Infiniband
Details of the problem
The following program crashes when compiled and run with mpirun
across more than one node:
test_mpi2.cpp.txt
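The reproducer is only attached, not inlined. As a rough sketch (not the actual contents of test_mpi2.cpp), a minimal program exercising the failing call path seen in the backtrace below (window allocation followed by MPI_Win_lock_all) could look like this; the build and launch commands in the comments are illustrative assumptions:

```cpp
// Hypothetical minimal reproducer; the attached test_mpi2.cpp may differ.
// Compile/run (example): mpicxx repro.cpp -o a2.out
//                        mpirun -n 2 --map-by node ./a2.out
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Allocate a small RMA window on every rank.
    int* base = nullptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);

    // According to the backtrace, the segfault occurs inside
    // MPI_Win_lock_all (ompi_osc_rdma_lock_all_atomic -> btl/ofi progress)
    // with Open MPI 4.1.x + libfabric 1.11.0 on more than one node.
    MPI_Win_lock_all(0, win);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);

    if (rank == 0) std::printf("done\n");
    MPI_Finalize();
    return 0;
}
```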
Output:
[taurusi6584:18594:0:18594] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)
==== backtrace (tid: 18594) ====
0 0x00000000000234f3 ucs_debug_print_backtrace() /dev/shm/easybuild-build/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/ucs/debug/debug.c:656
1 0x00000000000102e1 rxd_start_xfer.cold() rxd_cq.c:0
2 0x0000000000073a16 rxd_progress_tx_list() crtstuff.c:0
3 0x000000000007547b rxd_handle_recv_comp() crtstuff.c:0
4 0x00000000000781a5 rxd_ep_progress() crtstuff.c:0
5 0x000000000002ed3d ofi_cq_progress() crtstuff.c:0
6 0x000000000002e09e ofi_cq_readfrom() crtstuff.c:0
7 0x0000000000006da7 mca_btl_ofi_context_progress() ???:0
8 0x0000000000003e8e mca_btl_ofi_component_progress() btl_ofi_component.c:0
9 0x00000000000313ab opal_progress() ???:0
10 0x0000000000018335 ompi_osc_rdma_lock_all_atomic() ???:0
11 0x000000000009b203 MPI_Win_lock_all() ???:0
12 0x00000000004017e1 main() ???:0
13 0x0000000000022555 __libc_start_main() ???:0
14 0x0000000000401529 _start() ???:0
=================================
[taurusi6584:18594] *** Process received signal ***
[taurusi6584:18594] Signal: Segmentation fault (11)
[taurusi6584:18594] Signal code: (-6)
[taurusi6584:18594] Failing at address: 0xf51cf000048a2
[taurusi6584:18594] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b3dcca1c630]
[taurusi6584:18594] [ 1] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x102e1)[0x2b3dcdad92e1]
[taurusi6584:18594] [ 2] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x73a16)[0x2b3dcdb3ca16]
[taurusi6584:18594] [ 3] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x7547b)[0x2b3dcdb3e47b]
[taurusi6584:18594] [ 4] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x781a5)[0x2b3dcdb411a5]
[taurusi6584:18594] [ 5] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x2ed3d)[0x2b3dcdaf7d3d]
[taurusi6584:18594] [ 6] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x2e09e)[0x2b3dcdaf709e]
[taurusi6584:18594] [ 7] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/OpenMPI/4.1.1-GCC-10.2.0/lib/openmpi/mca_btl_ofi.so(mca_btl_ofi_context_progress+0x57)[0x2b3dcdabfda7]
[taurusi6584:18594] [ 8] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/OpenMPI/4.1.1-GCC-10.2.0/lib/openmpi/mca_btl_ofi.so(+0x3e8e)[0x2b3dcdabce8e]
[taurusi6584:18594] [ 9] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/OpenMPI/4.1.1-GCC-10.2.0/lib/libopen-pal.so.40(opal_progress+0x2b)[0x2b3dcc1693ab]
[taurusi6584:18594] [10] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/OpenMPI/4.1.1-GCC-10.2.0/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_lock_all_atomic+0x335)[0x2b3dcfd69335]
[taurusi6584:18594] [11] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/OpenMPI/4.1.1-GCC-10.2.0/lib/libmpi.so.40(PMPI_Win_lock_all+0xb3)[0x2b3dcb661203]
[taurusi6584:18594] [12] a2.out[0x4017e1]
[taurusi6584:18594] [13] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3dcbcd0555]
[taurusi6584:18594] [14] a2.out[0x401529]
[taurusi6584:18594] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 18594 on node taurusi6584 exited on signal 11 (Segmentation fault).
Configure command line: '--build=x86_64-pc-linux-gnu'
'--host=x86_64-pc-linux-gnu' '--with-slurm'
'--with-pmi=/usr' '--with-pmi-libdir=/usr/lib64'
'--with-knem=/opt/knem-1.1.3.90mlnx1'
'--enable-mpirun-prefix-by-default'
'--enable-shared' '--with-cuda=no'
'--with-hwloc=/sw/installed/hwloc/2.2.0-GCCcore-10.2.0'
'--with-libevent=/sw/installed/libevent/2.1.12-GCCcore-10.2.0'
'--with-ofi=/beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0'
'--with-pmix=/sw/installed/PMIx/3.1.5-GCCcore-10.2.0'
'--with-ucx=/sw/installed/UCX/1.9.0-GCCcore-10.2.0'
'--without-verbs'