Crash with Open MPI v4.1.x in MPI_Win_lock_all when used with libfabric < 1.12 #9123

Open
@Flamefire

Description

Background information

We were using Open MPI 4.0.5 with libfabric 1.11.0 for MPI one-sided communication. After upgrading to Open MPI 4.1, MPI_Win_lock_all crashes.

Using libfabric 1.12.x works. However, since Open MPI 4.0 works with libfabric 1.11, this looks more like an Open MPI bug than a libfabric one, and upgrading libfabric may not be easily possible.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.0.x (specifically 4.0.5) works; 4.1.0 and 4.1.1 crash.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Source

Please describe the system on which you are running

  • Operating system/version: RHEL 7.9
  • Computer hardware: Intel x86
  • Network type: Infiniband

Details of the problem

The attached program crashes when compiled and run with mpirun across more than one node:
test_mpi2.cpp.txt
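
The attachment is not reproduced inline. For context, a minimal sketch of the kind of one-sided program that exercises this call path could look like the following (an assumed reproducer, not the actual test_mpi2.cpp; the window setup via MPI_Win_allocate is an assumption):

  // Hypothetical minimal reproducer; it only illustrates the
  // MPI_Win_lock_all call path seen in the backtrace below.
  #include <mpi.h>

  int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);

      int rank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      // Expose one int per process through an RMA window.
      int* base = nullptr;
      MPI_Win win;
      MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &base, &win);
      *base = rank;

      // The segfault is reported inside this call (osc/rdma -> btl/ofi).
      MPI_Win_lock_all(0, win);
      MPI_Win_unlock_all(win);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }

Run across at least two nodes to trigger the crash, e.g. (exact flags are site-specific): mpirun -np 2 --map-by node ./a2.out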

Output:

[taurusi6584:18594:0:18594] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)
==== backtrace (tid:  18594) ====
 0 0x00000000000234f3 ucs_debug_print_backtrace()  /dev/shm/easybuild-build/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/ucs/debug/debug.c:656
 1 0x00000000000102e1 rxd_start_xfer.cold()  rxd_cq.c:0
 2 0x0000000000073a16 rxd_progress_tx_list()  crtstuff.c:0
 3 0x000000000007547b rxd_handle_recv_comp()  crtstuff.c:0
 4 0x00000000000781a5 rxd_ep_progress()  crtstuff.c:0
 5 0x000000000002ed3d ofi_cq_progress()  crtstuff.c:0
 6 0x000000000002e09e ofi_cq_readfrom()  crtstuff.c:0
 7 0x0000000000006da7 mca_btl_ofi_context_progress()  ???:0
 8 0x0000000000003e8e mca_btl_ofi_component_progress()  btl_ofi_component.c:0
 9 0x00000000000313ab opal_progress()  ???:0
10 0x0000000000018335 ompi_osc_rdma_lock_all_atomic()  ???:0
11 0x000000000009b203 MPI_Win_lock_all()  ???:0
12 0x00000000004017e1 main()  ???:0
13 0x0000000000022555 __libc_start_main()  ???:0
14 0x0000000000401529 _start()  ???:0
=================================
[taurusi6584:18594] *** Process received signal ***
[taurusi6584:18594] Signal: Segmentation fault (11)
[taurusi6584:18594] Signal code:  (-6)
[taurusi6584:18594] Failing at address: 0xf51cf000048a2
[taurusi6584:18594] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b3dcca1c630]
[taurusi6584:18594] [ 1] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x102e1)[0x2b3dcdad92e1]
[taurusi6584:18594] [ 2] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x73a16)[0x2b3dcdb3ca16]
[taurusi6584:18594] [ 3] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x7547b)[0x2b3dcdb3e47b]
[taurusi6584:18594] [ 4] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x781a5)[0x2b3dcdb411a5]
[taurusi6584:18594] [ 5] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x2ed3d)[0x2b3dcdaf7d3d]
[taurusi6584:18594] [ 6] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0/lib/libfabric.so.1(+0x2e09e)[0x2b3dcdaf709e]
[taurusi6584:18594] [ 7] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/OpenMPI/4.1.1-GCC-10.2.0/lib/openmpi/mca_btl_ofi.so(mca_btl_ofi_context_progress+0x57)[0x2b3dcdabfda7]
[taurusi6584:18594] [ 8] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/OpenMPI/4.1.1-GCC-10.2.0/lib/openmpi/mca_btl_ofi.so(+0x3e8e)[0x2b3dcdabce8e]
[taurusi6584:18594] [ 9] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/OpenMPI/4.1.1-GCC-10.2.0/lib/libopen-pal.so.40(opal_progress+0x2b)[0x2b3dcc1693ab]
[taurusi6584:18594] [10] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/OpenMPI/4.1.1-GCC-10.2.0/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_lock_all_atomic+0x335)[0x2b3dcfd69335]
[taurusi6584:18594] [11] /beegfs/global0/ws/s3248973-easybuild/openMPINew/software/OpenMPI/4.1.1-GCC-10.2.0/lib/libmpi.so.40(PMPI_Win_lock_all+0xb3)[0x2b3dcb661203]
[taurusi6584:18594] [12] a2.out[0x4017e1]
[taurusi6584:18594] [13] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3dcbcd0555]
[taurusi6584:18594] [14] a2.out[0x401529]
[taurusi6584:18594] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 18594 on node taurusi6584 exited on signal 11 (Segmentation fault).

  Configure command line: '--build=x86_64-pc-linux-gnu'
                          '--host=x86_64-pc-linux-gnu' '--with-slurm'
                          '--with-pmi=/usr' '--with-pmi-libdir=/usr/lib64'
                          '--with-knem=/opt/knem-1.1.3.90mlnx1'
                          '--enable-mpirun-prefix-by-default'
                          '--enable-shared' '--with-cuda=no'
                          '--with-hwloc=/sw/installed/hwloc/2.2.0-GCCcore-10.2.0'
                          '--with-libevent=/sw/installed/libevent/2.1.12-GCCcore-10.2.0'
                          '--with-ofi=/beegfs/global0/ws/s3248973-easybuild/openMPINew/software/libfabric/1.11.0-GCCcore-10.2.0'
                          '--with-pmix=/sw/installed/PMIx/3.1.5-GCCcore-10.2.0'
                          '--with-ucx=/sw/installed/UCX/1.9.0-GCCcore-10.2.0'
                          '--without-verbs'
