Intermittent crashes inside MPI_Finalize #10117

Open
@sethrj

Description

Background information

This is moved from openpmix/openpmix#2508. Our CI has roughly a 1-in-1000 chance of crashing or hanging during an ordinary call to MPI_Finalize at the end of our unit tests (even tests launched with only a single process).
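
For concreteness, the failing tests do nothing unusual at shutdown. A minimal sketch of the shape of the problem (not our actual test harness) is just init, trivial work, finalize, run thousands of times under mpirun -np 1:

/* repro.c -- minimal sketch of the finalize path that intermittently fails.
 * Our real tests fail roughly once per thousand runs; this sketch only
 * approximates their shutdown path, it is not the harness itself. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Trivial "work" standing in for a unit-test body. */
    int value = 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();   /* the intermittent crash or hang happens in here */
    return 0;
}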

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Currently 4.1.2, but I've seen this with a couple of different Open MPI 4.x versions, with both internal and external PMIx, all installed with Spack. Here are two configurations that fail (differing primarily in OS):

-- linux-centos7-x86_64 / gcc@8.5.0.static ----------------------
5az5n35 openmpi@4.1.2~atomics~cuda~cxx~cxx_exceptions~gpfs~internal-hwloc~java~legacylaunchers~lustre~memchecker~pmi+romio~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=none schedulers=none
kgmp3uz     hwloc@2.7.0~cairo~cuda~gl~libudev+libxml2~netloc~nvml~opencl+pci~rocm+shared
2lkmzcq         libpciaccess@0.16
lkze2cg         libxml2@2.9.12~python
az6pdin             libiconv@1.16 libs=shared,static
6s2r2ue             xz@5.2.5~pic libs=shared,static
iyy6cky             zlib@1.2.11+optimize+pic+shared
dax2eni         ncurses@6.2~symlinks+termlib abi=none
iebojck     libevent@2.1.12+openssl
nzhpoor         openssl@1.1.1m~docs certs=system
ua64xxh     numactl@2.0.14 patches=4e1d78c,62fc8a8,ff37630
jvkte5j     openssh@7.4p1
7p6hkue     pmix@4.1.2~docs+pmi_backwards_compatibility~restful

and (built-in PMIx):

-- linux-rhel6-x86_64 / gcc@8.5.0 -------------------------------
lt6jzov openmpi@4.1.2~atomics~cuda~cxx~cxx_exceptions+gpfs~internal-hwloc~java~legacylaunchers~lustre~memchecker~pmi~pmix+romio~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=none schedulers=none
5xkybkw     hwloc@2.7.0~cairo~cuda~gl~libudev+libxml2~netloc~nvml~opencl+pci~rocm+shared
mvud5q4         libpciaccess@0.16
oxzozd2         libxml2@2.9.12~python
4uefnnj             libiconv@1.16 libs=shared,static
vyzv6u3             xz@5.2.5~pic libs=shared,static
e2lizux             zlib@1.2.11+optimize+pic+shared
wbkz7nq         ncurses@6.2~symlinks+termlib abi=none
5i47rbn     libevent@2.1.12+openssl
eippinp         openssl@1.1.1m~docs certs=system
sg5j4u6     numactl@2.0.14 patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94,62fc8a8bf7665a60e8f4c93ebbd535647cebf74198f7afafec4c085a8825c006,ff37630df599cfabf0740518b91ec8daaf18e8f288b19adaae5364dc1f6b2296
czcknwg     openssh@5.3p1

I also got a failure with an even older Open MPI (2.1.6):

-- linux-rhel6-x86_64 / gcc@8.5.0 -------------------------------
rcdlf4z openmpi@2.1.6~atomics~cuda~cxx~cxx_exceptions+gpfs~internal-hwloc~java~legacylaunchers~lustre~memchecker~pmi~pmix+romio~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=none patches=d7f08ae74788a15662aeeeaf722e30045b212afe17e19e976d42b3411cc7bc26 schedulers=none
733oqre     hwloc@1.11.13~cairo~cuda~gl~libudev+libxml2~netloc~nvml~opencl+pci~rocm+shared patches=d1d94a4af93486c88c70b79cd930979f3a2a2b5843708e8c7c1655f18b9fc694
mvud5q4         libpciaccess@0.16
oxzozd2         libxml2@2.9.12~python
4uefnnj             libiconv@1.16 libs=shared,static
vyzv6u3             xz@5.2.5~pic libs=shared,static
e2lizux             zlib@1.2.11+optimize+pic+shared
wbkz7nq         ncurses@6.2~symlinks+termlib abi=none
sg5j4u6     numactl@2.0.14 patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94,62fc8a8bf7665a60e8f4c93ebbd535647cebf74198f7afafec4c085a8825c006,ff37630df599cfabf0740518b91ec8daaf18e8f288b19adaae5364dc1f6b2296
czcknwg     openssh@5.3p1
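
Since several different Spack-built Open MPI installs are in play here, one sanity check is to print which library a test binary is actually linked against at runtime. A small sketch using the standard MPI_Get_library_version call (the file name is illustrative, not part of our test suite):

/* which_mpi.c -- print the MPI library the binary actually linked against. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;

    /* MPI_Get_library_version may be called before MPI_Init. */
    MPI_Get_library_version(version, &len);
    printf("%s\n", version);   /* e.g. "Open MPI v4.1.2, ..." */

    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}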

Please describe the system on which you are running

  • Operating system/version: centos7/rhel6
  • Computer hardware: zen2/broadwell
  • Network type: local

Details of the problem

Here are stack traces from two failure modes that crash rather than hang. The first is inside PMIx, with Open MPI 4.1.2:

* thread #1, name = 'tstJoin', stop reason = signal SIGSEGV
  * frame #0: 0x00007faaabf01e08 libpmix.so.2`pmix_notify_check_range + 8
    frame #1: 0x00007faaabf032c8 libpmix.so.2`cycle_events + 1224
    frame #2: 0x00007faaab687c49 libevent_core-2.1.so.7`event_process_active_single_queue(base=0x000000000141cbf0, activeq=0x000000000141a000, max_to_process=2147483647, endtime=0x0000000000000000) at event.c:1691:4
    frame #3: 0x00007faaab68850f libevent_core-2.1.so.7`event_base_loop at event.c:1783:9
    frame #4: 0x00007faaab688454 libevent_core-2.1.so.7`event_base_loop(base=0x000000000141cbf0, flags=<unavailable>) at event.c:2006:12
    frame #5: 0x00007faaabf9ef2e libpmix.so.2`progress_engine + 30
    frame #6: 0x00007faab7900ea5 libpthread.so.0`start_thread + 197
    frame #7: 0x00007faab656fb0d libc.so.6`__clone + 109
* thread #2, stop reason = signal 0
  * frame #0: 0x00007faab7904a35 libpthread.so.0`pthread_cond_wait@@GLIBC_2.3.2 + 197
    frame #1: 0x00007faaabf354cc libpmix.so.2`PMIx_Finalize + 1052
    frame #2: 0x00007faaac3d6622 libopen-pal.so.40`ext3x_client_finalize + 898
    frame #3: 0x00007faaac6b0d75 libopen-rte.so.40`rte_finalize + 85
    frame #4: 0x00007faaac665852 libopen-rte.so.40`orte_finalize + 98
    frame #5: 0x00007faab7dc2aa2 libmpi.so.40`ompi_mpi_finalize + 2274
    frame #6: 0x00007faab83a4fd2 libnemesis.so`nemesis::finalize() at Functions_MPI.cc:161:17
    frame #7: 0x00007faab88a6b86 libNemesisGtest.so`nemesis::gtest_main(argc=<unavailable>, argv=<unavailable>) at Gtest_Functions.cc:285:22

The second is a failure inside Open MPI 2.1.6 itself:

* thread #1, name = 'tstS_Graph', stop reason = signal SIGSEGV
  * frame #0: 0x00007f0c74f9c5c4 libopen-pal.so.20`opal_bitmap_set_bit + 164
    frame #1: 0x00007f0c75127f08 libopen-rte.so.20`mca_oob_usock_component_set_module + 168
    frame #2: 0x00007f0c74ff2479 libopen-pal.so.20`opal_libevent2022_event_base_loop at event.c:1370:5
    frame #3: 0x00007f0c74ff23ec libopen-pal.so.20`opal_libevent2022_event_base_loop at event.c:1440
    frame #4: 0x00007f0c74ff2398 libopen-pal.so.20`opal_libevent2022_event_base_loop(base=0x0000000000a64a30, flags=1) at event.c:1644
    frame #5: 0x00007f0c74fa73ae libopen-pal.so.20`progress_engine + 30
    frame #6: 0x000000351da07aa1 libpthread.so.0`start_thread + 209
    frame #7: 0x000000351d6e8c4d libc.so.6`__clone + 109
* thread #2, stop reason = signal SIGSEGV
  * frame #0: 0x000000351da082fd libpthread.so.0`pthread_join + 269
    frame #1: 0x00007f0c74fa7c2d libopen-pal.so.20`opal_thread_join + 13
    frame #2: 0x00007f0c74fa795c libopen-pal.so.20`opal_progress_thread_finalize + 284
    frame #3: 0x00007f0c7510bb17 libopen-rte.so.20`rte_finalize + 103
    frame #4: 0x00007f0c750cb3ec libopen-rte.so.20`orte_finalize + 92
    frame #5: 0x00007f0c77d2bf02 libmpi.so.20`ompi_mpi_finalize + 1442
    frame #6: 0x00007f0c7e1ab3f2 libNemesis.so.07`nemesis::finalize() at Functions_MPI.cc:161:17
    frame #7: 0x00007f0c7e6c5ac6 libNemesis_gtest.so`nemesis::gtest_main(argc=<unavailable>, argv=<unavailable>) at Gtest_Functions.cc:285:22
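
The traces above were captured after the fact with a debugger. Because the failure is so rare, one option for catching it directly in CI (just a sketch, not something our harness currently does) is to dump an in-process backtrace when SIGSEGV fires, e.g. via glibc's execinfo:

/* segv_backtrace.c -- sketch: dump a backtrace on SIGSEGV so rare CI crashes
 * leave something in the log even when no core file is kept.
 * Note: backtrace() is not strictly async-signal-safe; this is best-effort. */
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void segv_handler(int sig)
{
    void *frames[64];
    int nframes = backtrace(frames, 64);
    /* Writes symbolized frames straight to stderr, avoiding malloc. */
    backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
    _exit(128 + sig);
}

int main(void)
{
    signal(SIGSEGV, segv_handler);
    /* ... run the tests, call MPI_Finalize(), etc. ... */
    return 0;
}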

Any help would be greatly appreciated. These errors have been driving us crazy: with a 1-in-1000 failure rate per test, we essentially never get a fully passing CI pipeline.
