OMPI + UCX segfault during MPI_Finalize #7968

Closed
@bosilca

Description

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI master, tested against a full set of UCX flavors: 1.7, 1.8, 1.9, and master. All UCX versions were configured with --with-cuda --with-avx --enable-mt

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git pull
configure --prefix=... --enable-picky --enable-debug --disable-heterogeneous --enable-visibility \
          --enable-mpirun-prefix-by-default --with-pmix=internal --with-ucx=/opt/ucx/XXX/fast

Please describe the system on which you are running

Scientific Linux release 7.4 (Nitrogen)
Cluster with a 56 Gb/s Mellanox InfiniBand interconnect:

hca_id:	mthca0
	transport:			InfiniBand (0)
	fw_ver:				4.8.200

Details of the problem

Any application run on 2 nodes, including the most trivial benchmark, fails during MPI_Finalize. Everything runs fine on a single node.
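
For reference, a minimal sketch of the kind of two-node run that hits the crash (the file name and the rank-pairing logic are hypothetical; any MPI program that exercises the UCX PML across two nodes fails the same way):

/* finalize_repro.c -- hypothetical minimal reproducer */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, sendbuf = 0, recvbuf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank < 2) {
        int peer = rank ^ 1;  /* pair ranks 0 and 1 across the two nodes */
        MPI_Sendrecv(&sendbuf, 1, MPI_INT, peer, 0,
                     &recvbuf, 1, MPI_INT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();  /* the segfault fires in here, during del_procs */
    return 0;
}

Launched across two nodes with, e.g., mpirun -np 2 --map-by node ./finalize_repro, it produces the backtrace below.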

 0 0x0000000000053b83 ucs_debug_print_backtrace()  src/ucs/debug/debug.c:653
 1 0x000000000001cb8b uct_rc_verbs_handle_failure()  src/uct/ib/rc/verbs/rc_verbs_iface.c:63
 2 0x000000000001d195 uct_rc_verbs_iface_poll_tx()  src/uct/ib/rc/verbs/rc_verbs_iface.c:104
 3 0x000000000001d195 uct_rc_verbs_iface_progress()  src/uct/ib/rc/verbs/rc_verbs_iface.c:132
 4 0x000000000002a79a ucs_callbackq_dispatch()  src/ucs/datastruct/callbackq.h:211
 5 0x000000000002a79a uct_worker_progress()  src/uct/api/uct.h:2221
 6 0x000000000002a79a ucp_worker_progress()  src/ucp/core/ucp_worker.c:1951
 7 0x0000000000002d0c opal_common_ucx_wait_request()  ompi/build/debug/opal/mca/common/ucx/../../../../../../opal/mca/common/ucx/common_ucx.h:182
 8 0x00000000000032f8 opal_common_ucx_wait_all_requests()  ompi/build/debug/opal/mca/common/ucx/../../../../../../opal/mca/common/ucx/common_ucx.c:210
 9 0x0000000000003574 opal_common_ucx_del_procs_nofence()  ompi/build/debug/opal/mca/common/ucx/../../../../../../opal/mca/common/ucx/common_ucx.c:254
10 0x00000000000035fb opal_common_ucx_del_procs()  ompi/build/debug/opal/mca/common/ucx/../../../../../../opal/mca/common/ucx/common_ucx.c:272
11 0x0000000000006877 mca_pml_ucx_del_procs()  ompi/build/debug/ompi/mca/pml/ucx/../../../../../../ompi/mca/pml/ucx/pml_ucx.c:481
12 0x0000000000071510 ompi_mpi_finalize()  ompi/build/debug/ompi/../../../ompi/runtime/ompi_mpi_finalize.c:326
13 0x00000000000b7cb1 PMPI_Finalize()  ompi/build/debug/ompi/mpi/c/profile/pfinalize.c:54
14 0x0000000000400b35 main()  osu/osu_latency.c:305
15 0x00000000000223d5 __libc_start_main()  ???:0
16 0x0000000000400db2 _start()  ???:0
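
For context on frames 5-8: opal_common_ucx_wait_request() drives the UCP progress engine until the pending request completes, which is why a transport-level error detected on the RC verbs TX path (frames 1-3) surfaces from inside MPI_Finalize. A rough sketch of that wait pattern, assuming only the shape implied by the backtrace (this is not the actual OMPI source):

/* Sketch of the wait loop implied by frames 5-8: poll the request to
 * completion by spinning the UCP progress engine, so any error on the
 * RC verbs TX path is handled from within this loop during
 * del_procs/finalize. */
#include <ucp/api/ucp.h>

static ucs_status_t wait_request(ucp_worker_h worker, void *request)
{
    ucs_status_t status;

    if (request == NULL)
        return UCS_OK;
    if (UCS_PTR_IS_ERR(request))
        return UCS_PTR_STATUS(request);

    do {
        ucp_worker_progress(worker);               /* frame 6 */
        status = ucp_request_check_status(request);
    } while (status == UCS_INPROGRESS);

    ucp_request_free(request);
    return status;
}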
