Closed
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Master, tested against a full set of UCX flavors: 1.7, 1.8, 1.9, and master. All UCX versions were configured with --with-cuda --with-avx --enable-mt
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git pull
configure --prefix=... --enable-picky --enable-debug --disable-heterogeneous --enable-visibility \
--enable-mpirun-prefix-by-default --with-pmix=internal --with-ucx=/opt/ucx/XXX/fast
Please describe the system on which you are running
Scientific Linux release 7.4 (Nitrogen)
Mellanox cluster, 56 Gbps
hca_id: mthca0
transport: InfiniBand (0)
fw_ver: 4.8.200
Details of the problem
Any application run on 2 nodes, including the most trivial benchmark, fails during MPI_Finalize. Everything seems to run fine on a single node. Backtrace:
0 0x0000000000053b83 ucs_debug_print_backtrace() src/ucs/debug/debug.c:653
1 0x000000000001cb8b uct_rc_verbs_handle_failure() src/uct/ib/rc/verbs/rc_verbs_iface.c:63
2 0x000000000001d195 uct_rc_verbs_iface_poll_tx() src/uct/ib/rc/verbs/rc_verbs_iface.c:104
3 0x000000000001d195 uct_rc_verbs_iface_progress() src/uct/ib/rc/verbs/rc_verbs_iface.c:132
4 0x000000000002a79a ucs_callbackq_dispatch() src/ucs/datastruct/callbackq.h:211
5 0x000000000002a79a uct_worker_progress() src/uct/api/uct.h:2221
6 0x000000000002a79a ucp_worker_progress() src/ucp/core/ucp_worker.c:1951
7 0x0000000000002d0c opal_common_ucx_wait_request() ompi/build/debug/opal/mca/common/ucx/../../../../../../opal/mca/common/ucx/common_ucx.h:182
8 0x00000000000032f8 opal_common_ucx_wait_all_requests() ompi/build/debug/opal/mca/common/ucx/../../../../../../opal/mca/common/ucx/common_ucx.c:210
9 0x0000000000003574 opal_common_ucx_del_procs_nofence() ompi/build/debug/opal/mca/common/ucx/../../../../../../opal/mca/common/ucx/common_ucx.c:254
10 0x00000000000035fb opal_common_ucx_del_procs() ompi/build/debug/opal/mca/common/ucx/../../../../../../opal/mca/common/ucx/common_ucx.c:272
11 0x0000000000006877 mca_pml_ucx_del_procs() ompi/build/debug/ompi/mca/pml/ucx/../../../../../../ompi/mca/pml/ucx/pml_ucx.c:481
12 0x0000000000071510 ompi_mpi_finalize() ompi/build/debug/ompi/../../../ompi/runtime/ompi_mpi_finalize.c:326
13 0x00000000000b7cb1 PMPI_Finalize() ompi/build/debug/ompi/mpi/c/profile/pfinalize.c:54
14 0x0000000000400b35 main() osu/osu_latency.c:305
15 0x00000000000223d5 __libc_start_main() ???:0
16 0x0000000000400db2 _start() ???:0