Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.0rc7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a source tarball.
configured with:
../configure CFLAGS="-g" --prefix=${PREFIX} --with-ucx=${UCX_PREFIX} --disable-man-pages --with-pmix=internal --with-hwloc=internal --with-libevent=internal --without-hcoll
UCX v1.11.0 was configured with:
./contrib/configure-release --prefix=${UCX_PREFIX}
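For reference, a build sequence consistent with the configure lines above would be roughly the following (a sketch only; directory names are placeholders, prefixes as above):
# UCX 1.11.0
cd ucx-1.11.0
./contrib/configure-release --prefix=${UCX_PREFIX}
make -j && make install
# Open MPI v5.0.0rc7, built in a separate build directory (hence the ../configure)
cd ../openmpi-5.0.0rc7 && mkdir build && cd build
../configure CFLAGS="-g" --prefix=${PREFIX} --with-ucx=${UCX_PREFIX} --disable-man-pages --with-pmix=internal --with-hwloc=internal --with-libevent=internal --without-hcoll
make -j && make install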
Please describe the system on which you are running
- Operating system/version: Linux 3.10.0-514.26.2.el7.x86_64
- Computer hardware: Intel Xeon Gold 6154 x 2 (36 cores in total)
- Network type: InfiniBand EDR 4x (100Gbps)
Details of the problem
One-sided communication with osc/ucx sometimes causes my program to hang indefinitely. The hang occurs in the MPI_Win_flush() call, and it happens with intra-node execution (flat MPI).
Minimal code to reproduce this behaviour:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>
#define CREATE_WIN2 1
#define WIN_ALLOCATE 0
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    size_t b_size = 1024;

#if WIN_ALLOCATE
    // If the window is allocated with MPI_Win_allocate, it does not hang
    MPI_Win win1;
    void* baseptr1;
    MPI_Win_allocate(b_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &baseptr1, &win1);
#else
    int* buf1 = (int*)malloc(b_size);
    MPI_Win win1;
    MPI_Win_create(buf1, b_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win1);
#endif
    MPI_Win_lock_all(0, win1);

    // If the second window (win2) is not allocated, it does not hang
#if CREATE_WIN2
#if WIN_ALLOCATE
    MPI_Win win2;
    void* baseptr2;
    MPI_Win_allocate(b_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &baseptr2, &win2);
#else
    int* buf2 = (int*)malloc(b_size);
    MPI_Win win2;
    MPI_Win_create(buf2, b_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win2);
#endif
    MPI_Win_lock_all(0, win2);
#endif

    if (rank == 0) {
        printf("start\n");
    }

    // execute MPI_Get and MPI_Win_flush for randomly chosen processes
    for (int i = 0; i < 10000; i++) {
        int t = rank;
        do {
            t = rand() % nproc;
        } while (t == rank);
        int b;
        MPI_Get(&b, 1, MPI_INT, t, 0, 1, MPI_INT, win1);
        MPI_Win_flush(t, win1); // one of the processes hangs here
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        printf("end\n");
    }

    // the rest is for finalization
    MPI_Win_unlock_all(win1);
    MPI_Win_free(&win1);
#if CREATE_WIN2
    MPI_Win_unlock_all(win2);
    MPI_Win_free(&win2);
#endif
    MPI_Finalize();

    if (rank == 0) {
        printf("ok\n");
    }
    return 0;
}
Summarizing what I found:
- The behaviour is non-deterministic; it does not always hang.
- It hangs in intra-node runs (36 processes on one node in my case).
- When the number of processes is small, it rarely hangs.
- One of the processes is stuck in MPI_Win_flush() when the execution hangs.
- If the second window (win2) is not created (CREATE_WIN2=0), it does not hang.
- If MPI_Win_allocate() is used instead of MPI_Win_create() (WIN_ALLOCATE=1), it does not hang.
Save the above code (e.g., as test_rma.c), compile and run it repeatedly:
$ mpicc test_rma.c
$ for i in $(seq 1 100); do mpirun -n 36 ./a.out; done
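If a different osc component gets selected by default on another system, osc/ucx can be forced explicitly via the MCA parameter (a sketch of the variant command):
$ for i in $(seq 1 100); do mpirun --mca osc ucx -n 36 ./a.out; done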
The output will look like:
start
end
ok
...
start
end
ok
start
end
ok
start
<hang>
Checking the behaviour of each process with gdb, I found that one of the processes hangs in MPI_Win_flush(), while the others have already reached MPI_Barrier().
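Per-process backtraces like the ones below can be collected by attaching gdb to each rank, roughly as follows (a sketch; selecting the pids with pgrep assumes the ranks show up as ./a.out processes on the node):
$ for pid in $(pgrep -f ./a.out); do gdb -batch -p $pid -ex bt; done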
Backtrace of the hanging process (rank 28):
#0 0x00002ac394e01d03 in opal_thread_internal_mutex_lock (p_mutex=0x2ac39441c949 <progress_callback+45>) at ../../../../../opal/mca/threads/pthreads/threads_pthreads_mutex.h:109
#1 0x00002ac394e01d96 in opal_mutex_lock (mutex=0x16e46f8) at ../../../../../opal/mca/threads/mutex.h:122
#2 0x00002ac394e01f75 in opal_common_ucx_wait_request_mt (request=0x171aa10, msg=0x2ac394e4d798 "ucp_ep_flush_nb") at ../../../../../opal/mca/common/ucx/common_ucx_wpool.h:278
#3 0x00002ac394e0421f in opal_common_ucx_winfo_flush (winfo=0x16e46d0, target=27, type=OPAL_COMMON_UCX_FLUSH_B, scope=OPAL_COMMON_UCX_SCOPE_EP, req_ptr=0x0) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:796
#4 0x00002ac394e042db in opal_common_ucx_wpmem_flush (mem=0x16f51e0, scope=OPAL_COMMON_UCX_SCOPE_EP, target=27) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:838
#5 0x00002ac394422fa2 in ompi_osc_ucx_flush (target=27, win=0x15a6410) at ../../../../../ompi/mca/osc/ucx/osc_ucx_passive_target.c:282
#6 0x00002ac3942f62ed in PMPI_Win_flush (rank=27, win=0x15a6410) at ../../../../ompi/mpi/c/win_flush.c:57
#7 0x0000000000400db1 in main ()
Backtrace of the other processes:
#0 ucs_callbackq_dispatch (cbq=<optimized out>) at /.../ucx/1.11.0/ucx-1.11.0/src/ucs/datastruct/callbackq.h:211
#1 uct_worker_progress (worker=<optimized out>) at /.../ucx/1.11.0/ucx-1.11.0/src/uct/api/uct.h:2592
#2 ucp_worker_progress (worker=0x25e7540) at core/ucp_worker.c:2635
#3 0x00002ad7b4cfacb0 in opal_common_ucx_wpool_progress (wpool=0x224ec20) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:281
#4 0x00002ad7b4314949 in progress_callback () at ../../../../../ompi/mca/osc/ucx/osc_ucx_component.c:205
#5 0x00002ad7b4c9b334 in opal_progress () at ../../opal/runtime/opal_progress.c:224
#6 0x00002ad7b415aaff in ompi_request_wait_completion (req=0x2370490) at ../../ompi/request/request.h:488
#7 0x00002ad7b415ab68 in ompi_request_default_wait (req_ptr=0x7ffde5f1b0f0, status=0x7ffde5f1b0d0) at ../../ompi/request/req_wait.c:40
#8 0x00002ad7b42299b4 in ompi_coll_base_sendrecv_zero (dest=3, stag=-16, source=3, rtag=-16, comm=0x6023c0 <ompi_mpi_comm_world>) at ../../../../ompi/mca/coll/base/coll_base_barrier.c:64
#9 0x00002ad7b4229d4a in ompi_coll_base_barrier_intra_recursivedoubling (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0) at ../../../../ompi/mca/coll/base/coll_base_barrier.c:210
#10 0x00002ad7b4240672 in ompi_coll_tuned_barrier_intra_do_this (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0, algorithm=3, faninout=0, segsize=0) at ../../../../../ompi/mca/coll/tuned/coll_tuned_barrier_decision.c:101
#11 0x00002ad7b42397e3 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0) at ../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:500
#12 0x00002ad7b418100b in PMPI_Barrier (comm=0x6023c0 <ompi_mpi_comm_world>) at ../../../../ompi/mpi/c/barrier.c:76
#13 0x0000000000400dc8 in main ()