osc/ucx: MPI_Win_flush sometimes hangs on intra-node #10559

@s417-lama

Description

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v5.0.0rc7

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a source tarball.

Open MPI was configured with:

../configure CFLAGS="-g" --prefix=${PREFIX} --with-ucx=${UCX_PREFIX} --disable-man-pages --with-pmix=internal --with-hwloc=internal --with-libevent=internal --without-hcoll

UCX v1.11.0 was configured with:

./contrib/configure-release --prefix=${UCX_PREFIX}

Please describe the system on which you are running

  • Operating system/version: Linux 3.10.0-514.26.2.el7.x86_64
  • Computer hardware: Intel Xeon Gold 6154 x 2 (36 cores in total)
  • Network type: InfiniBand EDR 4x (100Gbps)

Details of the problem

One-sided communication with osc/ucx sometimes causes my program to hang indefinitely.
The hang occurs in the MPI_Win_flush() call, and it happens with intra-node execution (flat MPI).

Minimal code to reproduce this behaviour:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <mpi.h>

#define CREATE_WIN2 1
#define WIN_ALLOCATE 0

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank, nproc;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  size_t b_size = 1024;

#if WIN_ALLOCATE
  // If the window is allocated with MPI_Win_allocate, it does not hang
  MPI_Win win1;
  void* baseptr1;
  MPI_Win_allocate(b_size,
                   1,
                   MPI_INFO_NULL,
                   MPI_COMM_WORLD,
                   &baseptr1,
                   &win1);
#else
  int* buf1 = (int*)malloc(b_size);
  MPI_Win win1;
  MPI_Win_create(buf1,
                 b_size,
                 1,
                 MPI_INFO_NULL,
                 MPI_COMM_WORLD,
                 &win1);
#endif
  MPI_Win_lock_all(0, win1);

  // If the second window (win2) is not created, it does not hang
#if CREATE_WIN2
#if WIN_ALLOCATE
  MPI_Win win2;
  void* baseptr2;
  MPI_Win_allocate(b_size,
                   1,
                   MPI_INFO_NULL,
                   MPI_COMM_WORLD,
                   &baseptr2,
                   &win2);
#else
  int* buf2 = (int*)malloc(b_size);
  MPI_Win win2;
  MPI_Win_create(buf2,
                 b_size,
                 1,
                 MPI_INFO_NULL,
                 MPI_COMM_WORLD,
                 &win2);
#endif
  MPI_Win_lock_all(0, win2);
#endif

  if (rank == 0) {
    printf("start\n");
  }

  // execute MPI_Get and MPI_Win_flush for randomly chosen processes
  for (int i = 0; i < 10000; i++) {
    int t = rank;
    do {
      t = rand() % nproc;
    } while (t == rank);
    int b;
    MPI_Get(&b, 1, MPI_INT, t, 0, 1, MPI_INT, win1);
    MPI_Win_flush(t, win1); // one of the processes hangs here
  }

  MPI_Barrier(MPI_COMM_WORLD);

  if (rank == 0) {
    printf("end\n");
  }

  // the rest is for finalization
  MPI_Win_unlock_all(win1);
  MPI_Win_free(&win1);

#if CREATE_WIN2
  MPI_Win_unlock_all(win2);
  MPI_Win_free(&win2);
#endif

  MPI_Finalize();

  if (rank == 0) {
    printf("ok\n");
  }

  return 0;
}

Summarizing what I found:

  • The behaviour is non-deterministic. It does not always hang.
  • It hangs with intra-node execution (36 processes on a single node in my case).
  • When the number of processes is small, it rarely hangs.
  • One of the processes is hanging in MPI_Win_flush() when the execution gets stuck.
  • If the second window (win2) is not created (CREATE_WIN2=0), it does not hang.
  • If MPI_Win_allocate() is used instead of MPI_Win_create() (WIN_ALLOCATE=1), it does not hang.

Save the above code as, e.g., test_rma.c, then compile it and run it repeatedly:

$ mpicc test_rma.c
$ for i in $(seq 1 100); do mpirun -n 36 ./a.out; done
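
These runs assume that Open MPI's default component selection picks osc/ucx; if that is not the case on a given system, the ucx one-sided component can be requested explicitly via the osc MCA parameter, e.g.:

$ mpirun -n 36 --mca osc ucx ./a.out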

The output will look like:

start
end
ok
...
start
end
ok
start
end
ok
start
<hang>

Checking the behaviour of each process with gdb, I found that one of the processes hangs in MPI_Win_flush(), while the others have already reached MPI_Barrier().
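
The per-rank backtraces below can be obtained by attaching gdb to each a.out process after the hang, with something along these lines (where <pid> is the PID of the rank to inspect):

$ gdb -batch -ex "bt" -p <pid>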

Backtrace of the hanging process (rank 28):

#0  0x00002ac394e01d03 in opal_thread_internal_mutex_lock (p_mutex=0x2ac39441c949 <progress_callback+45>) at ../../../../../opal/mca/threads/pthreads/threads_pthreads_mutex.h:109
#1  0x00002ac394e01d96 in opal_mutex_lock (mutex=0x16e46f8) at ../../../../../opal/mca/threads/mutex.h:122
#2  0x00002ac394e01f75 in opal_common_ucx_wait_request_mt (request=0x171aa10, msg=0x2ac394e4d798 "ucp_ep_flush_nb") at ../../../../../opal/mca/common/ucx/common_ucx_wpool.h:278
#3  0x00002ac394e0421f in opal_common_ucx_winfo_flush (winfo=0x16e46d0, target=27, type=OPAL_COMMON_UCX_FLUSH_B, scope=OPAL_COMMON_UCX_SCOPE_EP, req_ptr=0x0) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:796
#4  0x00002ac394e042db in opal_common_ucx_wpmem_flush (mem=0x16f51e0, scope=OPAL_COMMON_UCX_SCOPE_EP, target=27) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:838
#5  0x00002ac394422fa2 in ompi_osc_ucx_flush (target=27, win=0x15a6410) at ../../../../../ompi/mca/osc/ucx/osc_ucx_passive_target.c:282
#6  0x00002ac3942f62ed in PMPI_Win_flush (rank=27, win=0x15a6410) at ../../../../ompi/mpi/c/win_flush.c:57
#7  0x0000000000400db1 in main ()

Backtrace of the other processes (all waiting in MPI_Barrier()):

#0  ucs_callbackq_dispatch (cbq=<optimized out>) at /.../ucx/1.11.0/ucx-1.11.0/src/ucs/datastruct/callbackq.h:211
#1  uct_worker_progress (worker=<optimized out>) at /.../ucx/1.11.0/ucx-1.11.0/src/uct/api/uct.h:2592
#2  ucp_worker_progress (worker=0x25e7540) at core/ucp_worker.c:2635
#3  0x00002ad7b4cfacb0 in opal_common_ucx_wpool_progress (wpool=0x224ec20) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:281
#4  0x00002ad7b4314949 in progress_callback () at ../../../../../ompi/mca/osc/ucx/osc_ucx_component.c:205
#5  0x00002ad7b4c9b334 in opal_progress () at ../../opal/runtime/opal_progress.c:224
#6  0x00002ad7b415aaff in ompi_request_wait_completion (req=0x2370490) at ../../ompi/request/request.h:488
#7  0x00002ad7b415ab68 in ompi_request_default_wait (req_ptr=0x7ffde5f1b0f0, status=0x7ffde5f1b0d0) at ../../ompi/request/req_wait.c:40
#8  0x00002ad7b42299b4 in ompi_coll_base_sendrecv_zero (dest=3, stag=-16, source=3, rtag=-16, comm=0x6023c0 <ompi_mpi_comm_world>) at ../../../../ompi/mca/coll/base/coll_base_barrier.c:64
#9  0x00002ad7b4229d4a in ompi_coll_base_barrier_intra_recursivedoubling (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0) at ../../../../ompi/mca/coll/base/coll_base_barrier.c:210
#10 0x00002ad7b4240672 in ompi_coll_tuned_barrier_intra_do_this (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0, algorithm=3, faninout=0, segsize=0) at ../../../../../ompi/mca/coll/tuned/coll_tuned_barrier_decision.c:101
#11 0x00002ad7b42397e3 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0) at ../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:500
#12 0x00002ad7b418100b in PMPI_Barrier (comm=0x6023c0 <ompi_mpi_comm_world>) at ../../../../ompi/mpi/c/barrier.c:76
#13 0x0000000000400dc8 in main ()
