Description
Problem: a 3-node MPI hello world (init/barrier/finalize) sometimes hangs in MPI_Finalize()
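The program is essentially the following minimal reproducer (a sketch of what "init/barrier/finalize" presumably looks like, not necessarily the exact program used):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();   /* occasionally hangs here */
    return 0;
}
```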
Environment:
- Open MPI v4.1.2, package: Open MPI gyllen@rzwhamo6 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021
- managed with spack/environment modules
- corona cluster at LLNL running TOSS 4 (RHEL 8.8)
- primary resource manager and launcher of MPI programs is Flux
Stack traces show:
- rank 0 is stuck in opal_common_ucx_wait_all_requests()
- ranks 1 and 2 are stuck in opal_pmix.fence_nb()
It appears that UCX relies on the nonblocking nature of fence_nb() and ucp_disconnect_nb() to let the PMI barrier and the sending of UCX disconnect requests progress in parallel. However, the flux MCA plugin's implementation of fence_nb() is actually a blocking call. A theory is that disconnect messages are sometimes queued rather than sent immediately, and the lack of progress once a rank enters the blocking fence_nb() prevents the queued messages from ever going out, so the fence never completes and the job deadlocks.
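To make the theorized failure mode concrete, here is a standalone sketch of the difference between the intended non-blocking fence and a blocking one. This is not ompi/ucx source; every function and variable name below is a hypothetical stand-in:

```c
/* Illustrative sketch only -- NOT actual ompi/ucx code.  It models why a
 * blocking fence_nb() keeps queued disconnect messages from being sent. */
#include <stdbool.h>
#include <stdio.h>

static bool fence_complete        = false; /* PMI fence finished            */
static int  queued_disconnects    = 2;     /* disconnects queued, not sent  */
static bool peers_saw_disconnects = false; /* remote ranks got our messages */

/* Stand-in for the UCX progress engine: each call flushes one queued msg. */
static void progress(void)
{
    if (queued_disconnects > 0) {
        queued_disconnects--;
        if (queued_disconnects == 0)
            peers_saw_disconnects = true;
    }
}

/* Intended pattern: fence_nb() only *starts* the fence, so the caller keeps
 * driving progress() while waiting.  Queued disconnects get flushed, remote
 * ranks can leave opal_common_ucx_wait_all_requests(), enter the fence, and
 * the fence completes. */
static void nonblocking_fence_pattern(void)
{
    while (!fence_complete || queued_disconnects > 0) {
        progress();
        if (peers_saw_disconnects)
            fence_complete = true;   /* all ranks reached the fence */
    }
    puts("fence completed; MPI_Finalize can return");
}

/* Theorized behavior with the flux plugin: fence_nb() blocks until every
 * rank enters the fence, but never drives progress while it waits.  The
 * queued disconnects are never sent, the rank stuck in
 * opal_common_ucx_wait_all_requests() never enters the fence -> deadlock. */
static void blocking_fence_pattern(void)
{
    while (!peers_saw_disconnects) {
        /* spin with no UCX progress: this loop never exits */
    }
}

int main(void)
{
    nonblocking_fence_pattern();   /* terminates */
    (void)blocking_fence_pattern;  /* calling this instead would spin forever */
    return 0;
}
```

The point is only that whichever component owns the outstanding disconnect requests must keep getting control of the progress engine; a fence_nb() that blocks internally removes that opportunity.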
Probably the right solution for users is to use the flux pmix plugin by running with -o pmi=pmix; this is confirmed to resolve the problem. However, #8380 effectively converted a segfault due to calling a NULL fence_nb() into a semi-reproducible hang, which is arguably not an improvement. Perhaps it would be better to revert it and have UCX treat the lack of a fence_nb() as a fatal runtime error.
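For example, on a system with the flux-pmix plugin installed, the launch might look something like the following (the -o pmi=pmix option is the workaround described above; the rest of the command line is an assumed invocation, not taken from the report):

```sh
flux run -N 3 -n 3 -o pmi=pmix ./mpi_hello
```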
Further details: flux-framework/flux-core#5460
Edit: that was just a theory, though! Maybe someone who knows the ompi/ucx code could confirm or deny it?