Description
Problem: a 3-node MPI hello world (init/barrier/finalize) sometimes hangs in MPI_Finalize()
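The program is essentially the following minimal reproducer (a sketch of what "init/barrier/finalize" presumably looks like, not necessarily the exact program used):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();   /* occasionally hangs here */
    return 0;
}
```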
Environment:
- Open MPI v4.1.2, package: Open MPI gyllen@rzwhamo6 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021
- managed with spack/environment modules
- corona cluster at LLNL running TOSS 4 (RHEL 8.8)
- primary resource manager and launcher of MPI programs is Flux
Stack traces show:
- rank 0 is stuck in opal_common_ucx_wait_all_requests()
- ranks 1 and 2 are stuck in opal_pmix.fence_nb()
It appears that UCX relies on the nonblocking nature of fence_nb() and ucp_disconnect_nb() to let the PMI barrier and the sending of UCX disconnect requests progress in parallel. However, the flux MCA plugin's implementation of fence_nb() is actually a blocking call. A theory is that disconnect messages are sometimes queued rather than sent immediately, and the lack of progress once a rank enters the blocking fence_nb() prevents the queued messages from ever going out, so the fence never completes and the job deadlocks.
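To make the theorized failure mode concrete, here is a standalone sketch of the difference between the intended non-blocking fence and a blocking one. This is not ompi/ucx source; every function and variable name below is a hypothetical stand-in:

```c
/* Illustrative sketch only -- NOT actual ompi/ucx code.  It models why a
 * blocking fence_nb() keeps queued disconnect messages from being sent. */
#include <stdbool.h>
#include <stdio.h>

static bool fence_complete        = false; /* PMI fence finished            */
static int  queued_disconnects    = 2;     /* disconnects queued, not sent  */
static bool peers_saw_disconnects = false; /* remote ranks got our messages */

/* Stand-in for the UCX progress engine: each call flushes one queued msg. */
static void progress(void)
{
    if (queued_disconnects > 0) {
        queued_disconnects--;
        if (queued_disconnects == 0)
            peers_saw_disconnects = true;
    }
}

/* Intended pattern: fence_nb() only *starts* the fence, so the caller keeps
 * driving progress() while waiting.  Queued disconnects get flushed, remote
 * ranks can leave opal_common_ucx_wait_all_requests(), enter the fence, and
 * the fence completes. */
static void nonblocking_fence_pattern(void)
{
    while (!fence_complete || queued_disconnects > 0) {
        progress();
        if (peers_saw_disconnects)
            fence_complete = true;   /* all ranks reached the fence */
    }
    puts("fence completed; MPI_Finalize can return");
}

/* Theorized behavior with the flux plugin: fence_nb() blocks until every
 * rank enters the fence, but never drives progress while it waits.  The
 * queued disconnects are never sent, the rank stuck in
 * opal_common_ucx_wait_all_requests() never enters the fence -> deadlock. */
static void blocking_fence_pattern(void)
{
    while (!peers_saw_disconnects) {
        /* spin with no UCX progress: this loop never exits */
    }
}

int main(void)
{
    nonblocking_fence_pattern();   /* terminates */
    (void)blocking_fence_pattern;  /* calling this instead would spin forever */
    return 0;
}
```

The point is only that whichever component owns the outstanding disconnect requests must keep getting control of the progress engine; a fence_nb() that blocks internally removes that opportunity.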
Probably the right solution for users is to use the flux pmix plugin by running with -o pmi=pmix; this is confirmed to resolve the problem. However, #8380 effectively converted a segfault due to calling a NULL fence_nb() into a semi-reproducible hang, which is arguably not an improvement. Perhaps it would be better to revert it and have UCX treat the lack of a fence_nb() as a fatal runtime error.
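For example, on a system with the flux-pmix plugin installed, the launch might look something like the following (the -o pmi=pmix option is the workaround described above; the rest of the command line is an assumed invocation, not taken from the report):

```sh
flux run -N 3 -n 3 -o pmi=pmix ./mpi_hello
```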
Further details: flux-framework/flux-core#5460
Edit: that was just a theory, though! Maybe someone who knows the ompi/ucx code could confirm or deny it?