
flux MCA plugin that implements fence_nb() as a blocking interface causes deadlock in UCX teardown #11938


Description

@garlick

Problem: a 3-node MPI hello world (init/barrier/finalize) sometimes hangs in MPI_Finalize()

Environment:

  • Open MPI v4.1.2, package: Open MPI gyllen@rzwhamo6 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021
  • managed with spack/environment modules
  • corona cluster at LLNL running TOSS 4 (RHEL 8.8)
  • primary resource manager and launcher of MPI programs is Flux

Stack traces show

  • rank 0 is stuck in opal_common_ucx_wait_all_requests()
  • ranks 1 and 2 are stuck in opal_pmix.fence_nb()

It appears that UCX relies on the nonblocking nature of fence_nb() and ucp_disconnect_nb() to let the PMI barrier and the sending of UCX disconnect requests progress in parallel. However, the flux MCA plugin implements fence_nb() as a blocking call. One theory is that disconnect messages are sometimes queued locally rather than sent immediately; when a rank then enters the (blocking) fence_nb(), progress stops, the queued disconnects are never pushed out, the fence on the peer ranks can never complete, and the job deadlocks. A rough sketch of the pattern follows.
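
The compile-only sketch below illustrates that overlap under the stated theory. All names (disconnect_nb, request_is_done, worker_progress, pmi_fence_nb, ucx_teardown_sketch) are stand-ins rather than actual ompi/ucx symbols; the comments note the real calls they roughly correspond to.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void disconnect_req_t;                          /* opaque stand-in request handle  */

extern disconnect_req_t *disconnect_nb(int peer);       /* e.g. ucp_ep_close_nb()          */
extern bool request_is_done(disconnect_req_t *req);     /* e.g. ucp_request_check_status() */
extern void worker_progress(void);                      /* e.g. ucp_worker_progress()      */
extern int  pmi_fence_nb(void (*cb)(void *), void *);   /* e.g. opal_pmix.fence_nb()       */

static volatile bool fence_done = false;
static void fence_cb(void *arg) { (void)arg; fence_done = true; }

void ucx_teardown_sketch(int nprocs, int myrank)
{
    disconnect_req_t *reqs[64];
    int nreqs = 0;

    /* 1. Queue a disconnect to every peer.  Some of these may only be
     *    queued locally rather than pushed onto the wire right away. */
    for (int peer = 0; peer < nprocs && nreqs < 64; peer++) {
        if (peer != myrank)
            reqs[nreqs++] = disconnect_nb(peer);
    }

    /* 2. Start the PMI fence.  This must return immediately so that
     *    step 3 can run; a blocking implementation stalls right here. */
    pmi_fence_nb(fence_cb, NULL);

    /* 3. Progress the UCX worker until the fence and every disconnect
     *    request complete.  This loop is what flushes queued disconnects;
     *    if it never runs, peers waiting on those messages in their own
     *    fence wait forever. */
    while (!fence_done)
        worker_progress();
    for (int i = 0; i < nreqs; i++)
        while (!request_is_done(reqs[i]))
            worker_progress();
}
```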

The right solution for users is probably to use the flux pmix plugin by running with -o pmi=pmix; this is confirmed to resolve the problem. However, #8380 effectively converted a segfault from calling a NULL fence_nb() into a semi-reproducible hang, which is arguably not an improvement. Perhaps it would be better to revert it and have UCX treat the absence of fence_nb() as a fatal runtime error (see the sketch below).
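
A minimal sketch of that fail-fast alternative, using an illustrative struct and helper name rather than the real opal_pmix interface:

```c
/* Sketch only: refuse to proceed when the active PMI plugin offers no true
 * nonblocking fence, instead of substituting a blocking one.
 * pmi_plugin and require_nonblocking_fence are illustrative names,
 * not ompi symbols. */
#include <stdio.h>
#include <stdlib.h>

typedef int (*fence_nb_fn)(void (*cb)(void *arg), void *arg);

struct pmi_plugin {
    fence_nb_fn fence_nb;   /* NULL if only a blocking fence is available */
};

static void require_nonblocking_fence(const struct pmi_plugin *pmi)
{
    if (NULL == pmi->fence_nb) {
        fprintf(stderr, "UCX teardown requires a nonblocking PMI fence "
                        "(fence_nb), which the active PMI plugin does not "
                        "provide; aborting instead of hanging.\n");
        exit(EXIT_FAILURE);
    }
}
```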

Further details: flux-framework/flux-core#5460

Edit: the above is just a theory! Maybe someone who knows the ompi/ucx code could confirm or deny it?
