Hang in progress engine binding #182

ndryden · 2023-03-08T19:07:20Z

(This issue was already fixed in #181; I'm writing it up here to document it. The description below refers to Aluminum as it was before that PR.)

The progress engine does some MPI communication to decide how to bind the progress engine thread. This involves collectives run among the processes on each physical node (i.e., there is no global collective, just concurrent collectives within each node). If progress engine startup is deferred (with AL_PE_START_ON_DEMAND), this communication is not executed until the progress engine actually starts. However, if not every rank on a node performs an operation that starts the progress engine (e.g., because the operation is point-to-point and so involves only some of the ranks), then the ranks that do start it may hang in the node-local collective, and the progress engine will not fully start.
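
The failure mode can be reproduced in miniature without MPI. The sketch below is a hypothetical Python simulation, not Aluminum code: threads stand in for the ranks on one node, a `threading.Barrier` sized for all ranks stands in for the node-local binding collective, and a timeout stands in for the hang. The names `simulate_node` and `start_progress_engine` are invented for illustration.

```python
import threading


def simulate_node(num_ranks, ranks_starting_engine, timeout=1.0):
    """Simulate deferred progress engine startup on a single node.

    Each thread plays one rank. Only ranks in `ranks_starting_engine`
    perform an operation that starts the progress engine and therefore
    enter the node-local "collective" (modelled here by a barrier that
    expects every rank on the node).

    Returns a dict mapping rank -> "started" or "hung".
    """
    # Assumption: the binding collective needs all ranks on the node.
    node_collective = threading.Barrier(num_ranks)
    results = {}
    lock = threading.Lock()

    def start_progress_engine(rank):
        try:
            # A real collective would block forever; the timeout lets the
            # simulation report the hang instead of actually hanging.
            node_collective.wait(timeout=timeout)
            outcome = "started"
        except threading.BrokenBarrierError:
            outcome = "hung"
        with lock:
            results[rank] = outcome

    threads = [
        threading.Thread(target=start_progress_engine, args=(rank,))
        for rank in sorted(ranks_starting_engine)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Every rank starts the engine (e.g., a collective operation).
    print(simulate_node(4, {0, 1, 2, 3}))
    # Only two ranks start the engine (e.g., a point-to-point operation).
    print(simulate_node(4, {0, 1}, timeout=0.2))
```

In the second call only ranks 0 and 1 reach the barrier, so neither can complete it. This mirrors the situation described above, where a point-to-point operation triggers startup on only a subset of the ranks on a node.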


ndryden added the bug label Mar 8, 2023

ndryden self-assigned this Mar 8, 2023

ndryden closed this as completed Mar 8, 2023
