(This issue is already fixed in #181, but I'm writing up an issue to document it. I'm writing about Aluminum as it was before that PR.)
The progress engine uses MPI communication to decide how to bind its thread. This involves collectives run among the processes on each physical node (i.e., there is no global collective, just concurrent collectives within each node). If progress engine startup is deferred (with `AL_PE_START_ON_DEMAND`), these collectives are not executed until the progress engine actually starts. However, if not every rank on a node performs an operation that starts the progress engine (e.g., because only some of them are doing a point-to-point operation), the ranks that do start it may hang, and the progress engine never fully starts.
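To make the failure mode concrete, here is a minimal sketch that simulates one node in plain Python, not Aluminum or MPI code. It assumes the node-local binding collective behaves like a barrier that every rank on the node must enter. The names `simulate_node` and `start_progress_engine` are hypothetical, and the timeout only stands in for what would be an indefinite hang in a real run.

```python
import threading


def simulate_node(ranks_on_node, ranks_starting_pe, timeout=0.5):
    """Simulate the node-local collective run at progress engine startup.

    Returns {rank: True} if the collective completed for that rank, or
    {rank: False} if it would have hung (modelled here as a timeout).
    """
    # Stand-in for the node-local collective: all ranks on the node must arrive.
    node_collective = threading.Barrier(ranks_on_node)
    results = {}
    lock = threading.Lock()

    def start_progress_engine(rank):
        try:
            node_collective.wait(timeout=timeout)
            completed = True
        except threading.BrokenBarrierError:
            # Some rank on the node never entered the collective.
            completed = False
        with lock:
            results[rank] = completed

    # Only the ranks that perform a progress-engine-starting operation
    # ever reach the collective (deferred startup).
    threads = [
        threading.Thread(target=start_progress_engine, args=(rank,))
        for rank in ranks_starting_pe
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Every rank starts the progress engine: the collective completes.
    print(simulate_node(4, [0, 1, 2, 3]))
    # Only ranks 0 and 1 start it (e.g., a point-to-point pair): both stall.
    print(simulate_node(4, [0, 1]))
```

In the second call, ranks 2 and 3 never enter the collective, so ranks 0 and 1 cannot complete it. This corresponds to the hang described above.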