Hang in progress engine binding #182

Closed
ndryden opened this issue Mar 8, 2023 · 0 comments
Labels
bug Something isn't working

Comments

ndryden (Collaborator) commented Mar 8, 2023

(This issue is already fixed in #181, but I'm writing up an issue to document it. I'm writing about Aluminum as it was before that PR.)

The progress engine performs some MPI communication to decide how to bind the progress engine thread. This involves collectives run among the processes on each physical node (i.e., there is no global collective, just concurrent collectives within each node). If progress engine startup is deferred (with AL_PE_START_ON_DEMAND), these collectives are not executed until the progress engine actually starts. However, if not every rank on a node performs an operation that starts the progress engine (e.g., because some ranks are only doing point-to-point operations), the ranks that do start it wait in a node-local collective that the other ranks never join. Those ranks hang, and the progress engine never finishes starting.
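The failure mode can be modeled without MPI. Below is a hypothetical Python simulation, not Aluminum code. Threads stand in for the ranks on one node, and a `threading.Barrier` stands in for the node-local collective that picks the binding. The names `start_progress_engine` and `simulate` are illustrative only. The barrier timeout exists solely so the demo reports a hang instead of blocking forever.

```python
import threading


def start_progress_engine(barrier, rank, results):
    """Hypothetical deferred startup: join the node-local 'collective', then bind."""
    try:
        # Models the node-local collective: every rank on the node must arrive.
        barrier.wait()
        results[rank] = "started"
    except threading.BrokenBarrierError:
        # Raised only because of the demo timeout; a real collective would block forever.
        results[rank] = "hung"


def simulate(ranks_on_node, ranks_that_start, timeout=0.5):
    """Run startup only on ranks_that_start; the other ranks never join the barrier."""
    barrier = threading.Barrier(ranks_on_node, timeout=timeout)
    results = {}
    threads = [
        threading.Thread(target=start_progress_engine, args=(barrier, r, results))
        for r in ranks_that_start
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With `simulate(4, range(4))`, every rank reports `"started"`. With `simulate(4, [0, 1])`, where ranks 2 and 3 only do point-to-point work and never trigger startup, ranks 0 and 1 report `"hung"`.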

ndryden added the bug (Something isn't working) label Mar 8, 2023
ndryden self-assigned this Mar 8, 2023
ndryden closed this as completed Mar 8, 2023