When direct launching applications, we must allow the MPI layer to pr… #1668
Conversation
@hppritcha Please take a look at the Cray support - I think this is applicable there as well, but I can't verify it.
@rhc54 I've been watching MLNX Jenkins on most PRs and this error is new and looks to be related to your change. It's coming from PMIx.
Agreed - that's why I just pushed an update 😄
Did you mean @jsquyres?
sorry, no - should have been @nysal
I like the comments in ompi/runtime/ompi_mpi_finalize.c (especially the one for the nb case). Except for the fact that the MPI barrier has been removed, I fail to see any fundamental difference between the current solution and the newly proposed one. I might have missed something, but why is the MPI barrier unnecessary?
Sorry about that - I'll restore the comments explaining the cases. Looking this over again, I do believe I made a mistake when bringing this back across from the 1.10 series. We do still need the MPI_Barrier in those cases where the fence is blocking during finalize. The only change there should have been to use OMPI_LAZY_WAIT_FOR_COMPLETION. The change in MPI_Init was required because some transports need to progress even during the modex, and especially during the barrier at the end of MPI_Init. I couldn't find a nice way to do that in macro form, though I'm still looking at it.
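For readers following the OMPI_LAZY_WAIT_FOR_COMPLETION reference: the lazy wait amounts to a polling loop around opal_progress. A minimal sketch of the pattern, assuming a volatile flag that the completion callback clears (the exact macro body in the tree may differ):

```c
/* Sketch of the lazy-wait pattern; not the exact OMPI macro.
 * "flg" is a volatile flag cleared by the completion callback,
 * and opal_progress() is OMPI's existing progress engine. */
#include <unistd.h>   /* usleep */

#define LAZY_WAIT_FOR_COMPLETION(flg)                               \
    do {                                                            \
        while (flg) {                                               \
            opal_progress();  /* keep MPI/BTL traffic moving */     \
            usleep(100);      /* back off to keep cpu usage low */  \
        }                                                           \
    } while (0)
```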
@bosilca I have updated this per the comments. The primary change here is that I made the fence in the SLURM and Cray pmix components non-blocking, thus allowing us to make progress while we are waiting for the fence to complete. Thus, the MPI barrier in MPI_Finalize will never actually be executed with today's opal/pmix components. However, I left it there in case someone someday adds another opal/pmix component that doesn't have a non-blocking fence. I doubt that will happen, but...one never knows.

One question I still have: in MPI_Init, when we do the modex, we still make everyone wait at that place until the fence completes (unless they specify async_modex, in which case we just fall thru). Given a non-blocking fence, we could allow the process to continue executing steps in MPI_Init until it reached a point where the fence absolutely must have completed - at that point, we could introduce the "wait_for_complete". In many cases, the fence will already have completed by that time, and so we won't wait. But if it hadn't completed, we could wait there. Are there any steps in MPI_Init after modex and before the next fence where it would make sense to allow a process to continue? Given that we have the ability to fall-thru when async_modex is provided, I'm wondering if this is possible.
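In rough pseudo-C, the finalize path described here would look like the following. Note that fence_nb(), fence(), have_fence_nb, and fence_release() are illustrative stand-ins rather than the exact symbols in ompi_mpi_finalize.c, and the real code invokes the communicator's coll barrier rather than calling MPI_Barrier() directly:

```c
/* Illustrative sketch of the MPI_Finalize flow described above;
 * fence_nb()/fence() stand in for the opal/pmix component hooks. */
static volatile bool fence_active;

static void fence_release(int status, void *cbdata)
{
    (void)status; (void)cbdata;
    fence_active = false;         /* completion callback from the RTE fence */
}

static void rte_finalize_barrier(void)
{
    fence_active = true;
    if (have_fence_nb) {          /* component provides a non-blocking fence */
        fence_nb(fence_release, NULL);
        while (fence_active) {    /* lazy wait, as sketched above */
            opal_progress();      /* MPI traffic keeps draining here, so no
                                   * extra MPI barrier is required */
            usleep(100);
        }
    } else {
        /* a blocking fence cannot progress the MPI layer, so complete all
         * MPI communication first, then enter the RTE barrier */
        MPI_Barrier(MPI_COMM_WORLD);
        fence();
    }
}
```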
…ogress during RTE-level barriers. Neither SLURM nor Cray provide non-blocking fence functions, so push those calls into a separate event thread (use the OPAL async thread for this purpose so we don't create another one) and let the MPI thread spin in wait_for_completion. This also restores the "lazy" completion during MPI_Finalize to minimize cpu utilization.

Update external as well.

Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro.
bot:retest
Given this change, which thread is going to run the fence operation? This seems like it could have a significant impact on performance of the fence operation, esp. at scale.
👎
I'm confused as to the problem we're trying to address here. Things were working fine for Cray srun/aprun MPI_Init prior to this PR. Note there's an issue in finalize with oob shutdown, but that's different and is being tracked by issue #1527.
The failure you hit is due to the Cray pmi component not being updated during a prior change - I can fix that here. It has nothing to do with this change.

The issue that was resolved in master and subsequently moved to 2.0 occurred because the fence during finalize did not progress the MPI layer. Thus, we needed a way to ensure that all MPI communication was completed prior to exiting the RTE barrier in finalize. We added an MPI_Barrier to do that, but that means we are doing two barriers in finalize. This change pushes the RTE barrier into the async progress thread that ORTE is running, and then lets the MPI layer loop on opal_progress while waiting for the RTE barrier to complete. It has zero impact on scalability - the PMIx component will allow things to progress asynchronously, while the Cray and SLURM components will immediately block the ORTE thread in their PMI calls. So we get the required behavior without introducing a second barrier operation.
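To make the mechanism concrete, here is a small standalone illustration of the pattern (plain pthreads rather than the OPAL event machinery, and all names are stand-ins): the blocking fence runs in a helper thread while the main thread spins in a progress loop until it completes. Build with `cc -pthread`.

```c
/* Standalone illustration of the pattern described above: a blocking
 * PMI "fence" is pushed into a helper thread (standing in for the OPAL
 * async event thread) while the "MPI thread" loops on a progress stub
 * until the fence completes. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static volatile bool fence_active = true;   /* mirrors OMPI's volatile-flag style */

static void *blocking_fence(void *arg)
{
    (void)arg;
    sleep(1);                 /* stand-in for a blocking PMI fence/barrier */
    fence_active = false;     /* signal completion to the waiting thread */
    return NULL;
}

static void fake_progress(void)
{
    /* stand-in for opal_progress(): drive outstanding MPI traffic */
}

int main(void)
{
    pthread_t tid;

    if (0 != pthread_create(&tid, NULL, blocking_fence, NULL)) {
        return 1;
    }

    /* lazy wait: keep progressing while the fence runs in the helper thread */
    while (fence_active) {
        fake_progress();
        usleep(100);
    }

    pthread_join(tid, NULL);
    printf("fence complete, leaving the RTE barrier\n");
    return 0;
}
```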
okay, i'll cherry-pick 01ba861 on top of master and see what happens. |
okay, fixing first problem with PR #1672 |
this works for me with #1672 applied. I'll merge that PR after CI checks complete. |
Thanks! |