When direct launching applications, we must allow the MPI layer to pr… #1668

Merged

rhc54 merged 1 commit into open-mpi:master on May 16, 2016

Conversation

rhc54
Contributor

@rhc54 rhc54 commented May 12, 2016

…ogress during RTE-level barriers. Neither SLURM nor Cray provides non-blocking fence functions, so push those calls into a separate event thread (use the OPAL async thread for this purpose so we don't create another one) and let the MPI thread spin in wait_for_completion. This also restores the "lazy" completion during MPI_Finalize to minimize CPU utilization.
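
For reference, a minimal sketch of the two wait styles referred to above (not the verbatim Open MPI macros, just the idea; opal_progress() is the real OPAL progress entry point, everything else is illustrative):

#include <stdbool.h>
#include <unistd.h>

extern void opal_progress(void);      /* drives OPAL/MPI progress */

/* hot wait: spin on progress so MPI traffic keeps moving */
static inline void wait_for_completion(volatile bool *active)
{
    while (*active) {
        opal_progress();
    }
}

/* lazy wait: same loop plus a short sleep, used during MPI_Finalize
 * so an idle rank does not burn a full core while waiting */
static inline void lazy_wait_for_completion(volatile bool *active)
{
    while (*active) {
        opal_progress();
        usleep(100);
    }
}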

@rhc54
Contributor Author

rhc54 commented May 12, 2016

@hppritcha Please take a look at the Cray support - I think this is applicable there as well, but I can't verify it.

@jladd-mlnx
Member

@rhc54 I've been watching MLNX Jenkins on most PRs and this error is new and looks to be related to your change. It's coming from PMIx.

@rhc54
Contributor Author

rhc54 commented May 13, 2016

Agreed - that's why I just pushed an update 😄

@rhc54
Contributor Author

rhc54 commented May 13, 2016

@sjeaugey @bosilca This is an alternative solution to 7373111 that I believe is more efficient and restores the "lazy" barrier behavior during finalize

@sjeaugey
Member

Did you mean @jsquyres ?

@rhc54
Contributor Author

rhc54 commented May 13, 2016

sorry, no - should have been @nysal

@bosilca
Member

bosilca commented May 14, 2016

I like the comments in ompi/runtime/ompi_mpi_finalize.c (especially the one for the nb case). Except for the fact that the MPI barrier has been removed, I fail to see any fundamental difference between the current and the newly proposed solution. I might have missed something, but why is the MPI barrier unnecessary?

@rhc54
Contributor Author

rhc54 commented May 14, 2016

Sorry about that - I'll restore the comments explaining the cases. Looking this over again, I do believe I made a mistake when bringing this back across from the 1.10 series. We do still need the MPI_Barrier in those cases where the fence is blocking during finalize. The only change there should have been to use OMPI_LAZY_WAIT_FOR_COMPLETION.

The change in MPI_Init was required because some transports need to progress even during the modex, and especially during the barrier at the end of MPI_Init. I couldn't find a nice way to do that in macro form, though I'm still looking at it.
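
A hypothetical shape for such a macro (all names are illustrative and the fence_nb/callback signatures are approximate): issue the RTE fence non-blocking, then keep the MPI thread calling opal_progress() until the callback releases it, so transports get progress during the modex and the init barrier:

/* assumed callback shape: releases the flag passed through cbdata */
static void fence_release_cb(int status, void *cbdata)
{
    *(volatile bool *)cbdata = false;
}

/* hypothetical macro: keep the calling (MPI) thread progressing until
 * the non-blocking fence completes */
#define OMPI_PROGRESSED_WAIT(flag)     \
    do {                               \
        while (flag) {                 \
            opal_progress();           \
        }                              \
    } while (0)

/* usage sketch inside MPI_Init (fence_nb arguments approximate) */
volatile bool fence_active = true;
opal_pmix.fence_nb(NULL, 0, fence_release_cb, (void *)&fence_active);
OMPI_PROGRESSED_WAIT(fence_active);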

@rhc54
Contributor Author

rhc54 commented May 14, 2016

@bosilca I have updated this per the comments. The primary change here is that I made the fences in the SLURM and Cray pmix components non-blocking, thus allowing us to make progress while we are waiting for the fence to complete. As a result, the MPI barrier in MPI_Finalize will never actually be executed with today's opal/pmix components.

However, I left it there in case someone someday adds another opal/pmix component that doesn't have a non-blocking fence. I doubt that will happen, but...one never knows.
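
Roughly the finalize-time logic being described, in simplified form (argument lists are approximate and the MPI-level barrier is left as a placeholder comment):

volatile bool active = true;

if (NULL != opal_pmix.fence_nb) {
    /* non-blocking fence: lazily progress MPI until the callback fires */
    opal_pmix.fence_nb(NULL, 0, fence_release_cb, (void *)&active);
    OMPI_LAZY_WAIT_FOR_COMPLETION(active);
} else {
    /* a blocking fence cannot progress the MPI layer, so drain any
     * outstanding MPI traffic with an MPI-level barrier first */
    /* ... MPI_Barrier across all connected processes ... */
    opal_pmix.fence(NULL, 0);
}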

One question I still have: in MPI_Init, when we do the modex, we still make everyone wait at that place until the fence completes (unless they specify async_modex, in which case we just fall thru). Given a non-blocking fence, we could allow the process to continue executing steps in MPI_Init until it reached a point where the fence absolutely must have completed - at that point, we could introduce the "wait_for_complete". In many cases, the fence will already have completed by that time, and so we won't wait. But if it hadn't completed, we could wait there.

Are there any steps in MPI_Init after modex and before the next fence where it would make sense to allow a process to continue? Given that we have the ability to fall-thru when async_modex is provided, I'm wondering if this is possible.
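
A hypothetical sketch of that overlap (the helper functions are invented for illustration): start the modex fence non-blocking, run the init steps that need no peer data, and only wait at the first point that consumes modex data:

volatile bool modex_active = true;

opal_pmix.fence_nb(NULL, 0, fence_release_cb, (void *)&modex_active);

init_local_frameworks();          /* hypothetical: steps needing no remote info */

/* first point that truly requires the modex to have completed */
while (modex_active) {
    opal_progress();
}
select_transports_from_modex();   /* hypothetical: now safe to read peer data */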

…ogress during RTE-level barriers. Neither SLURM nor Cray provides non-blocking fence functions, so push those calls into a separate event thread (use the OPAL async thread for this purpose so we don't create another one) and let the MPI thread spin in wait_for_completion. This also restores the "lazy" completion during MPI_Finalize to minimize CPU utilization.

Update external as well

Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro
@hppritcha
Member

bot:retest

@hppritcha
Member

Given this change, which thread is going to run the fence operation? This seems like it could have a significant impact on performance of the fence operation, esp. at scale.

@hppritcha
Member

👎
doesn't work for me.
Here's output of running ring_c.c on 1 process using srun launch. OMPI_MCA_pmix_base_verbose=10:

n17276@tiger:/cray/css/users/n17276/ompi/examples> (topic/slurm *)srun -n 1 ./ring_c
srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
[nid00008:04911] mca: base: components_register: registering framework pmix components
[nid00008:04911] mca: base: components_register: found loaded component cray
[nid00008:04911] mca: base: components_register: component cray has no register or open function
[nid00008:04911] mca: base: components_open: opening pmix components
[nid00008:04911] mca: base: components_open: found loaded component cray
[nid00008:04911] mca: base: components_open: component cray open function successful
[nid00008:04911] mca:base:select: Auto-selecting pmix components
[nid00008:04911] mca:base:select:( pmix) Querying component [cray]
[nid00008:04911] mca:base:select:( pmix) Query of component [cray] set priority to 90
[nid00008:04911] mca:base:select:( pmix) Selected component [cray]
[nid00008:04911] [[37225,0],0] pmix:cray: assigned tmp name -1855389696 0 pmix_kvs_name kvs_20002279785
[nid00008:04911] [[37225,0],0] pmix:hash:store storing data for proc [[37225,0],0]
[nid00008:04911] [[37225,0],0] pmix:hash:store storing data for proc [[37225,0],0]
[nid00008:04911] [[37225,0],0] pmix:hash:store storing data for proc [[37225,0],0]
[nid00008:04911] [[37225,0],0] pmix:hash:store storing data for proc [[37225,0],0]
[nid00008:04911] [[37225,0],0] pmix:hash:store storing data for proc [[37225,0],0]
[nid00008:04911] [[37225,0],0] pmix:hash:store storing data for proc [[37225,0],0]
[nid00008:04911] [[37225,0],0] pmix:hash:store storing data for proc [[37225,0],0]
[nid00008:04911] [[37225,0],0] pmix:hash:store storing data for proc [[37225,0],0]
[nid00008:04911] [[37225,0],0] pmix:hash:store storing data for proc [[37225,0],0]
[nid00008:04911] [[37225,0],0] pmix:cray getting value for proc [[37225,0],0] key pmix.lrank
[nid00008:04911] [[37225,0],0] pmix:cray getting value for proc [[37225,0],0] key pmix.nrank
[nid00008:04911] [[37225,0],0] pmix:cray getting value for proc [[37225,0],0] key pmix.max.size
[nid00008:04911] [[37225,0],0] pmix:cray fetch from dstore failed: -13
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting max procs failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

@hppritcha
Member

I'm confused as to the problem we're trying to address here. Things were working fine for Cray srun/aprun MPI_Init prior to this PR. Note there's an issue in finalize with oob shutdown, but that's different and is being tracked by issue #1527.

@rhc54
Contributor Author

rhc54 commented May 16, 2016

The failure you hit is due to the Cray pmi component not being updated during a prior change - I can fix that here. Has nothing to do with this change.

The issue that was resolved in master and subsequently moved to 2.0 occurred because the fence during finalize did not progress the MPI layer. Thus, we needed a way to ensure that all MPI comm was completed prior to exiting the RTE barrier in finalize. We added an MPI_Barrier to do that, but that means we are doing two barriers in finalize.

This change pushes the RTE barrier into the async progress thread that ORTE is running, and then lets the MPI layer loop on opal_progress while waiting for the RTE barrier to complete. It has zero impact on scalability - the PMIx component will allow things to progress asynchronously, while the Cray and SLURM components will immediately block the ORTE thread in their PMI calls.

So we get the required behavior without introducing a second barrier operation.
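
A sketch of that mechanism with illustrative names: blocking_pmi_fence() stands in for the component's PMI fence call, push_to_async_thread() for queueing an event on the existing OPAL/ORTE async thread, and the handler signature follows the usual libevent style:

#include <stdbool.h>

extern void opal_progress(void);                 /* real OPAL progress entry point */
extern void blocking_pmi_fence(void);            /* stand-in for the SLURM/Cray PMI fence */
extern void push_to_async_thread(void (*handler)(int, short, void *));  /* stand-in */

static volatile bool fence_active;

/* runs in the async event thread, so blocking here is harmless */
static void fence_event_handler(int fd, short flags, void *cbdata)
{
    blocking_pmi_fence();        /* blocks this thread, not the MPI thread */
    fence_active = false;        /* release the spinner below */
}

static void rte_barrier(void)
{
    fence_active = true;
    push_to_async_thread(fence_event_handler);
    while (fence_active) {
        opal_progress();         /* MPI traffic keeps moving while we wait */
    }
}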

@hppritcha
Member

okay, i'll cherry-pick 01ba861 on top of master and see what happens.

@hppritcha
Member

okay, fixing first problem with PR #1672

@hppritcha
Member

this works for me with #1672 applied. I'll merge that PR after CI checks complete.

@rhc54
Contributor Author

rhc54 commented May 16, 2016

Thanks!

@rhc54 rhc54 merged commit 8b534e9 into open-mpi:master May 16, 2016
@rhc54 rhc54 deleted the topic/slurm branch May 16, 2016 19:23