When direct launching applications, we must allow the MPI layer to pr… #1668
Conversation
@hppritcha Please take a look at the Cray support - I think this is applicable there as well, but I can't verify it.
@rhc54 I've been watching MLNX Jenkins on most PRs and this error is new and looks to be related to your change. It's coming from PMIx.
Agreed - that's why I just pushed an update 😄
Did you mean @jsquyres?
sorry, no - should have been @nysal
I like the comments in ompi/runtime/ompi_mpi_finalize.c (especially the one for the nb case). Except for the fact that the MPI barrier has been removed, I fail to see any fundamental difference between the current solution and the newly proposed one. I might have missed something, but why is the MPI barrier unnecessary?
Sorry about that - I'll restore the comments explaining the cases. Looking this over again, I do believe I made a mistake when bringing this back across from the 1.10 series. We do still need the MPI_Barrier in those cases where the fence is blocking during finalize. The only change there should have been to use OMPI_LAZY_WAIT_FOR_COMPLETION. The change in MPI_Init was required because some transports need to progress even during the modex, and especially during the barrier at the end of MPI_Init. I couldn't find a nice way to do that in macro form, though I'm still looking at it.
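For readers following the OMPI_LAZY_WAIT_FOR_COMPLETION reference: the lazy wait amounts to a polling loop around opal_progress. A minimal sketch of the pattern, assuming a volatile flag that the completion callback clears (the exact macro body in the tree may differ):

```c
/* Sketch of the lazy-wait pattern; not the exact OMPI macro.
 * "flg" is a volatile flag cleared by the completion callback,
 * and opal_progress() is OMPI's existing progress engine. */
#include <unistd.h>   /* usleep */

#define LAZY_WAIT_FOR_COMPLETION(flg)                               \
    do {                                                            \
        while (flg) {                                               \
            opal_progress();  /* keep MPI/BTL traffic moving */     \
            usleep(100);      /* back off to keep cpu usage low */  \
        }                                                           \
    } while (0)
```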
@bosilca I have updated this per the comments. The primary change here is that I made the fence in the SLURM and Cray pmix components non-blocking, thus allowing us to make progress while we are waiting for the fence to complete. Thus, the MPI barrier in MPI_Finalize will never actually be executed with today's opal/pmix components. However, I left it there in case someone someday adds another opal/pmix component that doesn't have a non-blocking fence. I doubt that will happen, but...one never knows.

One question I still have: in MPI_Init, when we do the modex, we still make everyone wait at that place until the fence completes (unless they specify async_modex, in which case we just fall thru). Given a non-blocking fence, we could allow the process to continue executing steps in MPI_Init until it reached a point where the fence absolutely must have completed - at that point, we could introduce the "wait_for_complete". In many cases, the fence will already have completed by that time, and so we won't wait. But if it hadn't completed, we could wait there. Are there any steps in MPI_Init after modex and before the next fence where it would make sense to allow a process to continue? Given that we have the ability to fall-thru when async_modex is provided, I'm wondering if this is possible.
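In rough pseudo-C, the finalize path described here would look like the following. Note that fence_nb(), fence(), have_fence_nb, and fence_release() are illustrative stand-ins rather than the exact symbols in ompi_mpi_finalize.c, and the real code invokes the communicator's coll barrier rather than calling MPI_Barrier() directly:

```c
/* Illustrative sketch of the MPI_Finalize flow described above;
 * fence_nb()/fence() stand in for the opal/pmix component hooks. */
static volatile bool fence_active;

static void fence_release(int status, void *cbdata)
{
    (void)status; (void)cbdata;
    fence_active = false;         /* completion callback from the RTE fence */
}

static void rte_finalize_barrier(void)
{
    fence_active = true;
    if (have_fence_nb) {          /* component provides a non-blocking fence */
        fence_nb(fence_release, NULL);
        while (fence_active) {    /* lazy wait, as sketched above */
            opal_progress();      /* MPI traffic keeps draining here, so no
                                   * extra MPI barrier is required */
            usleep(100);
        }
    } else {
        /* a blocking fence cannot progress the MPI layer, so complete all
         * MPI communication first, then enter the RTE barrier */
        MPI_Barrier(MPI_COMM_WORLD);
        fence();
    }
}
```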
…ogress during RTE-level barriers. Neither SLURM nor Cray provide non-blocking fence functions, so push those calls into a separate event thread (use the OPAL async thread for this purpose so we don't create another one) and let the MPI thread spin in wait_for_completion. This also restores the "lazy" completion during MPI_Finalize to minimize cpu utilization.

Update external as well.

Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro.
bot:retest
Given this change, which thread is going to run the fence operation? This seems like it could have a significant impact on performance of the fence operation, esp. at scale.
👎
I'm confused as to the problem we're trying to address here. Things were working fine for Cray srun/aprun MPI_Init prior to this PR. Note there's an issue in finalize with oob shutdown, but that's different and is being tracked by issue #1527.
The failure you hit is due to the Cray pmi component not being updated during a prior change - I can fix that here. It has nothing to do with this change.

The issue that was resolved in master and subsequently moved to 2.0 occurred because the fence during finalize did not progress the MPI layer. Thus, we needed a way to ensure that all MPI communication was completed prior to exiting the RTE barrier in finalize. We added an MPI_Barrier to do that, but that means we are doing two barriers in finalize. This change pushes the RTE barrier into the async progress thread that ORTE is running, and then lets the MPI layer loop on opal_progress while waiting for the RTE barrier to complete. It has zero impact on scalability - the PMIx component will allow things to progress asynchronously, while the Cray and SLURM components will immediately block the ORTE thread in their PMI calls. So we get the required behavior without introducing a second barrier operation.
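To make the mechanism concrete, here is a small standalone illustration of the pattern (plain pthreads rather than the OPAL event machinery, and all names are stand-ins): the blocking fence runs in a helper thread while the main thread spins in a progress loop until it completes. Build with `cc -pthread`.

```c
/* Standalone illustration of the pattern described above: a blocking
 * PMI "fence" is pushed into a helper thread (standing in for the OPAL
 * async event thread) while the "MPI thread" loops on a progress stub
 * until the fence completes. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static volatile bool fence_active = true;   /* mirrors OMPI's volatile-flag style */

static void *blocking_fence(void *arg)
{
    (void)arg;
    sleep(1);                 /* stand-in for a blocking PMI fence/barrier */
    fence_active = false;     /* signal completion to the waiting thread */
    return NULL;
}

static void fake_progress(void)
{
    /* stand-in for opal_progress(): drive outstanding MPI traffic */
}

int main(void)
{
    pthread_t tid;

    if (0 != pthread_create(&tid, NULL, blocking_fence, NULL)) {
        return 1;
    }

    /* lazy wait: keep progressing while the fence runs in the helper thread */
    while (fence_active) {
        fake_progress();
        usleep(100);
    }

    pthread_join(tid, NULL);
    printf("fence complete, leaving the RTE barrier\n");
    return 0;
}
```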
okay, i'll cherry-pick 01ba861 on top of master and see what happens. |
okay, fixing first problem with PR #1672 |
this works for me with #1672 applied. I'll merge that PR after CI checks complete. |
Thanks! |