@rhc54 rhc54 commented Jan 21, 2021

As currently written in this release branch, the PML selection "check" logic does not guarantee that, when a full modex has been performed, the caller's PML choice is checked only against that of MPI_COMM_WORLD rank 0. As a result, every process can end up calling "dmodex" to obtain the PML selection of every other process in the job, which causes a major wireup delay on the first call to communicate.

These cherry-picks contain the updates developed/committed to master after the code in this release branch was brought over to it. One additional cherry-pick was required to cleanly port the code.
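
To give a rough sense of the difference in scale, here is a minimal sketch that counts remote PML-selection lookups under the two behaviors described above. It is only a counting model and not Open MPI code: the function name `pml_check_lookups` and its parameters are hypothetical, and it assumes one lookup per (caller, peer) pair.

```c
#include <stddef.h>

/* Hypothetical counting model, NOT Open MPI code.
 * Returns the number of remote PML-selection lookups performed by a job
 * of nprocs processes, assuming one lookup per (caller, peer) pair. */
size_t pml_check_lookups(size_t nprocs, int check_all_peers)
{
    if (nprocs < 2) {
        return 0;   /* a single process has nothing to compare against */
    }
    if (check_all_peers) {
        /* every process fetches the selection of every other process */
        return nprocs * (nprocs - 1);
    }
    /* every process except rank 0 fetches only rank 0's selection */
    return nprocs - 1;
}
```

Under this model, a 1024-process job performs 1047552 lookups when every peer is checked, versus 1023 when only rank 0 is consulted.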

bosilca and others added 3 commits January 20, 2021 19:43
With this patch the best PML is selected earlier, before finalizing
the other PMLs. This provides a simpler mechanism to intercept and
hijack the PML (as done in the monitoring PML).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 668aa15)
(cherry picked from commit 65fbffa)
For direct modex, all procs publish their selected pml module,
and then at add_procs the pml module for each proc is checked
against every other proc in the add_procs call.
For full modex, there is no change in functionality: only Rank0
publishes its selected pml, and all other procs in the add_procs call
check their selected pml against Rank0.
If the pmls do not match, throw an error and exit.

Signed-off-by: Dipti Kothari <dkothar@amazon.com>
(cherry picked from commit 5418cc5)
(cherry picked from commit 5de4423)
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 56eb572)
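
The check described in the second commit message above (direct modex versus full modex) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from these commits: the function `pml_selection_matches`, the `published` array, and the `direct_modex` flag are all hypothetical, and treating an unpublished entry as a mismatch in the direct-modex case is an assumption made here for simplicity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, NOT the Open MPI implementation.
 * published[i] is the PML name that proc i published, or NULL if it
 * published nothing. Returns true when `mine` agrees with every entry
 * that has to be checked for the given modex mode. */
bool pml_selection_matches(const char *mine,
                           const char *const *published,
                           size_t nprocs,
                           bool direct_modex)
{
    if (direct_modex) {
        /* direct modex: every proc publishes, so compare against all */
        for (size_t i = 0; i < nprocs; i++) {
            /* assumption: a missing entry counts as a mismatch */
            if (published[i] == NULL || strcmp(mine, published[i]) != 0) {
                return false;
            }
        }
        return true;
    }
    /* full modex: only rank 0 publishes, so only entry 0 is consulted */
    return nprocs > 0 && published[0] != NULL
           && strcmp(mine, published[0]) == 0;
}
```

In this sketch a caller that gets `false` back would report the mismatch and abort, matching the "throw error and exit" behavior in the commit message. Note that in the full-modex branch the entries of the other procs are never read, which is the property the PR description is concerned with.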
@rhc54 rhc54 added this to the v4.0.6 milestone Jan 21, 2021
@rhc54 rhc54 requested review from bosilca and bwbarrett January 21, 2021 03:45
@rhc54 rhc54 self-assigned this Jan 21, 2021
@rhc54 rhc54 marked this pull request as draft January 21, 2021 03:45

rhc54 commented Jan 21, 2021

I have moved this to "draft" status because I believe we need to revisit the PML selection check scheme. Please see #8404 (comment) for an explanation.

@rhc54 rhc54 marked this pull request as ready for review January 21, 2021 16:33

rhc54 commented Jan 21, 2021

After conversation, this is good to go!


rhc54 commented Jan 21, 2021

bot:aws:retest

@hppritcha hppritcha merged commit c3fe37d into open-mpi:v4.0.x Jan 26, 2021
@hoopoepg hoopoepg mentioned this pull request Feb 2, 2021
@rhc54 rhc54 deleted the cmr40/pml branch March 18, 2021 14:18