Skip to content

opal/ofi: follow up fixes to package rank calculation #11710

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 24, 2023

Conversation

wenduwan
Copy link
Contributor

We observed new issues caused by the previous change #11693 when the process is unbound:

  1. Excessive error logging
  2. Inconsistent package rank - there was a bug that caused unbound process to be assigned a false package rank. We should return the local rank instead.

wenduwan added 2 commits May 22, 2023 23:27
Make error log less verbose

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
The previous change caused inconsistent package rank when the process is
unbound - two processes could return the same package rank.
If the process is unbound,  OPAL_MODEX_RECV_VALUE_OPTIONAL
throws an error, in which case and we should fallback to the process id
instead of attempting to return a local package rank.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@wenduwan wenduwan force-pushed the fix_package_rank branch from 55b489f to 1a1b84a Compare May 22, 2023 23:28
@wenduwan wenduwan added the bug label May 22, 2023
@lrbison
Copy link
Contributor

lrbison commented May 24, 2023

Code change looks good. What's the best test for this @wenduwan ?

@wenduwan
Copy link
Contributor Author

What's the best test for this

@lrbison We want to test mpi init with --bind-to none vs --bind-to core when OMPI_MCA_mtl_ofi_verbose=1 is set. We are looking for the logging verbosity.

I've tested this with osu_bw - just need to check the mpi init log. The patch gets rid of the help-common-ofi message, which is our goal.

There is another fix for a bug introduced by Amir's earlier change #11711 which will further reduce the warnings.

@lrbison
Copy link
Contributor

lrbison commented May 24, 2023

Thank you! Looks good to me.

@wenduwan
Copy link
Contributor Author

@lrbison Shall we merge?

@lrbison lrbison merged commit 18a7064 into open-mpi:main May 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants