Update PRRTe and PMIx submodule pointers. #10611
1e5ed79 to 02676d6
|
Small snag in |
02676d6 to 2d6346f
|
@wckzhang @wzamazon I can't seem to reproduce the failures AWS CI is hitting: Would you be able to take a look? |
|
Should we expand this error message (and I know this would be a PRRTE thing, not an OMPI thing) to say why it didn't have enough slots? We get people with this error message periodically, and then it's always an issue to chase down why they ran out of slots. It might be useful to include some more local information in the error message, such as:
|
|
@awlauria we (AWS) don't actually manage the CI that runs in AWS, just provide the compute resources. The CI is managed by the OMPI community. Any community member should be able to debug/work on the CI, just talk to Jeff or Brian to get access/information |
|
@jsquyres I agree that would be ideal, and would be a nice enhancement for these situations. For now though, do you have any ideas on how to debug this? I'm unable to reproduce locally, and it is such a trivial case (-np 2) that I find it hard to believe there were no slots available. It must be a prrte bug, and my only thought is that it is using a different bind/mapping protocol on these machines than the default. But it isn't clear to me what it is using. |
|
First step is always to add Remember, I won't see any response unless you specifically flag me. |
|
Thanks, can someone with access take a look to see what's going on? |
|
One possibility that occurs to me: is this a node whose topology doesn't include cores? I know there are setups that only report PUs via hwloc. If this is such a setup, then it could fail as the default map/bind combination is by-core. I can add some logic to handle that case, but it would be good to confirm the situation first. |
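To make the scenario above concrete, here is a minimal sketch in Python of the fallback being discussed. It is a toy model, not PRRTE code: the function names and the dictionary of per-node object counts are hypothetical stand-ins for whatever hwloc reports on the node, and the real mapper considers far more than this.

```python
from typing import Mapping


def choose_default_unit(object_counts: Mapping[str, int]) -> str:
    """Pick the unit to map/bind by for a small job (toy model).

    `object_counts` is a hypothetical summary of a node's topology,
    e.g. {"core": 4, "pu": 8}. Prefer cores; fall back to PUs when
    the topology reports no core objects at all.
    """
    if object_counts.get("core", 0) > 0:
        return "core"
    if object_counts.get("pu", 0) > 0:
        return "pu"
    raise ValueError("topology reports neither cores nor PUs")


def fits_without_overload(object_counts: Mapping[str, int], nprocs: int) -> bool:
    """Return True if `nprocs` processes each get their own unit."""
    unit = choose_default_unit(object_counts)
    return nprocs <= object_counts[unit]
```

In this model a node that only reports PUs, such as `{"core": 0, "pu": 4}`, falls back to `"pu"` and can host 2 processes, while a node with a single core, such as `{"core": 1, "pu": 2}`, cannot host 2 processes by-core without overloading.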
|
hmm, did that default change in this update? @rhc54 |
|
Not really - for 2 procs, it was always map/bind to core if supported. Only question is if it is working, and if not, then why |
|
Just to update folks: it appears that GitHub will not send me notifications regarding OMPI issues or PRs regardless of how I set the "watch" flags, or even if you directly flag me on a comment. So all I can suggest is that you ping me on the PMIx Slack channel, send me an email, or text my phone so I know there is something you would like me to see. |
|
@jsquyres @bwbarrett in the case where the user does something like:
and the machine only has one physical core, or any case where the This PR brings in changes that eliminate that, and makes the user either add I don't know the history enough to know if the current setup is a bug or a feature, and on the surface this change seems like an improvement. Any thoughts? |
|
I'm generally against breaking things that have worked for at least the last 4 years, but I'm also not sure I should be the sole voice in that decision. |
|
@awlauria, how did you test these changes? I'm getting a segfault with this PR checked out - Segfaulting in hwloc_get_cpubind. |
|
I'm just running mpirun /bin/true |
|
I can't reproduce with my local install, though I am running a debug install... Have you tried a |
|
It looks like the original problem was reported upstream by @awlauria in openpmix/prrte#1395. @wzamazon If you're running into a new problem (segv's), that should probably be reported upstream, too. |
5409895 to 1fbff85
|
Updated to the latest prrte/master, which brings in the fix here: openpmix/prrte#1397 |
|
FYI: found a different problem with rankfile and seq mappers - see openpmix/prrte#1398 |
Committed - probably worth updating this to pick it up. |
1fbff85 to 4896db1
Done. Thanks |
PMIx commits since last update:
41f4225 - Fix IOF of stdin
7f458b5 - Update the dmodex example
518dc6a - Stop multiple invocations of debugger-release
7b741c4 - src/include/Makefile.am: avoid potential file corruption
e78b8e5 - construct_dictionary.py: make .format() safe for Python 2
627731e - Cleanup some debug output
7931d87 - Add "const" qualifiers to some string print APIs
534cf08 - Fix potential use after free in tests
d94a533 - Properly cast the list_item_t
1318c07 - Cleanup the pnet/sshot code for picky compilers
60b45ef - Fix dmodex operations
eda31e0 - Add CPPFLAGS for pnet/sshot component
0daf8e2 - Fix show_help output to include tools
c9e3f09 - Fix PMIX_INFO_*PROCESSED macros
b46e350 - Update show-help system
735b29a - Remove bad destruct call
8352b86 - Enable pickyness by default in Git repo builds (openpmix/openpmix#2631)
03a8194 - Hide unused function
b83c97f - Return "succeeded" status when outputting help/version info
PRRTe commits since last update:
0b580da7c8 - Ensure rankfile and seq mappers computer local and app ranks
6ef02ea7a1 - Allow mapping in overload scenario if bind not specified.
c385f74f35 - Add forwarding of stdin to indirect example
f3d4089236 - Change the default mapping for --bind-to none option to BYSLOT.
a252745b99 - Handle clean shutdown of stdin
090898fe4f - Fix stdin forwarding across nodes
15977f6ecf - Update the dmodex example
550897001d - Return the PMIx version of "not supported"
4b41c8e0a3 - Fix resource usage tracking for map/bind operations
ccd3bafcbd - Revert debug commits
c7dd6bbecb - REVERT ME
ed79411103 - REVERT ME - DEBUG FOR PMIX-TESTS
06254c35d9 - Return zero status when outputting help/version info
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>