
Conversation

@awlauria
Contributor

@awlauria commented Jul 26, 2022

PMIx commits since last update:

41f4225 - Fix IOF of stdin
7f458b5 - Update the dmodex example
518dc6a - Stop multiple invocations of debugger-release
7b741c4 - src/include/Makefile.am: avoid potential file corruption
e78b8e5 - construct_dictionary.py: make .format() safe for Python 2
627731e - Cleanup some debug output
7931d87 - Add "const" qualifiers to some string print APIs
534cf08 - Fix potential use after free in tests
d94a533 - Properly cast the list_item_t
1318c07 - Cleanup the pnet/sshot code for picky compilers
60b45ef - Fix dmodex operations
eda31e0 - Add CPPFLAGS for pnet/sshot component
0daf8e2 - Fix show_help output to include tools
c9e3f09 - Fix PMIX_INFO_*PROCESSED macros
b46e350 - Update show-help system
735b29a - Remove bad destruct call
8352b86 - Enable pickyness by default in Git repo builds (openpmix/openpmix#2631)
03a8194 - Hide unused function
b83c97f - Return "succeeded" status when outputting help/version info

PRRTe commits since last update:

0b580da7c8 - Ensure rankfile and seq mappers computer local and app ranks
6ef02ea7a1 - Allow mapping in overload scenario if bind not specified.
c385f74f35 - Add forwarding of stdin to indirect example
f3d4089236 - Change the default mapping for --bind-to none option to BYSLOT.
a252745b99 - Handle clean shutdown of stdin
090898fe4f - Fix stdin forwarding across nodes
15977f6ecf - Update the dmodex example
550897001d - Return the PMIx version of "not supported"
4b41c8e0a3 - Fix resource usage tracking for map/bind operations
ccd3bafcbd - Revert debug commits
c7dd6bbecb - REVERT ME
ed79411103 - REVERT ME - DEBUG FOR PMIX-TESTS
06254c35d9 - Return zero status when outputting help/version info

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

@awlauria
Contributor Author

Small snag in --bind-to none CI tests - should be resolved by: openpmix/prrte#1390

@awlauria
Contributor Author

@wckzhang @wzamazon I can't seem to reproduce the failures AWS CI is hitting:

--> Running example: hello_c
--------------------------------------------------------------------------
Either there are not enough slots available in the system to launch
the 2 processes that were requested by the application, or there are
not enough CPUs to bind them as requested:

  App: ./examples/hello_c
  Mapping: BYCORE
  Binding: CORE

Either request fewer processes for your application, make more slots
available for use by expanding the allocation, or do not bind the
processes so that the number of CPUs is no longer a limiting factor.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch. Similarly, you can use the --bind-to :OVERLOAD option to bind
more than one process to a CPU, if desired, or --bind-to NONE to avoid
binding altogether.
--------------------------------------------------------------------------
Example failed: 213
Command was: timeout -s SIGSEGV 4m mpirun --get-stack-traces --timeout 180 --hostfile /home/ubuntu/workspace/open-mpi.build.compilers/Compiler/gcc10/hostfile -np 2  ./examples/hello_c
Build step 'Execute shell' marked build as failure
Finished: FAILURE

Would you be able to take a look?
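
For reference, the hostfile mechanism the error message describes looks like this; a sketch where the hostname and slot count are illustrative, not taken from this CI setup:

    $ cat hostfile
    node01 slots=1

    # Requesting 2 bound processes against a single slot reproduces the
    # error above; raising it to slots=2 (or adding --map-by :OVERSUBSCRIBE)
    # avoids it.
    $ mpirun --hostfile hostfile -np 2 ./examples/hello_c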

@jsquyres
Member

Should we expand this error message (and I know this would be a PRRTE thing, not an OMPI thing) to say why it didn't have enough slots?

We get people with this error message periodically, and then it's always an issue to chase down why they ran out of slots. It might be useful to include some more local information in the error message, such as:

  • how many slots are available on the local host
  • how many processes we tried to launch on the local host
  • indeed, if the number of hosts is under some threshold, maybe emit the entire list of hosts and how many slots were used on each host
  • where the slots came from: a hostfile, a SLURM allocation, a Torque allocation, ... etc. (i.e., perhaps just the name of the component that provided the slots)
  • ...any other information that we typically end up asking users for when they ask us about this error message

@wckzhang
Contributor

@awlauria we (AWS) don't actually manage the CI that runs in AWS; we just provide the compute resources. The CI is managed by the OMPI community. Any community member should be able to debug/work on the CI; just talk to Jeff or Brian to get access/information.

@awlauria
Contributor Author

@jsquyres I agree that would be ideal, and would be a nice enhancement for these situations.

For now, though, do you have any ideas on how to debug this? I'm unable to reproduce locally, and it is such a trivial case (-np 2) that I find it hard to believe there were no slots available. It must be a prrte bug; my only thought is that it is using a different mapping/binding policy on these machines than the default, but it isn't clear to me what it is using.

@rhc54
Contributor

rhc54 commented Jul 28, 2022

First step is always to add --display allocation to the cmd line so you can see what it thinks the allocation looks like. Second is to add --prtemca rmaps_base_verbose 5 to see what the mappers are doing.
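
Applied to the failing invocation from the CI log, that would look something like this (a sketch: the two flags are the ones named above, the rest is the command from the failure output):

    $ mpirun --display allocation --prtemca rmaps_base_verbose 5 \
        --hostfile /home/ubuntu/workspace/open-mpi.build.compilers/Compiler/gcc10/hostfile \
        -np 2 ./examples/hello_c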

Remember, I won't see any response unless you specifically flag me.

@awlauria
Contributor Author

Thanks. Can someone with access take a look to see what's going on?

@rhc54
Contributor

rhc54 commented Jul 28, 2022

One possibility that occurs to me: is this a node whose topology doesn't include cores? I know there are setups that only report PUs via hwloc. If this is such a setup, then it could fail as the default map/bind combination is by-core. I can add some logic to handle that case, but it would be good to confirm the situation first.
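
One way to confirm that hypothesis on the CI node is with the stock hwloc CLI (a sketch; nobody in this thread names these commands):

    # Dump the topology; if the output lists only PU objects and no
    # Core objects, the node matches the scenario described above.
    $ lstopo --no-io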

@awlauria
Contributor Author

Hmm, did that default change in this update? @rhc54

@rhc54
Contributor

rhc54 commented Jul 28, 2022

Not really - for 2 procs, it was always map/bind to core if supported. The only question is whether it is working, and if not, why.

@rhc54
Contributor

rhc54 commented Jul 30, 2022

Just to update folks: it appears that GitHub will not send me notifications regarding OMPI issues or PRs regardless of how I set the "watch" flags, or even if you directly flag me on a comment. So all I can suggest is that you ping me on the PMIx Slack channel, send me an email, or text my phone so I know there is something you would like me to see.

@awlauria
Contributor Author

awlauria commented Aug 1, 2022

@jsquyres @bwbarrett in the case where the user does something like:

localhost cpus=2

and the machine only has one physical core, or in any case where the cpus/slots=y value exceeds what is on the node, orte/prte on the v4.0.x and current main/v5.0.x branches will abandon binding entirely and launch the processes unbound. Is that the desired behavior?

This PR brings in changes that eliminate that behavior: the user must either add --oversubscribe, fix their hostfile to match the number of slots/cores on the system, or use hwthreads instead if those exceed the physical cores.

I don't know the history well enough to say whether the current setup is a bug or a feature; on the surface this change seems like an improvement. Any thoughts?
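
For concreteness, a sketch of the two behaviors on a hypothetical one-core machine, using the hostfile line quoted above:

    $ cat hostfile
    localhost cpus=2

    # v4.0.x / pre-PR main: both processes launch, silently unbound.
    # With this PR: the launch aborts unless the user opts in, e.g.:
    $ mpirun --hostfile hostfile --oversubscribe -np 2 ./examples/hello_c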

@bwbarrett
Member

I'm generally against breaking things that have worked for at least the last 4 years, but I'm also not sure I should be the sole voice in that decision.

@wckzhang
Contributor

wckzhang commented Aug 1, 2022

@awlauria, how did you test these changes? I'm getting a segfault with this PR checked out -

[ip-172-31-10-115:25694] [prterun-ip-172-31-10-115-25694@0,0] plm:ssh: final template argv:
	/usr/bin/ssh <template> PRTE_PREFIX=/home/ec2-user/install;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/ec2-user/install/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/ec2-user/install/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/ec2-user/install/bin/prted --leave-session-attached --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-10-115-25694@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "5" --prtemca prte_hnp_uri "prterun-ip-172-31-10-115-25694@0.0;tcp://127.0.0.1,172.31.10.115:57417:8,20" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-10-115-25694@0.0;tcp://127.0.0.1,172.31.10.115:57417:8,20"
Warning: Permanently added 'compute-st-c5n18xlarge-1,172.31.4.124' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-st-c5n18xlarge-4,172.31.12.127' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-st-c5n18xlarge-3,172.31.13.250' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-st-c5n18xlarge-2,172.31.12.27' (ECDSA) to the list of known hosts.
[ip-172-31-10-115:25694] ALIASES FOR NODE compute-st-c5n18xlarge-3 (compute-st-c5n18xlarge-3)
[ip-172-31-10-115:25694] 	ALIAS: 172.31.13.250
[ip-172-31-10-115:25694] ALIASES FOR NODE compute-st-c5n18xlarge-2 (compute-st-c5n18xlarge-2)
[ip-172-31-10-115:25694] 	ALIAS: 172.31.12.27
./test.sh: line 3: 25694 Segmentation fault      (core dumped) ~/install/bin/mpirun -np 144 --map-by ppr:36:node --output tag --leave-session-attached --prtemca plm_base_verbose 5 --rank-by slot --hostfile ~/hostfile /bin/true
(gdb) bt
#0  0x00007f1ef7b9c2d7 in hwloc_get_cpubind () from /lib64/libhwloc.so.5
#1  0x00007f1ef91b9ed1 in prte_hwloc_base_setup_summary () from /home/ec2-user/install/lib/libprrte.so.2
#2  0x00007f1ef92205a9 in prte_plm_base_daemon_callback () from /home/ec2-user/install/lib/libprrte.so.2
#3  0x00007f1ef91be756 in prte_rml_base_process_msg () from /home/ec2-user/install/lib/libprrte.so.2
#4  0x00007f1ef7fd63ad in event_base_loop () from /lib64/libevent_core-2.0.so.5
#5  0x0000000000404298 in main ()

Segfaulting in hwloc_get_cpubind.

@wckzhang
Contributor

wckzhang commented Aug 1, 2022

I'm just running mpirun /bin/true

@awlauria
Contributor Author

awlauria commented Aug 2, 2022

I can't reproduce with my local install, though I am running with a debug install... Have you tried an --enable-debug build? You can probably glean some more info from it.
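
For reference, a minimal sketch of such a rebuild (the prefix path is hypothetical):

    $ ./configure --prefix=$HOME/ompi-debug --enable-debug
    $ make -j 8 install
    # A debug build keeps assertions and symbols, so the backtrace above
    # would show file/line information instead of bare function names.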

@jsquyres
Member

jsquyres commented Aug 2, 2022

It looks like the original problem was reported upstream by @awlauria in openpmix/prrte#1395.

@wzamazon If you're running into a new problem (segv's), that should probably be reported upstream, too.

@wzamazon
Contributor

wzamazon commented Aug 2, 2022

@jsquyres I think you meant @wckzhang

@awlauria force-pushed the main_update_subs branch 2 times, most recently from 5409895 to 1fbff85 on August 2, 2022 18:18
@awlauria
Contributor Author

awlauria commented Aug 2, 2022

Updated to the latest prrte/master, which brings in the fix here: openpmix/prrte#1397

@rhc54
Contributor

rhc54 commented Aug 2, 2022

FYI: found a different problem with rankfile and seq mappers - see openpmix/prrte#1398

@rhc54
Contributor

rhc54 commented Aug 2, 2022

> FYI: found a different problem with rankfile and seq mappers - see openpmix/prrte#1398

Committed - probably worth updating this to pick it up.
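
For anyone following along, picking that up means bumping the PRRTe submodule pointer. A sketch, assuming the standard 3rd-party/prrte layout on main (the new SHA is a placeholder):

    $ cd 3rd-party/prrte
    $ git fetch origin && git checkout <new-prrte-sha>
    $ cd ../..
    $ git add 3rd-party/prrte
    $ git commit -s --amend    # then force-push the updated branch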

@awlauria merged commit 7b15c3d into open-mpi:main Aug 2, 2022
@awlauria deleted the main_update_subs branch August 2, 2022 21:01
@awlauria
Contributor Author

awlauria commented Aug 2, 2022

> FYI: found a different problem with rankfile and seq mappers - see openpmix/prrte#1398
>
> Committed - probably worth updating this to pick it up.

Done. Thanks
