Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.5 release tarball, configured with:

```
--with-sge --without-verbs --disable-builtin-atomics --with-libfabric=/opt/amazon/efa --enable-orterun-prefix-by-default
```
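(As a quick sanity check, the installed build's configure line can be confirmed with ompi_info, which prints it by default:

```
ompi_info | grep -i 'configure command'
```

)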
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
4.1.5 release tarball
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
N/A
Please describe the system on which you are running
- Operating system/version: Amazon Linux 2
- Computer hardware: EC2; head node c5.18xlarge (36 cores), compute nodes g4dn.12xlarge (24 cores)
- Network type: Elastic Fabric Adapter
Details of the problem
I encountered an ORTE error when running the following command from the head node (36 cores). The same command works when I run it from a compute node, though.

```
mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node hostname
```
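For reference, the hostfile simply lists the eight compute hosts, one per line. A sketch (the hostnames follow the node names visible in the logs below; the exact contents are abbreviated here):

```
queue-g4dn12xlarge-st-g4dn12xlarge-1
queue-g4dn12xlarge-st-g4dn12xlarge-2
...
queue-g4dn12xlarge-st-g4dn12xlarge-8
```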
The job fails with:

```
...
ORTE has lost communication with a remote daemon.
HNP daemon : [[32679,0],0] on node ip-172-31-45-184
Remote daemon: [[32679,0],1] on node queue-g4dn12xlarge-st-g4dn12xlarge-1
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
```
I manually verified two-way TCP traffic by SSHing from each compute node to the head node and did not see any issues.
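For completeness, the check was along these lines (a rough sketch; it assumes passwordless SSH and that the hostfile lists bare hostnames, one per line):

```
# From the head node: can we reach every compute node?
for h in $(cat host_file_with_8_hosts); do
  ssh "$h" hostname
done

# From each compute node: can we reach the head node back?
ssh ip-172-31-45-184 hostname
```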
However, I was able to work around the failure by adding --bind-to core or --bind-to socket.
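Concretely, the same command succeeds once a binding policy is given explicitly:

```
mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node --bind-to core hostname
```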
I got hints from this post and followed @rhc54's suggestion to add debugging output, providing these flags:

```
--mca orte_debug_daemons 1 --mca orte_odls_base_verbose 100 --mca orte_state_base_verbose 100 --mca oob_base_verbose 100 --mca rml_base_verbose 100
```

That finally turned up something interesting: this line failed.
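Putting it together, the full failing invocation with debugging enabled looks like this:

```
mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node \
  --mca orte_debug_daemons 1 \
  --mca orte_odls_base_verbose 100 \
  --mca orte_state_base_verbose 100 \
  --mca oob_base_verbose 100 \
  --mca rml_base_verbose 100 \
  hostname
```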
In the resulting log, the problematic daemon seemingly dies around here:
```
...
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] Message posted at grpcomm_direct.c:628 for tag 1
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] orted_cmd: received add_local_procs
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml:base:send_buffer_nb() to peer [[23399,0],0] through conduit 0
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] OOB_SEND: rml_oob_send.c:265
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] ACTIVATE JOB NULL STATE NEVER LAUNCHED AT base/odls_base_default_fns.c:827
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] ACTIVATE JOB NULL STATE FORCED EXIT AT errmgr_default_orted.c:256
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 50
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 51
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 6
...
```
However, on a successful run (with --bind-to core) the log is different:
```
...
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] Message posted at grpcomm_direct.c:628 for tag 1
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] orted_cmd: received add_local_procs
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] local:launch
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] odls:dispatch [[17660,1],5] to thread 0
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] odls:launch spawning child [[17660,1],5]
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454]
Env[62]: OMPI_MCA_orte_top_session_dir=/tmp/ompi.queue-g4dn12xlarge-st-g4dn12xlarge-6.1000
Env[63]: OMPI_MCA_orte_jobfam_session_dir=/tmp/ompi.queue-g4dn12xlarge-st-g4dn12xlarge-6.1000/jf.17660
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] ACTIVATE PROC [[17660,1],5] STATE RUNNING AT base/odls_base_default_fns.c:1052
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] ACTIVATE JOB [17660,1] STATE LOCAL LAUNCH COMPLETE AT state_orted.c:297
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] rml:base:send_buffer_nb() to peer [[17660,0],0] through conduit 0
...
```
I'm not familiar with the ORTE internals, so I would appreciate some pointers to understand the exact problem.
TIA!