
[4.1.5] ORTE has lost communication with a remote daemon #11830

Open
@wenduwan

Description


Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.5 release tarball, configured with:

configure_options --with-sge --without-verbs --disable-builtin-atomics --with-libfabric=/opt/amazon/efa --enable-orterun-prefix-by-default

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

4.1.5 release tarball

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2
  • Computer hardware: EC2. Head node: c5.18xlarge (36 cores), compute nodes: g4dn.12xlarge (24 cores)
  • Network type: Elastic Fabric Adapter

Details of the problem

Encountered an ORTE issue when running the command below from the head node (36 cores). It works when I run it from a compute node, though.

mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node hostname
...
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[32679,0],0] on node ip-172-31-45-184
  Remote daemon: [[32679,0],1] on node queue-g4dn12xlarge-st-g4dn12xlarge-1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

I manually verified two-way TCP connectivity by SSHing between the head node and each compute node. I did not see any issue.
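
Roughly what I ran to check connectivity (this sketch assumes the hostfile lists one bare hostname per line; the head node address is the one from the error message below):

# From the head node: ssh to each compute node, then ssh back to the head node from there
for host in $(cat host_file_with_8_hosts); do
    ssh "$host" "hostname; ssh ip-172-31-45-184 hostname"
done

Every hop completed without errors or password prompts.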

However, I was able to mitigate the problem by adding --bind-to core or --bind-to socket.
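
For example, this variant of the failing command above completes successfully:

mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node --bind-to core hostname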

I got hints from this post and followed @rhc54's suggestion to add debugging output, and finally found something interesting: this line failed.

To get that debug output, I had to provide these flags:

 --mca orte_debug_daemons 1 --mca orte_odls_base_verbose 100 --mca orte_state_base_verbose 100 --mca oob_base_verbose 100 --mca rml_base_verbose 100
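
i.e., the full debug command looked something like this (same hostfile as above):

mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node \
    --mca orte_debug_daemons 1 --mca orte_odls_base_verbose 100 \
    --mca orte_state_base_verbose 100 --mca oob_base_verbose 100 \
    --mca rml_base_verbose 100 hostname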

In the log, the problematic daemon seemingly died around here:

...
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] Message posted at grpcomm_direct.c:628 for tag 1
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] orted_cmd: received add_local_procs
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml:base:send_buffer_nb() to peer [[23399,0],0] through conduit 0
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] OOB_SEND: rml_oob_send.c:265
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] ACTIVATE JOB NULL STATE NEVER LAUNCHED AT base/odls_base_default_fns.c:827
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] ACTIVATE JOB NULL STATE FORCED EXIT AT errmgr_default_orted.c:256
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 50
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 51
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 6
...

However, on a successful run (with --bind-to core) the log is different:

...
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] Message posted at grpcomm_direct.c:628 for tag 1
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] orted_cmd: received add_local_procs
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] local:launch
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] odls:dispatch [[17660,1],5] to thread 0
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] odls:launch spawning child [[17660,1],5]
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] 
 	Env[62]: OMPI_MCA_orte_top_session_dir=/tmp/ompi.queue-g4dn12xlarge-st-g4dn12xlarge-6.1000
 	Env[63]: OMPI_MCA_orte_jobfam_session_dir=/tmp/ompi.queue-g4dn12xlarge-st-g4dn12xlarge-6.1000/jf.17660
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] ACTIVATE PROC [[17660,1],5] STATE RUNNING AT base/odls_base_default_fns.c:1052
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] ACTIVATE JOB [17660,1] STATE LOCAL LAUNCH COMPLETE AT state_orted.c:297
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] rml:base:send_buffer_nb() to peer [[17660,0],0] through conduit 0
...

I'm not familiar with the ORTE internals, and would appreciate some pointers to understand the exact problem.

TIA!
