Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.5 release tarball, configured with:

```
--with-sge --without-verbs --disable-builtin-atomics --with-libfabric=/opt/amazon/efa --enable-orterun-prefix-by-default
```
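(As a quick sanity check, the installed build's configure line can be confirmed with ompi_info, which prints it by default:

```
ompi_info | grep -i 'configure command'
```

)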
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
4.1.5 release tarball
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
N/A
Please describe the system on which you are running
- Operating system/version: Amazon Linux 2
- Computer hardware: EC2; head node c5.18xlarge (36 cores), compute nodes g4dn.12xlarge (24 cores)
- Network type: Elastic Fabric Adapter
Details of the problem
I encountered an ORTE error when running the following command from the head node (36 cores). The same command works when I run it from a compute node, though.

```
mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node hostname
```
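For reference, the hostfile simply lists the eight compute hosts, one per line. A sketch (the hostnames follow the node names visible in the logs below; the exact contents are abbreviated here):

```
queue-g4dn12xlarge-st-g4dn12xlarge-1
queue-g4dn12xlarge-st-g4dn12xlarge-2
...
queue-g4dn12xlarge-st-g4dn12xlarge-8
```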
The job fails with:

```
...
ORTE has lost communication with a remote daemon.
HNP daemon : [[32679,0],0] on node ip-172-31-45-184
Remote daemon: [[32679,0],1] on node queue-g4dn12xlarge-st-g4dn12xlarge-1
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
```
I manually verified two-way TCP traffic by SSHing from each compute node to the head node and did not see any issues.
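For completeness, the check was along these lines (a rough sketch; it assumes passwordless SSH and that the hostfile lists bare hostnames, one per line):

```
# From the head node: can we reach every compute node?
for h in $(cat host_file_with_8_hosts); do
  ssh "$h" hostname
done

# From each compute node: can we reach the head node back?
ssh ip-172-31-45-184 hostname
```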
However, I was able to work around the failure by adding --bind-to core or --bind-to socket.
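Concretely, the same command succeeds once a binding policy is given explicitly:

```
mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node --bind-to core hostname
```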
I got hints from this post and followed @rhc54's suggestion to add debugging output, providing these flags:

```
--mca orte_debug_daemons 1 --mca orte_odls_base_verbose 100 --mca orte_state_base_verbose 100 --mca oob_base_verbose 100 --mca rml_base_verbose 100
```

That finally turned up something interesting: this line failed.
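Putting it together, the full failing invocation with debugging enabled looks like this:

```
mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node \
  --mca orte_debug_daemons 1 \
  --mca orte_odls_base_verbose 100 \
  --mca orte_state_base_verbose 100 \
  --mca oob_base_verbose 100 \
  --mca rml_base_verbose 100 \
  hostname
```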
In the resulting log, the problematic daemon seemingly dies around here:
```
...
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] Message posted at grpcomm_direct.c:628 for tag 1
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] orted_cmd: received add_local_procs
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml:base:send_buffer_nb() to peer [[23399,0],0] through conduit 0
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] OOB_SEND: rml_oob_send.c:265
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] ACTIVATE JOB NULL STATE NEVER LAUNCHED AT base/odls_base_default_fns.c:827
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] ACTIVATE JOB NULL STATE FORCED EXIT AT errmgr_default_orted.c:256
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 50
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 51
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 6
...
```
However, on a successful run (with --bind-to core) the log is different:
```
...
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] Message posted at grpcomm_direct.c:628 for tag 1
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] orted_cmd: received add_local_procs
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] local:launch
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] odls:dispatch [[17660,1],5] to thread 0
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] odls:launch spawning child [[17660,1],5]
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454]
Env[62]: OMPI_MCA_orte_top_session_dir=/tmp/ompi.queue-g4dn12xlarge-st-g4dn12xlarge-6.1000
Env[63]: OMPI_MCA_orte_jobfam_session_dir=/tmp/ompi.queue-g4dn12xlarge-st-g4dn12xlarge-6.1000/jf.17660
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] ACTIVATE PROC [[17660,1],5] STATE RUNNING AT base/odls_base_default_fns.c:1052
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] ACTIVATE JOB [17660,1] STATE LOCAL LAUNCH COMPLETE AT state_orted.c:297
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] rml:base:send_buffer_nb() to peer [[17660,0],0] through conduit 0
...
```
I'm not familiar with the ORTE internals, so I would appreciate some pointers to understand the exact problem.
TIA!