Description
Background information
We have an OPA cluster of 288 nodes. All nodes run the same OS image, have passwordless ssh set up, and the firewall is disabled. We run basic OSU osu_mbw_mr tests on 2, 4, ... 86 nodes and the tests complete successfully. Once we hit 88+ nodes we get:
```
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[63011,0],0] on node r1i2n13
  Remote daemon: [[63011,0],40] on node r1i3n17

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:     r1i2n13
  target node: r1i2n14

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
```
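We have not tried the routed=direct workaround that the second message suggests; if it is relevant, I assume it would be passed like this (untested on our cluster):

```shell
# Untested: apply the routed=direct workaround suggested by the error message above.
n=88
mpirun -x PATH -x LD_LIBRARY_PATH --mca routed direct \
    -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```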
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
4.0.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Downloaded the 4.0.1 tarball from the Open MPI site and configured it with:

```shell
./configure --prefix=/store/dfaraj/SW/packages/ompi/4.0.1 CC=icc CXX=icpc FC=ifort --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --with-psm2=/usr --without-verbs --without-psm --without-knem --without-slurm --without-ucx
```
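The rest of the build and the runtime environment are the usual steps around that install prefix; shown here only for completeness, exact module/env setup on our side may differ slightly:

```shell
# Assumed build and runtime setup around the configure line above (illustrative).
make -j && make install
export PATH=/store/dfaraj/SW/packages/ompi/4.0.1/bin:$PATH
export LD_LIBRARY_PATH=/store/dfaraj/SW/packages/ompi/4.0.1/lib:$LD_LIBRARY_PATH
```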
Please describe the system on which you are running
- Operating system/version: RH 7.6
- Computer hardware: dual socket Xeon nodes
- Network type: OPA
Details of the problem
When we run:

```shell
n=86
mpirun -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```

it works fine.
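For reference, myhosts is just a flat list of node hostnames, one per line (the names below are illustrative; the real file lists all 288 nodes):

```
r1i2n13
r1i2n14
r1i3n17
...
```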
With:

```shell
n=88
mpirun -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```

we get the TCP error described earlier.
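If it helps the debugging, we can rerun the failing case with launcher verbosity turned up; I assume plm_base_verbose is the right knob to capture the daemon launch progress (output not yet collected):

```shell
# Assumed debug knob: dump the rsh/tree-spawn launch progress for the failing case.
n=88
mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_base_verbose 10 \
    -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```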
If I do:

```shell
n=88
mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```

it works.
If I set:

```shell
n=160
mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```

it appears to hang. I don't think it is actually hung, though; it is most likely ssh-ing to every node one at a time and just progressing very slowly.
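Since plm_rsh_no_tree_spawn=1 makes mpirun ssh to every node directly instead of fanning out through the daemon tree, I assume the 160-node launch is simply serialized on ssh. If there is a knob such as plm_rsh_num_concurrent to raise the number of simultaneous ssh sessions, something like the following might confirm that (untested on our side):

```shell
# Untested: raise the number of concurrent ssh launches for the direct (non-tree) spawn.
n=160
mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 \
    --mca plm_rsh_num_concurrent 160 \
    -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```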