
OMPI 4.0.1 TCP connection errors beyond 86 nodes #6786

Closed
@dfaraj

Description


Background information

We have an OPA cluster of 288 nodes. All nodes run the same OS image, have passwordless ssh set up, and the firewall is disabled. We run basic OSU osu_mbw_mr tests on 2, 4, ..., 86 nodes and the tests complete successfully. Once we hit 88+ nodes we get:

```
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[63011,0],0] on node r1i2n13
  Remote daemon: [[63011,0],40] on node r1i3n17

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   r1i2n13
  target node:  r1i2n14

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
```
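The error text itself suggests the `routed=direct` workaround. A sketch of how that could be applied, reusing the hostfile and binary names from the runs reported below (whether this actually avoids the failure is untested here):

```shell
# Workaround suggested by the error message above: set the MCA param
# routed=direct so daemon messages are routed directly rather than
# through the routing tree. Hostfile/binary names as used in this report.
mpirun --mca routed direct -x PATH -x LD_LIBRARY_PATH \
       -np 88 -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```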

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Downloaded 4.0.1 from the Open MPI site, then:

```shell
./configure --prefix=/store/dfaraj/SW/packages/ompi/4.0.1 CC=icc CXX=icpc FC=ifort --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --with-psm2=/usr --without-verbs --without-psm --without-knem --without-slurm --without-ucx
```

Please describe the system on which you are running

  • Operating system/version: RH 7.6
  • Computer hardware: dual socket Xeon nodes
  • Network type: OPA

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

When we run:

```shell
n=86
mpirun -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```

it works fine.
With:

```shell
n=88
mpirun -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```

we get the TCP error described earlier.
If I do:

```shell
n=88
mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```

it works.
If I set:

```shell
n=160
mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
```

it appears to hang. I don't think it is truly hung, though; it is likely ssh-ing to every node one at a time and just going very slowly.
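A back-of-the-envelope sketch of why disabling tree spawn can look like a hang at 160 nodes. The 0.5 s per-ssh startup cost and the fanout of 32 are assumed figures for illustration, not measured or documented Open MPI values:

```python
# Rough launch-time estimate: sequential ssh (plm_rsh_no_tree_spawn=1)
# vs. a tree spawn. Both the 0.5 s per-daemon ssh cost and the fanout
# of 32 are assumptions for illustration only.
import math

SSH_LATENCY_S = 0.5  # assumed time to start one remote daemon via ssh

def sequential_launch_s(nodes: int) -> float:
    """With tree spawn disabled, the HNP sshes to every node one by one."""
    return nodes * SSH_LATENCY_S

def tree_launch_s(nodes: int, fanout: int = 32) -> float:
    """With tree spawn, each level of daemons starts its children in
    parallel, so cost grows with tree depth (~log_fanout(nodes)),
    not with the node count."""
    levels = max(1, math.ceil(math.log(nodes, fanout)))
    return levels * SSH_LATENCY_S

print(sequential_launch_s(160))  # 80.0 -> over a minute just to launch
print(tree_launch_s(160))        # 1.0  -> two levels of parallel sshes
```

Under these assumptions the sequential launch takes on the order of minutes, which is easy to mistake for a hang.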

