ORTE has lost communication with a remote daemon. #6618

Closed
@tingweiwu

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

 mpirun --version
mpirun.real (OpenRTE) 3.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Install Open MPI

 mkdir /tmp/openmpi && \
    cd /tmp/openmpi && \
    wget https://www.open-mpi.org/software/ompi/v3.1/downloads/openmpi-3.1.2.tar.gz && \
    tar zxf openmpi-3.1.2.tar.gz && \
    cd openmpi-3.1.2 && \
    ./configure --enable-orterun-prefix-by-default && \
    make -j $(nproc) all && \
    make install && \
    ldconfig && \
    rm -rf /tmp/openmpi

Please describe the system on which you are running

  • Operating system/version:
    Ubuntu 16.04
  • Computer hardware:
    V100 GPU + InfiniBand
  • Network type:
    Docker CNI network

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

I get this error frequently, but not every time. It occurs both while the processes are starting and while they are running.

I have checked that the network between i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd and i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0 is OK, and no OOM has occurred.

Do you have any suggestions for finding the cause?
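
For context when reading the output: the lines prefixed with `+` look like `set -x` trace from the rsh agent `/etc/mpi/kubexec.sh`, which `mpirun` uses to start `orted` in each worker pod. The script itself is not included in this report, so the following is only a minimal sketch reconstructed from that trace. The `kubexec` function name and the `KUBECTL` override are hypothetical and exist only so the sketch can be tried without a cluster.

```shell
# Sketch of an rsh-agent wrapper, reconstructed from the trace (an assumption,
# not the actual contents of /etc/mpi/kubexec.sh).
# mpirun calls the agent as: <agent> <host> <command...>
kubexec() {
    pod_name=$1   # first argument: the target pod, taken from the hostfile
    shift         # the remaining arguments form the orted command line
    # KUBECTL is a hypothetical override; the trace shows /opt/kube/kubectl.
    "${KUBECTL:-/opt/kube/kubectl}" exec "$pod_name" -- /bin/sh -c "$*"
}
```

If the agent works this way, every remote `orted` lives inside a `kubectl exec` session, so anything that ends that session or the process inside it would also cut the daemon's connection to the launcher.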

+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 1 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 3 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 2 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 4 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 5 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 6 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 7 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 8 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
command terminated with exit code 137
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[11377,0],0] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd
Remote daemon: [[11377,0],1] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[11377,0],0] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd
Remote daemon: [[11377,0],1] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
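
One detail in the output above that may help narrow things down is the line `command terminated with exit code 137`. By shell convention an exit status above 128 means the process was ended by signal number (status - 128), so 137 corresponds to signal 9. The helper below is a small sketch of that arithmetic; `decode_exit_status` is a hypothetical name, not part of Open MPI or kubectl.

```shell
# Decode a shell exit status into the signal that ended the process, if any.
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # 128 + N is the conventional status for "terminated by signal N";
        # `kill -l N` prints the name of signal N.
        kill -l "$((status - 128))"
    else
        echo "exited on its own with status $status"
    fi
}

decode_exit_status 137   # prints: KILL
```

A SIGKILL does not say who sent it. Since no OOM was observed, other senders (for example the container runtime stopping or restarting the pod) would be worth ruling out, though that is a guess rather than something this output proves.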
