4.1.0: mpirun crashes when trying to schedule a task on a foreign host #8596

Open
@amckinstry

This is on OpenMPI 4.1.0 on Debian.

Background information

mpirun crashes when trying to schedule a task on a foreign host:

$ mpirun --host bob hostname
[alice:705956] [[31919,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/odls/base/odls_base_default_fns.c at line 226
[alice:705956] [[31919,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/plm/base/plm_base_launch_support.c at line 552
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[31919,0],0] FORCE-TERMINATE AT (null):1 - error ../../../../../orte/mca/plm/base/plm_base_launch_support.c(553)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

Here, the mpirun command was issued on computer "alice" and "bob" is a foreign
host reachable via ssh.

Steps to reproduce:

I originally encountered this issue on a small cluster that I am currently
setting up, but I was able to reproduce it locally with two LXC containers.
The following should therefore reproduce the issue:

  • use two Debian machines with a local user that can ssh (via pubkey) from
    one machine to the other

  • make sure that no firewall drops packets between the two

  • install openmpi-bin on the first machine and run the following, where bob
    is the hostname of the second machine

    mpirun --host bob hostname
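
The steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name run_repro is hypothetical and not part of Open MPI, and bob stands for the second machine, as in the log above. It first checks that key-based ssh works without a prompt, then runs the command from the report.

```shell
# Hypothetical helper (not part of Open MPI): verify passwordless ssh to
# the remote host, then run the command from this report against it.
run_repro() {
    remote="$1"

    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so a broken pubkey setup is reported before mpirun is involved.
    if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "$remote" true; then
        echo "ssh to $remote failed; fix pubkey login first" >&2
        return 1
    fi

    # Record the Open MPI version in use, then run the failing command.
    mpirun --version
    mpirun --host "$remote" hostname
}
```

After sourcing the snippet, `run_repro bob` on alice should either print "bob" or reproduce the ORTE error shown above. A failure at the ssh step points to the environment rather than to mpirun.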

What was the outcome of this action?

An internal error in ORTE terminated mpirun.

What outcome did you expect instead?

mpirun --host bob hostname should print "bob" and succeed (or complain loudly
that I am using it wrongly).
