Description
This is on OpenMPI 4.1.0 on Debian.
Background information
mpirun crashes when trying to schedule a task on a foreign host:
$ mpirun --host bob hostname
[alice:705956] [[31919,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/odls/base/odls_base_default_fns.c at line 226
[alice:705956] [[31919,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/plm/base/plm_base_launch_support.c at line 552
--------------------------------------------------------------------------
An internal error has occurred in ORTE:
[[31919,0],0] FORCE-TERMINATE AT (null):1 - error ../../../../../orte/mca/plm/base/plm_base_launch_support.c(553)
This is something that should be reported to the developers.
--------------------------------------------------------------------------
Here, the mpirun command was issued on the computer "alice", and "bob" is a
foreign host reachable via ssh.
Steps to reproduce:
I originally encountered this issue on a small cluster that I am currently
setting up, but I was able to reproduce it locally with two lxc containers.
The following should therefore reproduce the issue:
- Use two Debian computers with a local user that can ssh (via pubkey) from
  one machine to the other.
- Make sure that no firewall drops packets between the two.
- Install openmpi-bin and run
  mpirun --host bob hostname
  (where "bob" is the other machine, as in the example above).
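The steps above can be sketched as a small shell script. This is only a sketch under assumptions that are not part of the report: the remote machine is called "bob" as in the example output, the package is installed with apt-get via sudo, and openmpi-bin is also present on the remote side. The REMOTE and DRY_RUN variables and the run/reproduce helpers are hypothetical names chosen for illustration. By default the script only prints the commands.

```shell
#!/bin/sh
# Reproduction sketch. REMOTE is the other machine ("bob" in the report).
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to run them.
REMOTE="${REMOTE:-bob}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

reproduce() {
    # 1. Pubkey ssh must work. BatchMode=yes makes ssh fail instead of
    #    prompting for a password.
    run ssh -o BatchMode=yes "$REMOTE" true || return 1
    # 2. Install Open MPI from the Debian package on this machine
    #    (assumption: the same package is installed on $REMOTE as well).
    run sudo apt-get install -y openmpi-bin || return 1
    # 3. Launch `hostname` on the remote host; this is the failing command.
    run mpirun --host "$REMOTE" hostname
}

reproduce
```

Running it as `DRY_RUN=0 REMOTE=bob sh reproduce.sh` (the file name is arbitrary) would execute the three steps for real and stop at the first one that fails.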
What was the outcome of this action?
An internal error in ORTE terminated mpirun.
What outcome did you expect instead?
mpirun --host bob hostname should print "bob" and succeed (or complain loudly
that I am using it incorrectly).