Description
I would like to run OpenMPI 4.1.1 on a collection of nodes using a bonded interface.
Small node counts (e.g., 6) sometimes work, but larger node counts (e.g., 25) almost always fail, which looks like a race condition or some other nondeterministic behavior. Here's the error I get:
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: node1390
Local PID: 389375
Peer hostname: node836 ([[19416,1],0])
Source IP of socket: 10.9.4.86
Known IPs of peer:
10.14.4.86
Here's the "ip addr" output for node836, node1390, and node1394. Note the bond on node1390 and node1394; I launch the job from node836.
(openmpi-test) [sharon.brunett@node836 openmpi_cit_test]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
link/ether 98:40:bb:bc:90:85 brd ff:ff:ff:ff:ff:ff
inet 10.9.4.86/16 brd 10.9.255.255 scope global noprefixroute em1
valid_lft forever preferred_lft forever
inet 10.14.4.86/16 brd 10.14.255.255 scope global noprefixroute em1
valid_lft forever preferred_lft forever
3: em2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 98:40:bb:bc:90:88 brd ff:ff:ff:ff:ff:ff
[sharon.brunett@node1390 ~]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
link/ether 0c:c4:7a:97:05:f4 brd ff:ff:ff:ff:ff:ff
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9000 qdisc mq master bond0 state DOWN group default qlen 1000
link/ether 0c:c4:7a:97:05:f5 brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 0c:c4:7a:97:05:f4 brd ff:ff:ff:ff:ff:ff
inet 10.9.6.140/16 brd 10.9.255.255 scope global bond0
valid_lft forever preferred_lft forever
inet 10.14.6.140/16 brd 10.14.255.255 scope global bond0
valid_lft forever preferred_lft forever
[sharon.brunett@node1394 ~]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
link/ether 0c:c4:7a:97:03:c0 brd ff:ff:ff:ff:ff:ff
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9000 qdisc mq master bond0 state DOWN group default qlen 1000
link/ether 0c:c4:7a:97:03:c1 brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 0c:c4:7a:97:03:c0 brd ff:ff:ff:ff:ff:ff
inet 10.9.6.144/16 brd 10.9.255.255 scope global bond0
valid_lft forever preferred_lft forever
inet 10.14.6.144/16 brd 10.14.255.255 scope global bond0
valid_lft forever preferred_lft forever
Here's the little test. If I drop the --barrier flag, the test does not generate errors (but I need the barrier for the larger code). I wonder why?
import sys
import logging as log
import time
import platform
from mpi4py import MPI as mpi

mpi_comm = mpi.COMM_WORLD
mpi_rank = mpi_comm.Get_rank()

log_format = "%(asctime)s [host {0} rank {1:d}] %(message)s".format(platform.node(), mpi_rank)
log.basicConfig(format=log_format, level=log.INFO)

def master():
    # rank=0 will just sit waiting for data and printing it as it comes
    while True:
        stuff = mpi_comm.gather(None, root=0)
        log.info('Got: %s', stuff)

def compute():
    # rank>0 will keep sending data periodically
    # at a rate proportional to (rank - 1)
    iteration = 0
    while True:
        stuff = 'rank {:d} at iteration {:d}'.format(mpi_rank, iteration)
        log.info('Sending: %s', stuff)
        t0 = time.time()
        mpi_comm.gather(stuff, root=0)
        t1 = time.time()
        log.info('gather() blocked for %f s', t1 - t0)
        sleep_time = 1 * mpi_rank
        log.info('Sleeping for %d s', sleep_time)
        time.sleep(sleep_time)
        iteration += 1

log.info('Ready')
if len(sys.argv) > 1 and sys.argv[1] == '--barrier':
    mpi_comm.Barrier()
if mpi_rank == 0:
    master()
else:
    compute()
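For context, the job is launched with a plain mpirun, roughly like this (the rank count, hostfile, and script name below are just placeholders, not the exact command):

mpirun -np 25 --hostfile hosts python gather_test.py --barrier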
Thanks for your inputs on what to try next.
Sharon