Description
This issue is related to: #5818
I am encountering the following error when running the CP2K container (https://catalog.ngc.nvidia.com/orgs/hpc/containers/cp2k) on NERSC's Perlmutter system (https://docs.nersc.gov/systems/perlmutter/architecture/):

```
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          nid002292
  Local PID:           1273838
  Peer hostname:       nid002293 ([[9279,0],1])
  Source IP of socket: 10.249.13.210
  Known IPs of peer:
    10.100.20.22
    128.55.69.127
    10.249.13.209
    10.249.36.5
    10.249.34.5
```

We've tried Open MPI v4.1.2rc2 and v4.1.5.
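For reference, the version reported inside the container can be double-checked with `mpirun --version` (4.1.5 shown below as an example; the first line is Open MPI's usual output format):

```
$ mpirun --version
mpirun (Open MPI) 4.1.5
```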
Background
Perlmutter's GPU nodes have 4 NICs, each with a private IP address. One NIC (the one corresponding to the hsn0 interface) has an additional public IP address -- therefore each node has one NIC with two addresses, and these addresses are in different subnets. E.g.:
```
blaschke@nid200257:~> ip -4 -f inet addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: nmn0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    altname enp195s0
    inet 10.100.108.32/22 brd 10.100.111.255 scope global nmn0
       valid_lft forever preferred_lft forever
3: hsn0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
    altname enp194s0
    inet 10.249.41.248/16 brd 10.249.255.255 scope global hsn0
       valid_lft forever preferred_lft forever
    inet 128.55.84.128/19 brd 128.55.95.255 scope global hsn0:chn
       valid_lft forever preferred_lft forever
4: hsn1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
    altname enp129s0
    inet 10.249.41.232/16 brd 10.249.255.255 scope global hsn1
       valid_lft forever preferred_lft forever
5: hsn2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
    altname enp66s0
    inet 10.249.41.231/16 brd 10.249.255.255 scope global hsn2
       valid_lft forever preferred_lft forever
6: hsn3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
    altname enp1s0
    altname ens3
    inet 10.249.41.247/16 brd 10.249.255.255 scope global hsn3
       valid_lft forever preferred_lft forever
```
(The example above also shows the node management network interface nmn0 -- but MPI shouldn't be talking to that anyway.)

I think the error must be caused by hsn0's two IP addresses being on two different subnets.
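If that is indeed the cause, a possible workaround (untested here -- the subnet is taken from the `ip` output above) would be to restrict Open MPI's TCP connections to the private HSN subnet, so the public 128.55.x.x address on hsn0 is never matched:

```
# Limit the TCP BTL (MPI traffic) and the OOB channel (runtime wire-up)
# to the private high-speed-network subnet, given in CIDR notation:
mpirun --mca btl_tcp_if_include 10.249.0.0/16 \
       --mca oob_tcp_if_include 10.249.0.0/16 \
       <application>
```

The CIDR form matches on the address itself rather than the interface name, which seems preferable here since the private and public addresses both sit on the same hsn0 interface.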