Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
from a source/distribution tarball
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
N/A — installed from a tarball, not a git clone.
Please describe the system on which you are running
- Operating system/version: CentOS Linux release 7.6.1810 (AltArch), kernel 4.14.0-115.el7a.0.1.aarch64
- Computer hardware:
[nscc-gz@centos203 examples]$ lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 4
Model: 0
BogoMIPS: 200.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 65536K
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
NUMA node2 CPU(s): 64-95
NUMA node3 CPU(s): 96-127
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop
- Network type: Mellanox mlx5 (MT4117) NICs with an Ethernet link layer (RoCE); see the ibstat and ifconfig output below.
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Hi, when I run hello_c, I get the following output:
[nscc-gz@centos203 examples]$ mpirun -np 4 --mca orte_base_help_aggregate 0 hello_c
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: centos203
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[the same help message is printed three more times, once per rank]
Hello, world, I am 0 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 1 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 2 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 3 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
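For reference, hello_c is the stock example that ships in Open MPI's examples/ directory; a minimal sketch of an equivalent program (the shipped file may differ in minor details, e.g. it also prints MPI_VERSION) is:

/* Minimal MPI hello world, roughly equivalent to examples/hello_c.c. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, len;
    char version[MPI_MAX_LIBRARY_VERSION_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_library_version(version, &len);
    printf("Hello, world, I am %d of %d, (%s)\n", rank, size, version);
    MPI_Finalize();

    return 0;
}

It is built with the Open MPI compiler wrapper, e.g. mpicc hello_c.c -o hello_c.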
And here is the ibstat output:
[nscc-gz@centos203 examples]$ ibstat
CA 'mlx5_0'
    CA type: MT4117
    Number of ports: 1
    Firmware version: 14.20.1820
    Hardware version: 0
    Node GUID:
    System image GUID:
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 25
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask:
        Port GUID:
        Link layer: Ethernet
CA 'mlx5_1'
    CA type: MT4117
    Number of ports: 1
    Firmware version: 14.20.1820
    Hardware version: 0
    Node GUID:
    System image GUID:
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask:
        Port GUID:
        Link layer: Ethernet
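For extra context (purely illustrative, not something from my original runs), the per-port information that ibstat prints can also be queried through libibverbs; a minimal sketch, assuming the verbs development headers are installed and building with gcc check_ports.c -o check_ports -libverbs, would look like this:

/* Illustrative only: query the same per-port state that ibstat shows,
 * using libibverbs. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices, i;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);

    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (i = 0; i < num_devices; ++i) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_port_attr attr;

        if (!ctx)
            continue;
        /* Both adapters above report a single port, so query port 1. */
        if (ibv_query_port(ctx, 1, &attr) == 0) {
            printf("%s port 1: state=%s link_layer=%s\n",
                   ibv_get_device_name(devs[i]),
                   ibv_port_state_str(attr.state),
                   attr.link_layer == IBV_LINK_LAYER_ETHERNET ?
                       "Ethernet" : "InfiniBand");
        }
        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}

On this node it should report mlx5_0 port 1 as active with an Ethernet link layer (i.e. RoCE rather than native InfiniBand), matching the ibstat output above.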
If I use this command:
[nscc-gz@centos203 examples]$ mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm -np 4 hello_c
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: centos203
Local device:
Local port: 1
CPCs attempted: rdmacm
--------------------------------------------------------------------------
Hello, world, I am 0 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 1 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 2 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 3 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
[centos203:10977] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[centos203:10977] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
And if I specify the IB device to exclude:
[nscc-gz@centos203 examples]$ mpirun -np 4 ./hello_c --mca btl_openib_if_exclude mlx5_0
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: centos203
--------------------------------------------------------------------------
Hello, world, I am 0 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 1 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 2 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 3 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
[centos203:14896] 3 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[centos203:14896] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[nscc-gz@centos203 examples]$
Here is the ifconfig output and the IB port mapping:
[nscc-gz@centos203 examples]$ ibdev2netdev
mlx5_0 port 1 ==> enp1s0f0 (Up)
mlx5_1 port 1 ==> enp1s0f1 (Down)
[nscc-gz@centos203 examples]$ ifconfig
enp125s0f0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether txqueuelen 1000 (Ethernet)
RX packets 1951751285 bytes 2352472322729 (2.1 TiB)
RX errors 0 dropped 11718888 overruns 0 frame 0
TX packets 822856179 bytes 1385364963277 (1.2 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp125s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.29.130 netmask 255.255.255.0 broadcast 172.16.29.255
inet6 prefixlen 64 scopeid 0x20<link>
ether txqueuelen 1000 (Ethernet)
RX packets 19347918 bytes 7289410117 (6.7 GiB)
RX errors 0 dropped 2958451 overruns 0 frame 0
TX packets 12963627 bytes 48203399135 (44.8 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp125s0f2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp125s0f3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp1s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.40.1.203 netmask 255.255.255.0 broadcast 10.40.1.255
inet6 prefixlen 64 scopeid 0x20<link>
ether txqueuelen 1000 (Ethernet)
RX packets 382158355530 bytes 544487865896139 (495.2 TiB)
RX errors 208 dropped 3083040 overruns 0 frame 208
TX packets 379357423669 bytes 545429402206655 (496.0 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp1s0f1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 32048729965 bytes 809795856471103 (736.5 TiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 32048729965 bytes 809795856471103 (736.5 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[nscc-gz@centos203 examples]$
Could you tell me how I can use the IB devices correctly? Thanks!