## Background information

### What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

`main` and `v5.0.x` branches

### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

```
./configure ... --with-libfabric=<libfabric that has prov/shm enabled>
```

### If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`
Submodules are not relevant in this case, but a libfabric build containing ofiwg/libfabric@7e84ace#diff-9a108fdcddc323ee5ec91488c1fbdd907d733c960c0bdcddd507803ec1bf3081 is required.
### Please describe the system on which you are running

- Operating system/version: Tested on Amazon Linux 2, but the issue is not OS-specific
- Computer hardware: hpc6a.48xlarge EC2 instance
- Network type: EFA
## Details of the problem

The problem can be revealed by one-sided applications. We reproduced it with the Intel MPI Benchmarks (IMB):
```
$ mpirun --map-by ppr:1:node -n 2 --hostfile hostfile --mca btl_ofi_verbose 1 --mca btl ^tcp mpi-benchmarks-IMB-v2021.7/IMB-RMA All_put_all
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:587: mtl:ofi:provider_include = "(null)"
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:590: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream,usnic,net"
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:725: EFA specific fi_getinfo(): No data available
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:765: fi_getinfo(): No data available
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:725: EFA specific fi_getinfo(): Success
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:344: mtl:ofi:provider: efa
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:369: mtl:ofi:provider:domain: rdmap0s6-rdm
[ip-172-31-20-174.us-east-2.compute.internal:35762] btl_ofi_component.c:308: btl:ofi:provider_include = "(null)"
[ip-172-31-20-174.us-east-2.compute.internal:35762] btl_ofi_component.c:310: btl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream,usnic,net"
[ip-172-31-20-174.us-east-2.compute.internal:35762] btl_ofi_component.c:69: btl:ofi: "shm" in exclude list
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:587: mtl:ofi:provider_include = "(null)"
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:590: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream,usnic,net"
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:725: EFA specific fi_getinfo(): No data available
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:765: fi_getinfo(): No data available
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:725: EFA specific fi_getinfo(): Success
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:344: mtl:ofi:provider: efa
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:369: mtl:ofi:provider:domain: rdmap0s6-rdm
[ip-172-31-24-238.us-east-2.compute.internal:32618] btl_ofi_component.c:308: btl:ofi:provider_include = "(null)"
[ip-172-31-24-238.us-east-2.compute.internal:32618] btl_ofi_component.c:310: btl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream,usnic,net"
[ip-172-31-24-238.us-east-2.compute.internal:32618] btl_ofi_component.c:69: btl:ofi: "shm" in exclude list
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.7, MPI-RMA part
#----------------------------------------------------------------
# Date                  : Sat Jan 13 02:34:41 2024
# Machine               : x86_64
# System                : Linux
# Release               : 5.10.205-195.804.amzn2.x86_64
# Version               : #1 SMP Fri Jan 5 01:22:18 UTC 2024
# MPI Version           : 3.1
# MPI Thread Environment:
# Calling sequence was:
# mpi-benchmarks-IMB-v2021.7/IMB-RMA All_put_all
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#
# List of Benchmarks to run:
# All_put_all
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[13163,1],1]) is on host: ip-172-31-24-238
  Process 2 ([[13163,1],0]) is on host: ip-172-31-20-174
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[ip-172-31-24-238:32618] *** Process received signal ***
[ip-172-31-24-238:32618] Signal: Segmentation fault (11)
[ip-172-31-24-238:32618] Signal code: Address not mapped (1)
[ip-172-31-24-238:32618] Failing at address: 0xb8
[ip-172-31-24-238:32618] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7f65ada1d8e0]
[ip-172-31-24-238:32618] [ 1] /opt/amazon/openmpi5/lib64/openmpi/mca_osc_rdma.so(+0x22120)[0x7f659f2a0120]
[ip-172-31-24-238:32618] [ 2] /opt/amazon/openmpi5/lib64/openmpi/mca_osc_rdma.so(ompi_osc_rdma_new_peer+0x49)[0x7f659f2a06c9]
[ip-172-31-24-238:32618] [ 3] /opt/amazon/openmpi5/lib64/openmpi/mca_osc_rdma.so(ompi_osc_rdma_peer_lookup+0x87)[0x7f659f2a0907]
[ip-172-31-24-238:32618] [ 4] /opt/amazon/openmpi5/lib64/openmpi/mca_osc_rdma.so(+0x1b239)[0x7f659f299239]
[ip-172-31-24-238:32618] [ 5] /opt/amazon/openmpi5/lib64/libmpi.so.40(ompi_osc_base_select+0x13b)[0x7f65ae5eb13b]
[ip-172-31-24-238:32618] [ 6] /opt/amazon/openmpi5/lib64/libmpi.so.40(ompi_win_create+0x93)[0x7f65ae5642c3]
[ip-172-31-24-238:32618] [ 7] /opt/amazon/openmpi5/lib64/libmpi.so.40(MPI_Win_create+0xc8)[0x7f65ae5aa498]
[ip-172-31-24-238:32618] [ 8] mpi-benchmarks-IMB-v2021.7/IMB-RMA[0x44b8f5]
[ip-172-31-24-238:32618] [ 9] mpi-benchmarks-IMB-v2021.7/IMB-RMA[0x42d65d]
[ip-172-31-24-238:32618] [10] mpi-benchmarks-IMB-v2021.7/IMB-RMA[0x436f01]
[ip-172-31-24-238:32618] [11] mpi-benchmarks-IMB-v2021.7/IMB-RMA[0x405a6e]
[ip-172-31-24-238:32618] [12] /lib64/libc.so.6(__libc_start_main+0xea)[0x7f65ad68013a]
[ip-172-31-24-238:32618] [13] mpi-benchmarks-IMB-v2021.7/IMB-RMA[0x40442a]
[ip-172-31-24-238:32618] *** End of error message ***
```
This is because, with the libfabric change, the shm provider now also satisfies btl/ofi's capability requirement (`FI_HMEM | FI_ATOMIC | FI_RMA`), and it was later ignored because it is on the exclude list. As a result, btl/ofi did not select any provider :(

In this case, the user's intention was to use another provider, e.g. efa, that does not support `FI_HMEM`, but that didn't happen because shm was returned by `fi_getinfo(...)` first.

This behavior was introduced in 5.0.x due to the optional `FI_HMEM` check.
## Proposed solution

- I think we should refactor the provider selection logic of both mtl/ofi and btl/ofi to respect the `{mtl,btl}_ofi_provider_{include,exclude}` MCA parameters. Specifically, right after each `fi_getinfo(...)` call, we should first apply the include/exclude filter, and return an error if no qualified provider is found.
- For this particular problem, I'm surprised that shm was selected at all, since it only supports intra-node communication. I wonder whether we should also request `FI_REMOTE_COMM | FI_LOCAL_COMM`, the same as mtl/ofi.
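The second point amounts to tightening the hints passed to `fi_getinfo()`. The fragment below uses the real libfabric API (`fi_allocinfo`, `fi_getinfo`, `fi_freeinfo`, and the `FI_*` capability bits), but the wrapper function, its name, and the `FI_VERSION(1, 9)` choice are illustrative, not existing Open MPI code; it is not runnable without a libfabric installation. Since shm advertises `FI_LOCAL_COMM` but not `FI_REMOTE_COMM`, requesting both scopes keeps shm out of the result list entirely.

```c
#include <rdma/fabric.h>

/* query_rma_providers() is a hypothetical helper, not an existing
 * Open MPI function. */
static struct fi_info *query_rma_providers(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    if (NULL == hints) {
        return NULL;
    }

    /* btl/ofi's current requirement, plus the communication scopes
     * that mtl/ofi already requests.  An intra-node-only provider
     * such as shm then never appears in the returned list. */
    hints->caps = FI_RMA | FI_ATOMIC | FI_HMEM
                | FI_REMOTE_COMM | FI_LOCAL_COMM;

    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info) != 0) {
        info = NULL;
    }

    fi_freeinfo(hints);
    return info;   /* caller applies include/exclude filtering next */
}
```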
## Mitigation

- We can force `fi_getinfo` to return specific providers with `-x FI_PROVIDER=<the desired provider>`.
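For example (assuming libfabric's standard `FI_PROVIDER` provider-filter environment variable; the exact `mpirun` options depend on the launcher and are taken from the reproducer above):

```shell
# Pin libfabric to one provider so fi_getinfo() never returns shm.
export FI_PROVIDER=efa

# Equivalently, export it to every rank at launch time:
#   mpirun -x FI_PROVIDER=efa --map-by ppr:1:node -n 2 --hostfile hostfile \
#          mpi-benchmarks-IMB-v2021.7/IMB-RMA All_put_all
echo "FI_PROVIDER=$FI_PROVIDER"
```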