## Background information
### What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.0rc7
### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a source tarball, configured with:

```sh
../configure CFLAGS="-g" --prefix=${PREFIX} --with-ucx=${UCX_PREFIX} --disable-man-pages --with-pmix=internal --with-hwloc=internal --with-libevent=internal --without-hcoll
```
### Please describe the system on which you are running
- Operating system/version: Linux 3.10.0-514.26.2.el7.x86_64
- Computer hardware: Intel Xeon Gold 6154 x 2 (36 cores in total)
- Network type: InfiniBand EDR 4x (100Gbps)
## Details of the problem
`MPI_Get` causes an internal error under a specific condition.
The error I got:

```text
[1660029730.372314] [sca1282:136709:0] ib_md.c:379 UCX ERROR ibv_exp_reg_mr(address=0x2afde7ec4000, length=4096, access=0xf) failed: Resource temporarily unavailable
[1660029730.372354] [sca1282:136709:0] ucp_mm.c:143 UCX ERROR failed to register address 0x2afde7ec4000 mem_type bit 0x1 length 4096 on md[4]=mlx5_0: Input/output error (md reg_mem_types 0x1)
[1660029730.372365] [sca1282:136709:0] ucp_request.c:356 UCX ERROR failed to register user buffer datatype 0x8 address 0x2afde7ec4000 len 4096: Input/output error
[sca1282:136709] ../../../../../opal/mca/common/ucx/common_ucx_wpool.h:376 Error: ucp_get_nbi failed: -3
[sca1282:00000] *** An error occurred in MPI_Get
[sca1282:00000] *** reported by process [2902982657,0]
[sca1282:00000] *** on win ucx window 3
[sca1282:00000] *** MPI_ERR_OTHER: known error not in list
[sca1282:00000] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[sca1282:00000] *** and MPI will try to terminate your MPI job as well)
```
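On a side note, the abort comes from the default `MPI_ERRORS_ARE_FATAL` handler attached to the window, as the last lines of the log show. If it helps with debugging, the failure can presumably also be observed as a returned error code by switching the window's error handler. Below is a minimal sketch of a fragment for the reproducer that follows (it reuses `win`, `buf`, `i`, `block_size`, `target_rank`, and `target_disp` from that code, and assumes the UCX one-sided component reports this failure through the window error handler as the log suggests):

```c
/* Right after MPI_Win_allocate(): make RMA errors on this window
 * return an error code instead of aborting the whole job. */
MPI_Win_set_errhandler(win, MPI_ERRORS_RETURN);

/* Then check the result of the MPI_Get call inside the loop. */
int err = MPI_Get(buf + i * block_size, block_size, MPI_BYTE,
                  target_rank, target_disp, block_size, MPI_BYTE,
                  win);
if (err != MPI_SUCCESS) {
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(err, msg, &len);
    fprintf(stderr, "MPI_Get failed: %s\n", msg); /* needs <stdio.h> */
}
```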
Here is minimal code to reproduce this error:
```c
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* size_t array_size = (size_t)128 * 1024 * 1024; // OK */
    size_t array_size = (size_t)1024 * 1024 * 1024; // error

    /* size_t block_size = 1024;  // OK */
    /* size_t block_size = 2048;  // OK */
    size_t block_size = 4096; // error
    /* size_t block_size = 8192;  // error */
    /* size_t block_size = 16384; // error */
    /* size_t block_size = 32768; // OK */
    /* size_t block_size = 65536; // OK */

    size_t local_size = array_size / nranks;

    /* char* buf = (char*)aligned_alloc(2048, array_size); // OK */
    char* buf = (char*)aligned_alloc(4096, array_size); // error

    void* baseptr;
    MPI_Win win;
    MPI_Win_allocate(local_size,
                     1,
                     MPI_INFO_NULL,
                     MPI_COMM_WORLD,
                     &baseptr,
                     &win);

    MPI_Win_lock_all(0, win);

    /* int interleave = 0; // OK */
    int interleave = 1; // error

    if (rank == 0) {
        for (size_t i = 0; i < array_size / block_size; i++) {
            int target_rank;
            size_t target_disp;
            if (interleave) {
                /* round-robin: block i is fetched from rank i % nranks */
                target_rank = i % nranks;
                target_disp = i / nranks * block_size;
            } else {
                /* contiguous: consecutive blocks are fetched from the same rank */
                target_rank = i * block_size / local_size;
                target_disp = i * block_size - target_rank * local_size;
            }
            if (target_rank != rank) {
                MPI_Get(buf + i * block_size,
                        block_size,
                        MPI_BYTE,
                        target_rank,
                        target_disp,
                        block_size,
                        MPI_BYTE,
                        win);
            }
        }
        MPI_Win_flush_all(win);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    free(buf);

    MPI_Finalize();
    return 0;
}
```
In this code, rank 0 gathers data from all other ranks into a single local array. `MPI_Get` is issued for each target rank at the granularity of `block_size`.
The above error happens under the following conditions:

- only when the array size (`array_size`) is large enough (in this case 1 GB)
- only when the block size (`block_size`) for each `MPI_Get` is 4096, 8192, or 16384
- only when the local array (`buf`) is aligned to 4096 bytes
- only with the interleave policy (`interleave=1` means that rank 0 chooses the target rank for `MPI_Get` in a round-robin fashion; see the sketch after this list)
- only when two or more processes are spawned on different nodes (not intra-node)

Otherwise, the error did not happen in my environment.
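To make the interleave policy concrete, here is a small standalone sketch, separate from the reproducer, that only reuses the same index arithmetic and prints which target rank and displacement each block maps to, assuming `nranks = 2` and the failing `block_size = 4096`. It needs no MPI and can be built with any C compiler:

```c
#include <stdio.h>
#include <stdlib.h>

/* Prints the first few (block -> target rank, displacement) mappings,
 * computed exactly as in the reproducer above. No MPI calls are made. */
int main(void) {
    size_t array_size = (size_t)1024 * 1024 * 1024; // 1 GB, as in the failing case
    size_t block_size = 4096;                       // failing block size
    int    nranks     = 2;                          // 2 processes, 1 per node
    size_t local_size = array_size / nranks;

    for (size_t i = 0; i < 8; i++) {
        /* interleave = 1: round-robin over ranks, so consecutive 4096-byte
         * blocks of the local buffer target alternating ranks */
        int    rr_rank = i % nranks;
        size_t rr_disp = i / nranks * block_size;

        /* interleave = 0: consecutive blocks target the same rank */
        int    ct_rank = i * block_size / local_size;
        size_t ct_disp = i * block_size - (size_t)ct_rank * local_size;

        printf("block %zu: interleave rank=%d disp=%zu | contiguous rank=%d disp=%zu\n",
               i, rr_rank, rr_disp, ct_rank, ct_disp);
    }
    return 0;
}
```

With `interleave=1`, adjacent 4096-byte blocks of `buf` are fetched from alternating ranks, whereas with `interleave=0` they are fetched from one rank at a time; only the former triggers the error here.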
Compile the code (`test.c`) and run on 2 nodes (1 process/node):
```sh
mpicc test.c
mpirun -n 2 -N 1 ./a.out
```