Description
Background information
For certain research reasons, I need to use Open MPI with the ob1 PML in conjunction with InfiniBand (the uct BTL from UCX) on our cluster. This works largely without problems.
In my particular program I split a matrix into sub-matrices (using derived datatypes) and distribute them to all processes with MPI_Scatter.
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
Version 5.0.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
+6f81bfd163f3275d2b0630974968c82759dd4439 3rd-party/openpmix (v1.1.3-3983-g6f81bfd1)
+4f27008906d96845e22df6502d6a9a29d98dec83 3rd-party/prrte (psrvr-v2.0.0rc1-4746-g4f27008906)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)
Please describe the system on which you are running
- Operating system/version: NixOS
- Computer hardware: Cluster with AMD Opteron 6272 processors
- Network type: InfiniBand (mlx4_0) / Ethernet
Details of the problem
The scatter always ends in a deadlock when the matrix is sufficiently large (900x900 in the example below; 90x90 still works) and ob1 is used in conjunction with the uct BTL.
(I am using Slurm batch scripts to launch all jobs.)
mpirun -n 81 --mca pml ucx ./test --> works
mpirun -n 81 --mca pml ob1 ./test --> deadlock
mpirun -n 81 --mca pml ob1 --mca btl ^uct ./test --> works
If I use a plain MPI datatype (e.g. MPI_DOUBLE) instead of the derived datatype, everything also works with uct. So the problem is clearly tied to the derived datatype being used.
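For comparison, here is a minimal sketch of the plain-datatype variant that does not hang for me. Note that it simply scatters contiguous chunks of block_size * block_size ints, so the data layout differs from the subarray version; the point is only that the derived datatype is replaced by plain MPI_INT.

#include <mpi.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int N = 900;
    int block_size = N / (int)sqrt(size);
    int chunk = block_size * block_size;  /* elements per process */

    int *matrix = NULL;
    if (rank == 0) {
        /* Same total amount of data as the N x N matrix (size * chunk == N * N). */
        matrix = (int *)malloc((size_t)size * chunk * sizeof(int));
        for (int i = 0; i < size * chunk; i++) {
            matrix[i] = i;
        }
    }

    int *recvbuf = (int *)malloc(chunk * sizeof(int));
    /* Plain MPI_INT on both sides: no deadlock, even with pml ob1 + btl uct. */
    MPI_Scatter(matrix, chunk, MPI_INT, recvbuf, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    free(recvbuf);
    if (rank == 0) {
        free(matrix);
    }
    MPI_Finalize();
    return 0;
}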
MPI Program:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int N = 900;                          /* matrix dimension (900 hangs, 90 works) */
    int block_size = N / (int)sqrt(size); /* edge length of one square sub-matrix */

    /* Rank 0 allocates and fills the full matrix. */
    int *matrix = NULL;
    if (rank == 0) {
        matrix = (int *)malloc(N * N * sizeof(int));
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                matrix[i * N + j] = i * N + j;
            }
        }
    }

    /* Derived datatype: a block_size x block_size subarray of the N x N matrix,
       resized so that consecutive blocks start block_size ints apart. */
    MPI_Datatype blocktype, blocktype_resized;
    int array_of_sizes[2]    = {N, N};
    int array_of_subsizes[2] = {block_size, block_size};
    int array_of_starts[2]   = {0, 0};
    MPI_Type_create_subarray(2, array_of_sizes, array_of_subsizes, array_of_starts,
                             MPI_ORDER_C, MPI_INT, &blocktype);
    MPI_Type_create_resized(blocktype, 0, block_size * sizeof(int), &blocktype_resized);
    MPI_Type_commit(&blocktype_resized);

    int *recvbuf = (int *)malloc(block_size * block_size * sizeof(int));

    /* This is the call that deadlocks with pml ob1 + btl uct when N = 900. */
    MPI_Scatter(matrix, 1, blocktype_resized,
                recvbuf, block_size * block_size, MPI_INT,
                0, MPI_COMM_WORLD);

    free(recvbuf);
    if (rank == 0) {
        free(matrix);
    }
    MPI_Type_free(&blocktype_resized);
    MPI_Finalize();
    return 0;
}
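For completeness: I compile the reproducer with the Open MPI wrapper compiler, roughly mpicc test.c -o test -lm (the -lm is needed for sqrt), and launch it with the mpirun commands listed above.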