Description
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
OpenMPI 4.1.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Downloaded sources from https://www.open-mpi.org/software/ompi/v4.1/ and compiled with
./configure --enable-mem-debug --enable-mem-profile --enable-debug
on Ubuntu 20.04 x64.
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
- Operating system/version: Ubuntu 20.04
- Computer hardware: x64 laptop
- Network type: no network
Details of the problem
The in-place variant of MPI_Alltoallw crashes, as demonstrated by the following test program:
#include "mpi.h"
#include <vector>

int main() {
    MPI_Init(nullptr, nullptr);
    int size, rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Every element of v starts out equal to the local rank.
    std::vector<double> v(size, rank);

    // types[i] selects the single double at index i of v.
    std::vector<MPI_Datatype> types;
    for (int i{0}; i < size; ++i) {
        const int length[1] = {1};
        const int displacement[1] = {i};
        MPI_Datatype new_type;
        MPI_Type_indexed(1, length, displacement, MPI_DOUBLE, &new_type);
        MPI_Type_commit(&new_type);
        types.push_back(new_type);
    }

    // One element per peer; the offsets are carried by the datatypes.
    std::vector<int> counts(size, 1);
    std::vector<int> displacements(size, 0);

    // In-place exchange: the send arguments are ignored.
    MPI_Alltoallw(MPI_IN_PLACE, nullptr, nullptr, nullptr, v.data(), counts.data(),
                  displacements.data(), types.data(), MPI_COMM_WORLD);
    MPI_Finalize();
}
The above program essentially implements a standard in-place MPI_Alltoall and is not particularly useful. It is just for demonstration purposes. Running the program with
$ mpirun -np 4 debug
yields
[tron:93567] pmix_mca_base_component_repository_open: unable to open mca_pnet_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[tron:93567] pmix_mca_base_component_repository_open: unable to open mca_pnet_test: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
free(): invalid pointer
[tron:93571] *** Process received signal ***
[tron:93571] Signal: Aborted (6)
[tron:93571] Signal code: (-6)
double free or corruption (out)
[tron:93572] *** Process received signal ***
[tron:93572] Signal: Aborted (6)
[tron:93572] Signal code: (-6)
[tron:93572] [ 0] double free or corruption (out)
[tron:93574] *** Process received signal ***
[tron:93574] Signal: Aborted (6)
[tron:93574] Signal code: (-6)
[tron:93574] [ 0] [tron:93571] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fcb19341210]
[tron:93571] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f065cef2210]
[tron:93574] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fb7f6a2b210]
[tron:93572] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fb7f6a2b18b]
[tron:93572] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fcb1934118b]
[tron:93571] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f065cef218b]
[tron:93574] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fcb19320859]
[tron:93571] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fb7f6a0a859]
[tron:93572] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f065ced1859]
[tron:93574] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7fcb1938b3ee]
[tron:93571] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7f065cf3c3ee]
[tron:93574] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7fb7f6a753ee]
[tron:93572] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7fcb1939347c]
[tron:93571] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x99cac)[0x7fcb19394cac]
[tron:93571] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7fb7f6a7d47c]
[tron:93572] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7f065cf4447c]
[tron:93574] [ 5] /usr/local/lib/libopen-pal.so.40(opal_free+0x23)[0x7fcb190f5974]
[tron:93571] [ 7] /usr/local/lib/openmpi/mca_coll_basic.so(+0x473d)[0x7fcb1815073d]
[tron:93571] [ 8] /usr/local/lib/openmpi/mca_coll_basic.so(mca_coll_basic_alltoallw_intra+0x9b)[0x7fcb181507f1]
[tron:93571] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x9a120)[0x7f065cf46120]
[tron:93574] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x9a120)[0x7fb7f6a7f120]
[tron:93572] [ 6] /usr/local/lib/libmpi.so.40(PMPI_Alltoallw+0x5ad)[0x7fcb1979eebd]
[tron:93571] [10] debug(+0x1510)[0x556b2d686510]
[tron:93571] [11] /usr/local/lib/libopen-pal.so.40(opal_free+0x23)[0x7fb7f67df974]
[tron:93572] [ 7] /usr/local/lib/openmpi/mca_coll_basic.so(+0x473d)[0x7fb7f443873d]
/usr/local/lib/libopen-pal.so.40(opal_free+0x23)[0x7f065cca6974]
[tron:93574] [ 7] /usr/local/lib/openmpi/mca_coll_basic.so(+0x473d)[0x7f06568fd73d]
[tron:93574] [ 8] /usr/local/lib/openmpi/mca_coll_basic.so(mca_coll_basic_alltoallw_intra+0x9b)[0x7f06568fd7f1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fcb193220b3]
[tron:93571] [12] debug(+0x122e)[0x556b2d68622e]
[tron:93571] *** End of error message ***
[tron:93572] [ 8] /usr/local/lib/openmpi/mca_coll_basic.so(mca_coll_basic_alltoallw_intra+0x9b)[0x7fb7f44387f1]
[tron:93572] [ 9] [tron:93574] [ 9] /usr/local/lib/libmpi.so.40(PMPI_Alltoallw+0x5ad)[0x7fb7f6e88ebd]
[tron:93572] [10] debug(+0x1510)[0x55a7339ab510]
[tron:93572] [11] /usr/local/lib/libmpi.so.40(PMPI_Alltoallw+0x5ad)[0x7f065d34febd]
[tron:93574] [10] debug(+0x1510)[0x55b4c76f8510]
[tron:93574] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fb7f6a0c0b3]
[tron:93572] [12] debug(+0x122e)[0x55a7339ab22e]
[tron:93572] *** End of error message ***
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f065ced30b3]
[tron:93574] [12] debug(+0x122e)[0x55b4c76f822e]
[tron:93574] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node tron exited on signal 6 (Aborted).
--------------------------------------------------------------------------
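For reference, here is a minimal serial sketch of the result the test program is expected to produce if the in-place exchange completes. It makes no MPI calls. The function name simulate_in_place_alltoall and the buffers variable are hypothetical, introduced only for illustration. The sketch assumes in-place semantics where rank r sends element j of v to rank j and receives rank j's element r into the same slot.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Serial model of the exchange in the test program (no MPI involved).
// buffers[r] plays the role of the vector v on rank r.
std::vector<std::vector<double>> simulate_in_place_alltoall(int size) {
    assert(size >= 0);
    const auto n = static_cast<std::size_t>(size);

    // As in the test program, every element on rank r starts out equal to r.
    std::vector<std::vector<double>> buffers(n);
    for (std::size_t r = 0; r < n; ++r) {
        buffers[r].assign(n, static_cast<double>(r));
    }

    // In-place all-to-all of one element per peer: slot j on rank r is
    // exchanged with slot r on rank j. Visiting each pair once is enough.
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t j = r + 1; j < n; ++j) {
            std::swap(buffers[r][j], buffers[j][r]);
        }
    }
    return buffers;
}
```

Under these assumptions, every rank ends up holding the same vector, with element j equal to j. With 4 processes, v would be {0, 1, 2, 3} on each rank, which is what a successful run of the test program should show.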