in-place MPI_Alltoallw crashes #9329

Closed
@rabauke

Description

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 4.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Downloaded sources from https://www.open-mpi.org/software/ompi/v4.1/ and compiled with

./configure --enable-mem-debug --enable-mem-profile --enable-debug

on Ubuntu 20.04 x64.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 20.04
  • Computer hardware: x64 laptop
  • Network type: no network

Details of the problem

The in-place variant of MPI_Alltoallw crashes, as demonstrated by the following test program:

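#include "mpi.h"
#include <vector>

int main() {
  MPI_Init(nullptr, nullptr);

  int size, rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  // Every rank starts with `size` elements, all equal to its own rank.
  std::vector<double> v(size, rank);

  // types[i] selects the single double v[i]: one block of length 1 at
  // displacement i, measured in units of MPI_DOUBLE.
  std::vector<MPI_Datatype> types;
  for (int i{0}; i < size; ++i) {
    const int length[1] = {1};
    const int displacement[1] = {i};
    MPI_Datatype new_type;
    MPI_Type_indexed(1, length, displacement, MPI_DOUBLE, &new_type);
    MPI_Type_commit(&new_type);
    types.push_back(new_type);
  }

  // One element of types[i] per peer; the byte displacements are all zero
  // because each datatype already encodes its own offset.
  std::vector<int> counts(size, 1);
  std::vector<int> displacements(size, 0);

  // In-place exchange: the send arguments are ignored and v is both the
  // send and the receive buffer. The crash occurs inside this call.
  MPI_Alltoallw(MPI_IN_PLACE, nullptr, nullptr, nullptr, v.data(), counts.data(),
                displacements.data(), types.data(), MPI_COMM_WORLD);

  MPI_Finalize();
}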

The program above is essentially a standard in-place MPI_Alltoall expressed through MPI_Alltoallw; it serves only to demonstrate the crash. Running the program with

$ mpirun -np 4 debug 

yields

[tron:93567] pmix_mca_base_component_repository_open: unable to open mca_pnet_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[tron:93567] pmix_mca_base_component_repository_open: unable to open mca_pnet_test: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
free(): invalid pointer
[tron:93571] *** Process received signal ***
[tron:93571] Signal: Aborted (6)
[tron:93571] Signal code:  (-6)
double free or corruption (out)
[tron:93572] *** Process received signal ***
[tron:93572] Signal: Aborted (6)
[tron:93572] Signal code:  (-6)
[tron:93572] [ 0] double free or corruption (out)
[tron:93574] *** Process received signal ***
[tron:93574] Signal: Aborted (6)
[tron:93574] Signal code:  (-6)
[tron:93574] [ 0] [tron:93571] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fcb19341210]
[tron:93571] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f065cef2210]
[tron:93574] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fb7f6a2b210]
[tron:93572] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fb7f6a2b18b]
[tron:93572] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fcb1934118b]
[tron:93571] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f065cef218b]
[tron:93574] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fcb19320859]
[tron:93571] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fb7f6a0a859]
[tron:93572] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f065ced1859]
[tron:93574] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7fcb1938b3ee]
[tron:93571] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7f065cf3c3ee]
[tron:93574] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7fb7f6a753ee]
[tron:93572] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7fcb1939347c]
[tron:93571] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x99cac)[0x7fcb19394cac]
[tron:93571] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7fb7f6a7d47c]
[tron:93572] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7f065cf4447c]
[tron:93574] [ 5] /usr/local/lib/libopen-pal.so.40(opal_free+0x23)[0x7fcb190f5974]
[tron:93571] [ 7] /usr/local/lib/openmpi/mca_coll_basic.so(+0x473d)[0x7fcb1815073d]
[tron:93571] [ 8] /usr/local/lib/openmpi/mca_coll_basic.so(mca_coll_basic_alltoallw_intra+0x9b)[0x7fcb181507f1]
[tron:93571] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x9a120)[0x7f065cf46120]
[tron:93574] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x9a120)[0x7fb7f6a7f120]
[tron:93572] [ 6] /usr/local/lib/libmpi.so.40(PMPI_Alltoallw+0x5ad)[0x7fcb1979eebd]
[tron:93571] [10] debug(+0x1510)[0x556b2d686510]
[tron:93571] [11] /usr/local/lib/libopen-pal.so.40(opal_free+0x23)[0x7fb7f67df974]
[tron:93572] [ 7] /usr/local/lib/openmpi/mca_coll_basic.so(+0x473d)[0x7fb7f443873d]
/usr/local/lib/libopen-pal.so.40(opal_free+0x23)[0x7f065cca6974]
[tron:93574] [ 7] /usr/local/lib/openmpi/mca_coll_basic.so(+0x473d)[0x7f06568fd73d]
[tron:93574] [ 8] /usr/local/lib/openmpi/mca_coll_basic.so(mca_coll_basic_alltoallw_intra+0x9b)[0x7f06568fd7f1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fcb193220b3]
[tron:93571] [12] debug(+0x122e)[0x556b2d68622e]
[tron:93571] *** End of error message ***
[tron:93572] [ 8] /usr/local/lib/openmpi/mca_coll_basic.so(mca_coll_basic_alltoallw_intra+0x9b)[0x7fb7f44387f1]
[tron:93572] [ 9] [tron:93574] [ 9] /usr/local/lib/libmpi.so.40(PMPI_Alltoallw+0x5ad)[0x7fb7f6e88ebd]
[tron:93572] [10] debug(+0x1510)[0x55a7339ab510]
[tron:93572] [11] /usr/local/lib/libmpi.so.40(PMPI_Alltoallw+0x5ad)[0x7f065d34febd]
[tron:93574] [10] debug(+0x1510)[0x55b4c76f8510]
[tron:93574] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fb7f6a0c0b3]
[tron:93572] [12] debug(+0x122e)[0x55a7339ab22e]
[tron:93572] *** End of error message ***
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f065ced30b3]
[tron:93574] [12] debug(+0x122e)[0x55b4c76f822e]
[tron:93574] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node tron exited on signal 6 (Aborted).
--------------------------------------------------------------------------
