System call failure: unlink during MPI_Finalize() #9905

Open

@afborchert

Description

Background information

What version of Open MPI are you using?

4.1.2

Describe how Open MPI was installed

Downloaded from https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.bz2, unpacked, and built with the following script:

cd openmpi-4.1.2 &&
env \
   FC=/opt/ulm/dublin/cmd/gfortran \
   CC=/opt/ulm/dublin/cmd/gcc \
   CXX=/opt/ulm/dublin/cmd/g++ \
./configure --prefix=/opt/ulm/dublin \
   --disable-silent-rules \
   --libdir=/opt/ulm/dublin/lib/amd64 \
   --enable-wrapper-rpath &&
make DESTDIR=/home/pkgdev/dublin/openmpi/proto install

Please describe the system on which you are running

  • Operating system/version: Solaris 11.4
  • Computer hardware: Intel Xeon CPU E5-2650 v4
  • Network type: 1 Gbit/s Ethernet

Details of the problem

Intermittently, even the simplest MPI applications, run locally over shared memory, fail at MPI_Finalize() with errors like the following:

--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  theon
  System call: unlink(2) /tmp/ompi.theon.120/pid.12048/1/vader_segment.theon.120.675b0001.1
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
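For reference, the path in the message points into Open MPI's per-job session directory under /tmp; the vader_segment files there back the shared-memory (vader) BTL, and ENOENT means the file is already gone by the time MPI_Finalize() tries to unlink it. While a job is still running, the segment files can be listed, e.g. as follows (a sketch assuming the default session directory layout shown in the message above; the pid component differs on every run):

# list the shared-memory backing files of a currently running job
ls -l /tmp/ompi.theon.120/pid.*/1/vader_segment.*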

This happens even for a trivial test program like the following:

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int nof_processes; MPI_Comm_size(MPI_COMM_WORLD, &nof_processes);

    if (rank) {
        /* every non-root process sends its rank to rank 0 */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        /* rank 0 collects one message from each other process */
        for (int i = 0; i + 1 < nof_processes; ++i) {
            MPI_Status status;
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE,
                0, MPI_COMM_WORLD, &status);
            int count;
            MPI_Get_count(&status, MPI_INT, &count);
            if (count == 1) {
                printf("%d\n", msg);
            }
        }
    }

    MPI_Finalize();
}

Just run mpirun multiple times and it will eventually fail:

theon$ mpicc -o mpi-test mpi-test.c
theon$ mpirun -np 4 mpi-test
3
1
2
theon$ mpirun -np 4 mpi-test
3
2
1
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  theon
  System call: unlink(2) /tmp/ompi.theon.120/pid.13340/1/vader_segment.theon.120.7c570001.1
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
theon$ 
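Since the failure is only intermittent, it can also be driven out with a small loop that reruns the test until the error text shows up (a minimal sketch, assuming a POSIX shell; the iteration limit of 100 is arbitrary):

i=0
while [ $i -lt 100 ]; do
    # grep the combined output, since the job itself may still exit 0
    if mpirun -np 4 mpi-test 2>&1 | grep -q 'system call failed'; then
        echo "failure reproduced after $((i + 1)) runs"
        break
    fi
    i=$((i + 1))
done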

We had the very same problem with Open MPI 4.1.1, but no such problems with Open MPI 2.1.6.
