OPAL ERROR: Error in file %s/opal/mca/pmix/s1/pmix_s1.c at line 578 #5593

@morrisonlevi

Background information

I am trying to get OpenMPI and Slurm working with PMIx, but have been hitting issues in various places.

What versions of software are you using?

  • OpenMPI 3.1.2
  • Slurm 17.11.9
  • PMIx 2.1.3

All three of these were built from source.

PMIx is installed to /usr/local/pmix/2.1 and there is an ld.so.conf.d file that includes its libdir in the ld.so.cache.

Slurm is installed to /usr/include/slurm, /usr/lib64/slurm, etc.

OpenMPI is configured with:

'--prefix=/apps/openmpi/3.1.2/gcc-7.3.0_cuda-9.2.88_pmix-2.1.3'
'--enable-wrapper-rpath=no'
'--enable-wrapper-runpath=no'
'--with-slurm'
'--with-pmix=/usr/local/pmix/2.1'
'--with-pmi=/usr/local/pmix/2.1'
'--with-pmi-libdir=/usr/local/pmix/2.1/lib'
'--with-cuda=/apps/cuda/9.2.88'
'--with-libevent=/usr'
'CPPFLAGS=-I/usr/local/pmix/2.1/include'
'LDFLAGS=-L/usr/local/pmix/2.1/lib'

Side note: I was unsure what to do for --with-pmi -- I have tried various other settings and this has been the version that has worked the best so far.

Side note 2: I got errors if I did not specify -I and -L in the configure flags -- that shouldn't be necessary, should it?
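A minimal sketch of one way to see what this build and the Slurm install each report for PMI/PMIx support. It assumes srun and ompi_info are on PATH and that ompi_info prints one "MCA pmix: ..." line per built component; list_pmi_support is a hypothetical helper name, not part of Slurm or OpenMPI.

```shell
#!/bin/sh
# Sketch: show which PMI/PMIx options each side reports.
# list_pmi_support is a hypothetical helper name used only for illustration.
list_pmi_support() {
    # Slurm: list the plugin types accepted by --mpi=...
    if command -v srun >/dev/null 2>&1; then
        echo "== srun --mpi=list =="
        srun --mpi=list 2>&1
    else
        echo "srun not found in PATH"
    fi

    # OpenMPI: keep only the lines for the pmix framework components
    if command -v ompi_info >/dev/null 2>&1; then
        echo "== ompi_info (pmix components) =="
        ompi_info | grep -i 'MCA pmix' || echo "no pmix components reported"
    else
        echo "ompi_info not found in PATH"
    fi
}

list_pmi_support
```

Comparing the two lists should show whether the component named in the error (s1) was built alongside any others, and which --mpi= values this Slurm install accepts.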

Please describe the system on which you are running

  • Operating system/version: RHEL 7
  • Computer hardware: Probably N/A
  • Network type: Probably N/A

Details of the problem

If I launch an MPI program with srun --mpi=openmpi ... I get this error:

OPAL ERROR: Error in file %s/opal/mca/pmix/s1/pmix_s1.c at line 578

Which can be seen here. Then, instead of one run of the program using n ranks, there are n independent runs of the program. In this example there are 4 tasks, all on one node; the issue manifests in other configurations too:

$ srun -n 4 --mpi=openmpi $prog
[m7-1-2:08013] OPAL ERROR: Error in file /apps/src/openmpi/src/openmpi-3.1.2/opal/mca/pmix/s1/pmix_s1.c at line 578
[m7-1-2:08014] OPAL ERROR: Error in file /apps/src/openmpi/src/openmpi-3.1.2/opal/mca/pmix/s1/pmix_s1.c at line 578
[m7-1-2:08012] OPAL ERROR: Error in file /apps/src/openmpi/src/openmpi-3.1.2/opal/mca/pmix/s1/pmix_s1.c at line 578
[m7-1-2:08015] OPAL ERROR: Error in file /apps/src/openmpi/src/openmpi-3.1.2/opal/mca/pmix/s1/pmix_s1.c at line 578
6.52178
8.49445
8.49241
8.49539

The last four numbers are the outputs from each of the four runs.
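To compare what each task inherits under the two launch modes, a small script like the following could be run in place of $prog. This is only a sketch: describe_task is a hypothetical name, SLURM_PROCID and SLURM_NTASKS are the usual Slurm per-task variables, and it assumes a PMIx-capable launcher exports PMIX_RANK while a PMI-style launcher may export PMI_RANK.

```shell
#!/bin/sh
# Sketch: describe one task's launch environment from inherited variables.
# describe_task is a hypothetical helper name used only for illustration.
describe_task() {
    # Any variable the launcher did not set is reported as "unset"
    echo "task=${SLURM_PROCID:-unset} ntasks=${SLURM_NTASKS:-unset} PMIX_RANK=${PMIX_RANK:-unset} PMI_RANK=${PMI_RANK:-unset}"
}

describe_task
```

Saved as, say, describe_task.sh, it could be launched once with srun -n 4 --mpi=openmpi sh describe_task.sh and once with --mpi=pmix. If the rank variables differ between the two modes, that may help explain why each process starts on its own.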

Doing srun --mpi=pmix does not exhibit this issue:

$ srun -n 4 --mpi=pmix $prog
2.94764

I am unsure how to debug this; what further information should I provide?
