Description
Background information
I am trying to get OpenMPI and Slurm working with PMIx, but have been hitting issues in various places.
What versions of software are you using?
- OpenMPI 3.1.2
- Slurm 17.11.9
- PMIx 2.1.3
All three of these were built from source.
PMIx is installed to /usr/local/pmix/2.1 and there is an ld.so.conf.d file that includes its libdir in the ld.so.cache.
Slurm is installed to /usr/include/slurm, /usr/lib64/slurm, etc.
OpenMPI is configured with:
'--prefix=/apps/openmpi/3.1.2/gcc-7.3.0_cuda-9.2.88_pmix-2.1.3'
'--enable-wrapper-rpath=no'
'--enable-wrapper-runpath=no'
'--with-slurm'
'--with-pmix=/usr/local/pmix/2.1'
'--with-pmi=/usr/local/pmix/2.1'
'--with-pmi-libdir=/usr/local/pmix/2.1/lib'
'--with-cuda=/apps/cuda/9.2.88'
'--with-libevent=/usr'
'CPPFLAGS=-I/usr/local/pmix/2.1/include'
'LDFLAGS=-L/usr/local/pmix/2.1/lib'
Side note: I was unsure what to do for --with-pmi -- I have tried various other settings and this has been the version that has worked the best so far.
Side note 2: I got errors if I did not specify -I and -L in the configure flags -- that shouldn't be necessary, should it?
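A rough way to check what configure should be able to find under the PMIx prefix is sketched below. `report_path` is a hypothetical helper written only for this illustration, and the prefix is the one from the configure line above. Whether `pmi.h` and `pmi2.h` are present likely depends on whether PMIx was built with its PMI-1/PMI-2 compatibility layer, so a "missing" line for those is not necessarily a problem.

```shell
#!/bin/sh
# Assumed prefix, copied from the --with-pmix flag above.
PMIX_PREFIX=/usr/local/pmix/2.1

# Hypothetical helper: report whether a path exists.
report_path() {
    if [ -e "$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# Header and library that --with-pmix is expected to need.
report_path "$PMIX_PREFIX/include/pmix.h"
report_path "$PMIX_PREFIX/lib/libpmix.so"

# PMI-1/PMI-2 compatibility headers (may be absent, see note above).
report_path "$PMIX_PREFIX/include/pmi.h"
report_path "$PMIX_PREFIX/include/pmi2.h"

# Show what the dynamic linker cache knows about PMIx, if ldconfig is available.
if command -v ldconfig >/dev/null 2>&1; then
    ldconfig -p | grep -i pmix || echo "no pmix entries in ld.so.cache"
fi
```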
Please describe the system on which you are running
- Operating system/version: RHEL 7
- Computer hardware: Probably N/A
- Network type: Probably N/A
Details of the problem
If I launch an MPI program with srun --mpi=openmpi ... I get this error:
OPAL ERROR: Error in file %s/opal/mca/pmix/s1/pmix_s1.c at line 578
Which can be seen here. Then, instead of the program using n ranks there are n independent runs of the program. In this example there are 4 tasks all on one node; the issue manifests in other configurations too:
$ srun -n 4 --mpi=openmpi $prog
[m7-1-2:08013] OPAL ERROR: Error in file /apps/src/openmpi/src/openmpi-3.1.2/opal/mca/pmix/s1/pmix_s1.c at line 578
[m7-1-2:08014] OPAL ERROR: Error in file /apps/src/openmpi/src/openmpi-3.1.2/opal/mca/pmix/s1/pmix_s1.c at line 578
[m7-1-2:08012] OPAL ERROR: Error in file /apps/src/openmpi/src/openmpi-3.1.2/opal/mca/pmix/s1/pmix_s1.c at line 578
[m7-1-2:08015] OPAL ERROR: Error in file /apps/src/openmpi/src/openmpi-3.1.2/opal/mca/pmix/s1/pmix_s1.c at line 578
6.52178
8.49445
8.49241
8.49539
The last four numbers are the outputs from each of the four runs.
Doing srun --mpi=pmix does not exhibit this issue:
$ srun -n 4 --mpi=pmix $prog
2.94764
I am unsure how to debug this; what further information should I provide?
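As a possible starting point, the commands below collect some basic facts about both sides of the launch: which MPI plugin types Slurm offers, which default it is configured with, and which pmix components Open MPI was built with. This is only a sketch; `run_if_present` is a hypothetical wrapper added here so the script skips tools that are not on PATH, and no particular output is implied.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command only if its executable is on PATH.
# Skip messages go to stderr so they are not swallowed by pipes.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not found on PATH" >&2
    fi
}

# MPI plugin types supported by this Slurm installation.
run_if_present srun --mpi=list

# Default MPI plugin type from the Slurm configuration.
run_if_present scontrol show config | grep -i mpidefault || true

# pmix-related MCA components compiled into this Open MPI build.
run_if_present ompi_info | grep -i pmix || true
```

Including the output of these in the report would show whether the s1 component named in the error message is among the built components.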