## Description
Please submit all the information below so that we can understand the working environment that is the context for your question.
## Background information
### What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
4.1.3
### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed with EasyBuild (see the prefix in the output below); `ompi_info` reports:

```
$ ompi_info
Package: Open MPI s_hpcssw@amsdc2-n-sv0040 Distribution
Open MPI: 4.1.3
Open MPI repo revision: v4.1.3
Open MPI release date: Mar 31, 2022
Open RTE: 4.1.3
Open RTE repo revision: v4.1.3
Open RTE release date: Mar 31, 2022
OPAL: 4.1.3
OPAL repo revision: v4.1.3
OPAL release date: Mar 31, 2022
MPI API: 3.1.0
Ident string: 4.1.3
Prefix: /glb/apps/hpc/EasyBuild/software/rhel/7/OpenMPI/4.1.3-GCC-10.3.0-CUDA-11.6.0
Configured architecture: x86_64-pc-linux-gnu
Configure host: amsdc2-n-sv0040
Configured by: s_hpcssw
Configured on: Tue May 3 15:18:59 UTC 2022
Configure host: amsdc2-n-sv0040
Configure command line: '--prefix=/glb/apps/hpc/EasyBuild/software/rhel/7/OpenMPI/4.1.3-GCC-10.3.0-CUDA-11.6.0'
  '--build=x86_64-pc-linux-gnu'
  '--host=x86_64-pc-linux-gnu'
  '--with-cuda=/glb/apps/hpc/EasyBuild/software/rhel/7/CUDA/11.6.0'
  '--enable-mpirun-prefix-by-default'
  '--enable-shared'
  '--with-hwloc=/glb/apps/hpc/EasyBuild/software/rhel/7/hwloc/2.4.1-GCCcore-10.3.0'
  '--with-libevent=/glb/apps/hpc/EasyBuild/software/rhel/7/libevent/2.1.12-GCCcore-10.3.0'
  '--with-ofi=/glb/apps/hpc/EasyBuild/software/rhel/7/libfabric/1.13.0-GCCcore-10.3.0'
  '--with-pmix=/glb/apps/hpc/EasyBuild/software/rhel/7/PMIx/3.2.3-GCCcore-10.3.0'
  '--with-ucx=/glb/apps/hpc/EasyBuild/software/rhel/7/UCX/1.10.0-GCCcore-10.3.0'
  '--without-verbs'
```
### If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`
N/A
### Please describe the system on which you are running
- Operating system/version: RHEL 8.9
- Computer hardware: AMD Milan
- Network type: InfiniBand
## Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

```
shell$ mpirun -n 2 ./hello_world
```
When I launch with srun without specifying a PMI plugin, or with `--mpi=pmi2`, the jobs run fine. When I launch with `--mpi=pmix`, the job fails, or rather just hangs:
```
[indkwf@houcy1-n-sv0079 ~]$ sbatch -A cldrn -p pt -N 1 -n 8 --wrap="srun -vv --mpi=pmix whereami "
sbatch: The script used is #!/bin/sh
sbatch: # This script was created by sbatch --wrap.
sbatch: srun -vv --mpi=pmix whereami
sbatch: for job submission
Submitted batch job 182280
[indkwf@houcy1-n-sv0079 ~]$ cat slurm-182280.out
srun: defined options
srun: -------------------- --------------------
srun: (null) : houcy1-n-cp337a30
srun: jobid : 182280
srun: job-name : wrap
srun: mem-per-cpu : 1000
srun: mpi : pmix
srun: nodes : 1
srun: ntasks : 8
srun: verbose : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=18446744073709551615
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=8388608000
srun: debug: propagating RLIMIT_NOFILE=65535
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: auth/munge: init: Munge authentication plugin loaded
srun: debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
srun: jobid 182280: nodes(1):`houcy1-n-cp337a30', cpu counts: 8(x1)
srun: debug: requesting job 182280, user 58150, nodes 1 including ((null))
srun: debug: cpus 8, tasks 8, name whereami, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v3: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 60150
srun: debug: mpi/pmix_v3: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:281: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:353: Start abort thread
srun: debug: initialized stdio listening socket, port 60152
srun: debug: Started IO server thread (22505038939904)
srun: debug: Entering _launch_tasks
srun: launching StepId=182280.0 on host houcy1-n-cp337a30, 8 tasks: [0-7]
srun: route/default: init: route default plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
```
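In case a self-contained reproducer helps: `whereami` is a site-specific utility, so below is a minimal MPI program (an assumed stand-in, not the actual binary) that just reports each rank's host. If the problem is in the PMIx launch/wire-up path rather than in the application, this should show the same behavior: it completes under `--mpi=pmi2` but hangs under `--mpi=pmix`.

```c
/*
 * Minimal stand-in for the site-specific "whereami" utility (assumption:
 * whereami simply reports where each rank is running). Build with the
 * mpicc from the same OpenMPI/4.1.3 module.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);   /* if PMIx wire-up fails, the hang is expected here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```

Compile with `mpicc whereami_repro.c -o whereami_repro` (the file name is just an example) and substitute it for `whereami` in the `sbatch --wrap` line above.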