OpenMPI 4.1.3 with pmix #12381

Open

@kvdheeraj84

Description

Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

4.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

```
$ ompi_info
Package: Open MPI s_hpcssw@amsdc2-n-sv0040 Distribution
Open MPI: 4.1.3
Open MPI repo revision: v4.1.3
Open MPI release date: Mar 31, 2022
Open RTE: 4.1.3
Open RTE repo revision: v4.1.3
Open RTE release date: Mar 31, 2022
OPAL: 4.1.3
OPAL repo revision: v4.1.3
OPAL release date: Mar 31, 2022
MPI API: 3.1.0
Ident string: 4.1.3
Prefix: /glb/apps/hpc/EasyBuild/software/rhel/7/OpenMPI/4.1.3-GCC-10.3.0-CUDA-11.6.0
Configured architecture: x86_64-pc-linux-gnu
Configure host: amsdc2-n-sv0040
Configured by: s_hpcssw
Configured on: Tue May 3 15:18:59 UTC 2022
Configure host: amsdc2-n-sv0040
Configure command line: '--prefix=/glb/apps/hpc/EasyBuild/software/rhel/7/OpenMPI/4.1.3-GCC-10.3.0-CUDA-11.6.0'
'--build=x86_64-pc-linux-gnu'
'--host=x86_64-pc-linux-gnu'
'--with-cuda=/glb/apps/hpc/EasyBuild/software/rhel/7/CUDA/11.6.0'
'--enable-mpirun-prefix-by-default'
'--enable-shared'
'--with-hwloc=/glb/apps/hpc/EasyBuild/software/rhel/7/hwloc/2.4.1-GCCcore-10.3.0'
'--with-libevent=/glb/apps/hpc/EasyBuild/software/rhel/7/libevent/2.1.12-GCCcore-10.3.0'
'--with-ofi=/glb/apps/hpc/EasyBuild/software/rhel/7/libfabric/1.13.0-GCCcore-10.3.0'
'--with-pmix=/glb/apps/hpc/EasyBuild/software/rhel/7/PMIx/3.2.3-GCCcore-10.3.0'
'--with-ucx=/glb/apps/hpc/EasyBuild/software/rhel/7/UCX/1.10.0-GCCcore-10.3.0'
'--without-verbs'
```
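Since this build links Open MPI against an external PMIx 3.2.3 while srun loads its pmix_v3 plugin, a bare PMIx client that bypasses Open MPI entirely can help narrow down where the hang sits. The sketch below is illustrative only (the file name, build line, and `PMIX_PREFIX` variable are assumptions, not part of the site installation); if even this hangs under `srun --mpi=pmix`, the problem is between Slurm's plugin and the PMIx library rather than in Open MPI itself.

```c
/* pmix_check.c - minimal PMIx client sketch (illustrative only).
 * Build against the external PMIx, paths are assumptions:
 *   gcc pmix_check.c -I$PMIX_PREFIX/include -L$PMIX_PREFIX/lib -lpmix -o pmix_check
 */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_status_t rc;

    /* Connect to the PMIx server that srun's pmix plugin should have started. */
    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }
    printf("PMIx client up: nspace=%s rank=%u\n", myproc.nspace, myproc.rank);

    /* Tear down the connection to the PMIx server. */
    rc = PMIx_Finalize(NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Finalize failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }
    return 0;
}
```

Launched the same way as the job below, e.g. `srun --mpi=pmix ./pmix_check`.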

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: RHEL 8.9
  • Computer hardware: AMD Milan
  • Network type: InfiniBand

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

```
shell$ mpirun -n 2 ./hello_world
```

When I run with srun without any PMI, or with --mpi=pmi2, the jobs run fine.
When I run with --mpi=pmix, the job fails, or rather just hangs.
```
[indkwf@houcy1-n-sv0079 ~]$ sbatch -A cldrn -p pt -N 1 -n 8 --wrap="srun -vv --mpi=pmix whereami "
sbatch: The script used is #!/bin/sh
sbatch: # This script was created by sbatch --wrap.
sbatch: srun -vv --mpi=pmix whereami
sbatch: for job submission
Submitted batch job 182280
[indkwf@houcy1-n-sv0079 ~]$ cat slurm-182280.out
srun: defined options
srun: -------------------- --------------------
srun: (null) : houcy1-n-cp337a30
srun: jobid : 182280
srun: job-name : wrap
srun: mem-per-cpu : 1000
srun: mpi : pmix
srun: nodes : 1
srun: ntasks : 8
srun: verbose : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=18446744073709551615
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=8388608000
srun: debug: propagating RLIMIT_NOFILE=65535
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: auth/munge: init: Munge authentication plugin loaded
srun: debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
srun: jobid 182280: nodes(1):`houcy1-n-cp337a30', cpu counts: 8(x1)
srun: debug: requesting job 182280, user 58150, nodes 1 including ((null))
srun: debug: cpus 8, tasks 8, name whereami, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v3: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 60150
srun: debug: mpi/pmix_v3: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:281: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:353: Start abort thread
srun: debug: initialized stdio listening socket, port 60152
srun: debug: Started IO server thread (22505038939904)
srun: debug: Entering _launch_tasks
srun: launching StepId=182280.0 on host houcy1-n-cp337a30, 8 tasks: [0-7]
srun: route/default: init: route default plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
```
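For a self-contained reproducer: `whereami` here is just a small placement utility, so any minimal MPI program should hit the same PMIx wire-up path inside MPI_Init. A sketch along these lines (illustrative; the file name is arbitrary):

```c
/* hello_world.c - minimal MPI reproducer sketch (illustrative).
 * Build:  mpicc hello_world.c -o hello_world
 * Run:    srun --mpi=pmix ./hello_world
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);   /* PMI/PMIx wire-up with the launcher happens here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```

Built with the corresponding mpicc and launched with `srun --mpi=pmix ./hello_world`, this should show the same behaviour as the job above, while `--mpi=pmi2` completes normally.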
