Description
Dear Support,
I am using Open MPI version 4.1.5, installed with Spack:
[root@login-cluster-1 hello-world]# mpirun --version
mpirun (Open MPI) 4.1.5
If I run a simple hello-world program using mpirun, everything works fine; my problem is that the job fails when I launch it with srun (the Slurm scheduler with PMIx) across multiple nodes. Here is the working mpirun case:
[root@login-cluster-1 hello-world]# mpirun --allow-run-as-root -n 16 --hostfile hostfile --prefix /scratch/hussein/spack/myenv /scratch/hussein/hello-world/hello_mpi
Hello from process 12 of 16
Hello from process 3 of 16
Hello from process 14 of 16
Hello from process 5 of 16
Hello from process 13 of 16
Hello from process 15 of 16
Hello from process 6 of 16
Hello from process 11 of 16
Hello from process 4 of 16
Hello from process 9 of 16
Hello from process 10 of 16
Hello from process 1 of 16
Hello from process 2 of 16
Hello from process 8 of 16
Hello from process 0 of 16
Hello from process 7 of 16
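For reference, hello_mpi is essentially the standard MPI hello world, something along these lines (a minimal sketch, built with mpicc hello_mpi.c -o hello_mpi; the exact source may differ slightly):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* initialize the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}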
Running srun on a single node with PMIx also works fine:
[root@login-cluster-1 hello-world]# srun -N1 -n 16 --mpi=pmix /scratch/hussein/hello-world/hello_mpi
Hello from process 1 of 16
Hello from process 15 of 16
Hello from process 0 of 16
Hello from process 2 of 16
Hello from process 3 of 16
Hello from process 4 of 16
Hello from process 5 of 16
Hello from process 8 of 16
Hello from process 9 of 16
Hello from process 10 of 16
Hello from process 11 of 16
Hello from process 12 of 16
Hello from process 13 of 16
Hello from process 14 of 16
Hello from process 6 of 16
Hello from process 7 of 16
The PMIx versions supported by Slurm:
[root@login-cluster-1 hello-world]# srun --mpi=list
MPI plugin types are...
none
pmi2
pmix
specific pmix plugin versions available: pmix_v3
An example of running hello_mpi on two nodes, which fails:
[root@login-cluster-1 hello-world]# srun -vv -N2 -n 16 --mpi=pmix /scratch/hussein/hello-world/hello_mpi
srun: defined options
srun: -------------------- --------------------
srun: mpi : pmix
srun: nodes : 2
srun: ntasks : 16
srun: verbose : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=18446744073709551615
srun: debug: propagating RLIMIT_NOFILE=1024
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 38699
srun: debug: Entering _msg_thr_internal
srun: debug: auth/munge: init: Munge authentication plugin loaded
srun: debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: Nodes compute001-cluster-1,compute002-cluster-1 are ready for job
srun: jobid 14: nodes(2):`compute001-cluster-1,compute002-cluster-1', cpu counts: 15(x1),1(x1)
srun: debug: requesting job 14, user 0, nodes 2 including ((null))
srun: debug: cpus 16, tasks 16, name hello_mpi, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v3: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 36317
srun: debug: mpi/pmix_v3: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:281: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:353: Start abort thread
srun: debug: initialized stdio listening socket, port 46811
srun: debug: Started IO server thread (140168583353920)
srun: debug: Entering _launch_tasks
srun: launching StepId=14.0 on host compute001-cluster-1, 15 tasks: [0-14]
srun: launching StepId=14.0 on host compute002-cluster-1, 1 tasks: 15
srun: route/default: init: route default plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Complete StepId=14.0+0 received
srun: launch/slurm: launch_p_step_launch: StepId=14.0 aborted before step completely launched.
srun: error: task 15 launch failed: Unspecified error
srun: launch/slurm: _task_start: Node compute002-cluster-1, 1 tasks started
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Complete StepId=14.0+0 received
srun: launch/slurm: _task_start: Node compute001-cluster-1, 15 tasks started
slurmstepd: error: *** STEP 14.0 ON compute001-cluster-1 CANCELLED AT 2023-03-07T15:21:45 ***
srun: launch/slurm: _task_finish: Received task exit notification for 15 tasks of StepId=14.0 (status=0x0009).
srun: error: compute001-cluster-1: tasks 0-14: Killed
srun: launch/slurm: _step_signal: Terminating StepId=14.0
srun: debug: task 0 done
srun: debug: task 1 done
srun: debug: task 2 done
srun: debug: task 3 done
srun: debug: task 4 done
srun: debug: task 5 done
srun: debug: task 6 done
srun: debug: task 7 done
srun: debug: task 8 done
srun: debug: task 9 done
srun: debug: task 10 done
srun: debug: task 11 done
srun: debug: task 12 done
srun: debug: task 13 done
srun: debug: task 14 done
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v3: _conn_readable: (null) [0]: pmixp_agent.c:105: false, shutdown
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:355: Abort thread exit
srun: debug: Leaving _msg_thr_internal
Thanks for your support.