Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v4.1.x
commit 6c1ecd00767b54c70e524d9d551db1f132c1fca8 (HEAD -> v4.1.x, origin/v4.1.x)
Date: Thu Jan 18 12:06:31 2024 -0500
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone:
git clone --recurse-submodules -j4 https://github.com/open-mpi/ompi.git --branch v4.1.x
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
(no output) Oddly, git submodule status exits 0 but prints nothing; of course, there are no submodules in v4.1.x!
Please describe the system on which you are running
- Operating system/version: Ubuntu 20.04, Rocky 9, Others
- Computer hardware: Graviton 3 (AWS *7g instance types)
- Network type: None
Details of the problem
Started in https://gitlab.com/eessi/support/-/issues/41#note_1738867500
There are some applications that crash or hang when run on c7g.4xlarge. The EasyBuild configurations include a patch to always build against CUDA; however, I can reproduce hangs with a fresh build (no EB patches) against the Debian-provided CUDA (no crashes yet).
The symptom I've been able to reproduce is a hang in the smcuda btl, so I must configure with CUDA support, and ofi must be excluded either at configure time or at run time. Note that we are not using CUDA memory at all, only the smcuda btl.
./configure --with-cuda --prefix=/fsx/tmp/ompi-with-cuda --enable-debug
make -j && make -j install
module load <my-new-build>
mpirun --mca btl ^ofi --mca mtl ^ofi -n 64 /fsx/lrbison/eessi/mpi-benchmarks/src_c/IMB-MPI1 alltoall -npmin 64
I'm compiling with gcc 12.3.0.
In fully loaded runs (i.e., 64 ranks on hpc7g), IMB's allgather or alltoall test hangs relatively frequently (roughly 10% of runs?). On a lightly loaded node (6 of 64 cores) it may take as many as 300 executions to hit a hang.
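For what it's worth, the pattern IMB exercises here is just repeated collective exchanges followed by a barrier. Below is a minimal standalone sketch of that pattern (my own illustrative code, not IMB, and I have not verified it reproduces the hang as reliably as the benchmark); it can be built with mpicc and launched with the same mpirun line as above:

```c
/* Hypothetical minimal reproducer sketch (not IMB itself): repeated
 * MPI_Alltoall exchanges each followed by MPI_Barrier, mirroring the
 * pattern that hangs above. Buffer size and iteration count are arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;   /* ints per peer, arbitrary choice */
    int *sendbuf = malloc((size_t)size * count * sizeof(int));
    int *recvbuf = malloc((size_t)size * count * sizeof(int));
    for (int i = 0; i < size * count; i++)
        sendbuf[i] = rank;

    for (int iter = 0; iter < 1000; iter++) {
        MPI_Alltoall(sendbuf, count, MPI_INT,
                     recvbuf, count, MPI_INT, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);   /* the barrier is where the hang shows up */
        if (rank == 0 && iter % 100 == 0)
            printf("iteration %d done\n", iter);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```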
A backtrace looks like this:
#0 0x0000ffff84320a14 in sm_fifo_read (fifo=0xffff7010d480) at /dev/shm/ompi/opal/mca/btl/smcuda/btl_smcuda.h:315
#1 0x0000ffff84322ce4 in mca_btl_smcuda_component_progress () at btl_smcuda_component.c:1036
#2 0x0000ffff863b5aa4 in opal_progress () at runtime/opal_progress.c:231
#3 0x0000ffff867c3258 in sync_wait_st (sync=0xffffff8f7cf0) at ../opal/threads/wait_sync.h:83
#4 0x0000ffff867c3c50 in ompi_request_default_wait_all (count=5, requests=0x1f7d4708, statuses=0x0) at request/req_wait.c:234
#5 0x0000ffff8688c638 in ompi_coll_base_barrier_intra_basic_linear (comm=0x1f7356d0, module=0x1f7c4d10) at base/coll_base_barrier.c:366
#6 0x0000ffff6fa2dc28 in ompi_coll_tuned_barrier_intra_do_this (comm=0x1f7356d0, module=0x1f7c4d10, algorithm=1, faninout=0, segsize=0)
at coll_tuned_barrier_decision.c:99
#7 0x0000ffff6fa249f8 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x1f7356d0, module=0x1f7c4d10) at coll_tuned_decision_fixed.c:500
#8 0x0000ffff867e6c54 in PMPI_Barrier (comm=0x1f7356d0) at pbarrier.c:66
#9 0x000000000040e22c in IMB_alltoall ()
I find all ranks sitting in a barrier, and every FIFO read comes up with SM_FIFO_FREE, yet they are all waiting on some completion. To me this means a message was overwritten, missed, or dropped.
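For context on why I read it that way: the smcuda shared-memory path delivers fragments through per-receiver FIFOs that the reader polls from opal_progress(). The sketch below is a heavily simplified, hypothetical illustration of that kind of FIFO (it is not the actual btl_smcuda code; apart from the SM_FIFO_FREE sentinel name, everything here is mine). It shows how a single overwritten or improperly published write leaves the consumer polling SM_FIFO_FREE forever while the matching completion never fires, which is the state observed in the backtrace:

```c
/* Highly simplified illustration (NOT the actual btl_smcuda code) of a
 * shared-memory FIFO, to show the failure mode I suspect: if a producer's
 * write of a slot is lost, overwritten, or not properly published to the
 * consumer, the reader keeps seeing "free" slots even though a fragment
 * was enqueued, and every rank spins in opal_progress() forever. */
#include <stdatomic.h>
#include <stddef.h>

#define SM_FIFO_FREE  ((void *) -1)   /* sentinel marking an empty slot */
#define FIFO_SIZE     128

typedef struct {
    _Atomic(void *) queue[FIFO_SIZE]; /* slots hold fragment pointers   */
    size_t          head;             /* consumer index (single reader) */
} sm_fifo_t;

/* Consumer side: poll the head slot; return a fragment or SM_FIFO_FREE. */
static void *fifo_read(sm_fifo_t *fifo)
{
    /* acquire load pairs with the producer's release store below */
    void *frag = atomic_load_explicit(&fifo->queue[fifo->head],
                                      memory_order_acquire);
    if (frag == SM_FIFO_FREE)
        return SM_FIFO_FREE;          /* nothing to do; caller keeps polling */

    atomic_store_explicit(&fifo->queue[fifo->head], SM_FIFO_FREE,
                          memory_order_relaxed);
    fifo->head = (fifo->head + 1) % FIFO_SIZE;
    return frag;
}

/* Producer side (tail management elided): the release store is what makes
 * the fragment visible to the reader.  If this publication step or the
 * slot-full check is broken, the enqueued fragment is effectively dropped
 * and the consumer polls SM_FIFO_FREE forever, exactly as seen above. */
static int fifo_write(sm_fifo_t *fifo, size_t tail, void *frag)
{
    if (atomic_load_explicit(&fifo->queue[tail],
                             memory_order_relaxed) != SM_FIFO_FREE)
        return -1;                    /* slot still occupied: queue full */
    atomic_store_explicit(&fifo->queue[tail], frag, memory_order_release);
    return 0;
}
```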
I have attempted to reproduce this on the v5.0.x branch; however, there smcuda deactivates itself when it cannot initialize an accelerator stream.