
btl smcuda hang in v4.1.5 #12270

Closed

Description

@lrbison

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.1.x

commit 6c1ecd00767b54c70e524d9d551db1f132c1fca8 (HEAD -> v4.1.x, origin/v4.1.x)
Date:   Thu Jan 18 12:06:31 2024 -0500

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone:

git clone --recurse-submodules -j4 https://github.com/open-mpi/ompi.git  --branch v4.1.x

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Oddly, git submodule status returns 0 but prints nothing; of course, there are no submodules in v4.1.x.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 20.04, Rocky 9, Others
  • Computer hardware: AWS Graviton 3 (c7g / hpc7g instance types)
  • Network type: None

Details of the problem

Started in https://gitlab.com/eessi/support/-/issues/41#note_1738867500

Some applications crash or hang when run on c7g.4xlarge. The EasyBuild configurations include a patch to always build against CUDA; however, I can reproduce the hangs with a fresh build, without the EasyBuild patches, against the Debian-provided CUDA (no crashes so far).

The symptom I've been able to reproduce is a hang in the smcuda btl, so the build must be configured with CUDA support, and ofi must be excluded either at configure time or at run time. Note that no CUDA memory is in use; traffic only goes through the smcuda btl.

# debug build with CUDA support
./configure --with-cuda --prefix=/fsx/tmp/ompi-with-cuda --enable-debug
make -j && make -j install
# load the new build and run IMB alltoall on 64 ranks with ofi excluded
module load <my-new-build>
mpirun --mca btl ^ofi --mca mtl ^ofi -n 64 /fsx/lrbison/eessi/mpi-benchmarks/src_c/IMB-MPI1 alltoall -npmin 64

I'm compiling with gcc 12.3.0.

On a fully loaded node (i.e. 64 ranks on hpc7g), a hang shows up relatively frequently, in roughly 10% of runs of IMB's allgather or alltoall tests. On a lightly loaded node (6 of 64 cores) it may take as many as 300 executions to hit a hang.
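
To catch a hang without babysitting the runs, a loop along these lines works; the timeout value and log file names are arbitrary choices rather than part of the original reproducer (timeout(1) exits with status 124 when it kills a run):

# sketch: repeat the reproducer until one run exceeds the timeout
for i in $(seq 1 300); do
    timeout 600 mpirun --mca btl ^ofi --mca mtl ^ofi -n 64 \
        /fsx/lrbison/eessi/mpi-benchmarks/src_c/IMB-MPI1 alltoall -npmin 64 \
        > imb-run-$i.log 2>&1
    if [ $? -eq 124 ]; then
        echo "run $i appears hung (killed by timeout)"
        break
    fi
done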

A backtrace looks like this:

#0  0x0000ffff84320a14 in sm_fifo_read (fifo=0xffff7010d480) at /dev/shm/ompi/opal/mca/btl/smcuda/btl_smcuda.h:315
#1  0x0000ffff84322ce4 in mca_btl_smcuda_component_progress () at btl_smcuda_component.c:1036
#2  0x0000ffff863b5aa4 in opal_progress () at runtime/opal_progress.c:231
#3  0x0000ffff867c3258 in sync_wait_st (sync=0xffffff8f7cf0) at ../opal/threads/wait_sync.h:83
#4  0x0000ffff867c3c50 in ompi_request_default_wait_all (count=5, requests=0x1f7d4708, statuses=0x0) at request/req_wait.c:234
#5  0x0000ffff8688c638 in ompi_coll_base_barrier_intra_basic_linear (comm=0x1f7356d0, module=0x1f7c4d10) at base/coll_base_barrier.c:366
#6  0x0000ffff6fa2dc28 in ompi_coll_tuned_barrier_intra_do_this (comm=0x1f7356d0, module=0x1f7c4d10, algorithm=1, faninout=0, segsize=0)
    at coll_tuned_barrier_decision.c:99
#7  0x0000ffff6fa249f8 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x1f7356d0, module=0x1f7c4d10) at coll_tuned_decision_fixed.c:500
#8  0x0000ffff867e6c54 in PMPI_Barrier (comm=0x1f7356d0) at pbarrier.c:66
#9  0x000000000040e22c in IMB_alltoall ()

I find all ranks sitting in a barrier, and every FIFO read comes up with SM_FIFO_FREE, yet they are all waiting on some completion. To me this means a message was overwritten, missed, or dropped.
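
For reference, the per-rank state can be collected by attaching gdb to the hung ranks, roughly as sketched below; the pgrep pattern and the assumption that the selected thread is the one spinning in sm_fifo_read (frame #0 above, where the fifo pointer is in scope) are mine:

# sketch: dump all stacks of each hung rank and print the FIFO being polled
# (requires the --enable-debug build; the process-name pattern is an assumption)
for pid in $(pgrep -f IMB-MPI1); do
    gdb -batch -p "$pid" \
        -ex 'thread apply all bt' \
        -ex 'frame 0' \
        -ex 'print *fifo' \
        -ex detach
done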

I have attempted to reproduce this on the v5.0.x branch; however, there smcuda deactivates itself when it cannot initialize an accelerator stream.
