Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v4.1.x
commit 6c1ecd00767b54c70e524d9d551db1f132c1fca8 (HEAD -> v4.1.x, origin/v4.1.x)
Date: Thu Jan 18 12:06:31 2024 -0500
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone:
git clone --recurse-submodules -j4 https://github.com/open-mpi/ompi.git --branch v4.1.x
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
(no output) Oddly, git submodule status exits 0 but prints nothing; of course, there are no submodules in v4.1.x!
Please describe the system on which you are running
- Operating system/version: Ubuntu 20.04, Rocky 9, Others
- Computer hardware: Graviton 3 (AWS *7g instance types)
- Network type: None
Details of the problem
Started in https://gitlab.com/eessi/support/-/issues/41#note_1738867500
There are some applications that crash or hang when run on c7g.4xlarge. The EasyBuild configurations include a patch to always build against CUDA; however, I can reproduce hangs with a fresh build (no EB patches) against the Debian-provided CUDA (no crashes yet).
The symptom I've been able to reproduce is a hang in the smcuda btl, so I must configure with CUDA support, and ofi must be excluded either at configure time or at run time. Note that we are not using CUDA memory at all, only the smcuda btl.
./configure --with-cuda --prefix=/fsx/tmp/ompi-with-cuda --enable-debug
make -j && make -j install
module load <my-new-build>
mpirun --mca btl ^ofi --mca mtl ^ofi -n 64 /fsx/lrbison/eessi/mpi-benchmarks/src_c/IMB-MPI1 alltoall -npmin 64
I'm compiling with gcc 12.3.0.
In fully loaded runs (i.e., 64 ranks on hpc7g), IMB's allgather or alltoall test hangs relatively frequently (roughly 10% of runs?). On a lightly loaded node (6 of 64 cores) it may take as many as 300 executions to hit a hang.
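For what it's worth, the pattern IMB exercises here is just repeated collective exchanges followed by a barrier. Below is a minimal standalone sketch of that pattern (my own illustrative code, not IMB, and I have not verified it reproduces the hang as reliably as the benchmark); it can be built with mpicc and launched with the same mpirun line as above:

```c
/* Hypothetical minimal reproducer sketch (not IMB itself): repeated
 * MPI_Alltoall exchanges each followed by MPI_Barrier, mirroring the
 * pattern that hangs above. Buffer size and iteration count are arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;   /* ints per peer, arbitrary choice */
    int *sendbuf = malloc((size_t)size * count * sizeof(int));
    int *recvbuf = malloc((size_t)size * count * sizeof(int));
    for (int i = 0; i < size * count; i++)
        sendbuf[i] = rank;

    for (int iter = 0; iter < 1000; iter++) {
        MPI_Alltoall(sendbuf, count, MPI_INT,
                     recvbuf, count, MPI_INT, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);   /* the barrier is where the hang shows up */
        if (rank == 0 && iter % 100 == 0)
            printf("iteration %d done\n", iter);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```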
A backtrace looks like this:
#0 0x0000ffff84320a14 in sm_fifo_read (fifo=0xffff7010d480) at /dev/shm/ompi/opal/mca/btl/smcuda/btl_smcuda.h:315
#1 0x0000ffff84322ce4 in mca_btl_smcuda_component_progress () at btl_smcuda_component.c:1036
#2 0x0000ffff863b5aa4 in opal_progress () at runtime/opal_progress.c:231
#3 0x0000ffff867c3258 in sync_wait_st (sync=0xffffff8f7cf0) at ../opal/threads/wait_sync.h:83
#4 0x0000ffff867c3c50 in ompi_request_default_wait_all (count=5, requests=0x1f7d4708, statuses=0x0) at request/req_wait.c:234
#5 0x0000ffff8688c638 in ompi_coll_base_barrier_intra_basic_linear (comm=0x1f7356d0, module=0x1f7c4d10) at base/coll_base_barrier.c:366
#6 0x0000ffff6fa2dc28 in ompi_coll_tuned_barrier_intra_do_this (comm=0x1f7356d0, module=0x1f7c4d10, algorithm=1, faninout=0, segsize=0)
at coll_tuned_barrier_decision.c:99
#7 0x0000ffff6fa249f8 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x1f7356d0, module=0x1f7c4d10) at coll_tuned_decision_fixed.c:500
#8 0x0000ffff867e6c54 in PMPI_Barrier (comm=0x1f7356d0) at pbarrier.c:66
#9 0x000000000040e22c in IMB_alltoall ()
I find all ranks sitting in a barrier, and every FIFO read comes up with SM_FIFO_FREE, yet they are all waiting on some completion. To me this means a message was overwritten, missed, or dropped.
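For context on why I read it that way: the smcuda shared-memory path delivers fragments through per-receiver FIFOs that the reader polls from opal_progress(). The sketch below is a heavily simplified, hypothetical illustration of that kind of FIFO (it is not the actual btl_smcuda code; apart from the SM_FIFO_FREE sentinel name, everything here is mine). It shows how a single overwritten or improperly published write leaves the consumer polling SM_FIFO_FREE forever while the matching completion never fires, which is the state observed in the backtrace:

```c
/* Highly simplified illustration (NOT the actual btl_smcuda code) of a
 * shared-memory FIFO, to show the failure mode I suspect: if a producer's
 * write of a slot is lost, overwritten, or not properly published to the
 * consumer, the reader keeps seeing "free" slots even though a fragment
 * was enqueued, and every rank spins in opal_progress() forever. */
#include <stdatomic.h>
#include <stddef.h>

#define SM_FIFO_FREE  ((void *) -1)   /* sentinel marking an empty slot */
#define FIFO_SIZE     128

typedef struct {
    _Atomic(void *) queue[FIFO_SIZE]; /* slots hold fragment pointers   */
    size_t          head;             /* consumer index (single reader) */
} sm_fifo_t;

/* Consumer side: poll the head slot; return a fragment or SM_FIFO_FREE. */
static void *fifo_read(sm_fifo_t *fifo)
{
    /* acquire load pairs with the producer's release store below */
    void *frag = atomic_load_explicit(&fifo->queue[fifo->head],
                                      memory_order_acquire);
    if (frag == SM_FIFO_FREE)
        return SM_FIFO_FREE;          /* nothing to do; caller keeps polling */

    atomic_store_explicit(&fifo->queue[fifo->head], SM_FIFO_FREE,
                          memory_order_relaxed);
    fifo->head = (fifo->head + 1) % FIFO_SIZE;
    return frag;
}

/* Producer side (tail management elided): the release store is what makes
 * the fragment visible to the reader.  If this publication step or the
 * slot-full check is broken, the enqueued fragment is effectively dropped
 * and the consumer polls SM_FIFO_FREE forever, exactly as seen above. */
static int fifo_write(sm_fifo_t *fifo, size_t tail, void *frag)
{
    if (atomic_load_explicit(&fifo->queue[tail],
                             memory_order_relaxed) != SM_FIFO_FREE)
        return -1;                    /* slot still occupied: queue full */
    atomic_store_explicit(&fifo->queue[tail], frag, memory_order_release);
    return 0;
}
```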
I have attempted to reproduce this on the v5.0.x branch; however, there smcuda deactivates itself when it cannot initialize an accelerator stream.