Description
We're getting periodic hangs in simple AWS CI tests on master. For example, in PR #7846 -- which only changed the VERSIONS file and didn't change any code at all -- the ring_usempif08 test appeared to have hung (specific run, for however long it remains available in Jenkins: https://jenkins.open-mpi.org/jenkins/job/open-mpi.build.configure_options/6631/CONFIGURE_OPTIONS=--disable-oshmem/console):
```
--> Running example: hello_usempif08
Hello, world, I am 1 of 2: Open MPI v5.0.0a1, package: Open MPI ubuntu@ip-172-31-13-115 Distribution, ident: 5.0.0a1, repo rev: v2.x-dev-7856-gcfad367, Unreleased developer copy
Hello, world, I am 0 of 2: Open MPI v5.0.0a1, package: Open MPI ubuntu@ip-172-31-13-115 Distribution, ident: 5.0.0a1, repo rev: v2.x-dev-7856-gcfad367, Unreleased developer copy
--> Running example: ring_usempif08
[ip-172-31-13-115:00938] *** Process received signal ***
[ip-172-31-13-115:00938] Signal: Segmentation fault (11)
[ip-172-31-13-115:00938] Signal code: Address not mapped (1)
[ip-172-31-13-115:00938] Failing at address: (nil)
[ip-172-31-13-115:00938] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f56cbe63390]
[ip-172-31-13-115:00938] *** End of error message ***
timeout: the monitored command dumped core
Segmentation fault
Example failed: 139
Command was: timeout -s SIGSEGV 4m mpirun --get-stack-traces --timeout 180 --hostfile /home/ubuntu/workspace/open-mpi.build.configure_options/CONFIGURE_OPTIONS/--disable-oshmem/hostfile -np 2 ./examples/ring_usempif08
```
That's a 3-minute timeout (mpirun's --timeout 180) on a trivial ring program.
This is after a bunch of other hello world / ring tests have already passed.
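One hedged reading of the output above: the CI wraps mpirun in GNU coreutils `timeout -s SIGSEGV 4m`, which delivers SIGSEGV to the monitored command when the limit expires, so a run that hangs long enough gets reported as a segmentation fault and a dumped core rather than as an obvious hang. A quick stand-in, using `sleep` as a hypothetical stuck process and a short limit so it can be run by hand:

```sh
# Hypothetical demo: GNU timeout delivers the requested signal (here SIGSEGV)
# to the monitored command once the limit expires, so a merely-stuck process
# is reported the same way a crashing one would be.
timeout -s SIGSEGV 2s sleep 10
echo "exit status: $?"
```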
A bot retest will likely clear the error and let us merge the PR. But it suggests that there's a real error lurking here -- perhaps a race condition of some kind. I think we'll need to find a way to replicate this outside of the AWS CI environment in order to diagnose it further.
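As a starting point for reproducing this outside of CI, something like the following sketch could hammer on the same example with the same wrapper against a local build. The paths, hostfile, and iteration count are placeholders, not the actual CI harness:

```sh
#!/bin/sh
# Sketch of a local stress loop for the hang (not the CI harness).
# EXAMPLES_DIR and HOSTFILE are placeholders -- point them at a local build.
EXAMPLES_DIR=./examples
HOSTFILE=./hostfile

for i in $(seq 1 500); do
    # Same invocation as the CI log, with output captured per iteration.
    timeout -s SIGSEGV 4m \
        mpirun --get-stack-traces --timeout 180 \
               --hostfile "$HOSTFILE" -np 2 \
               "$EXAMPLES_DIR"/ring_usempif08 > "run-$i.log" 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "iteration $i failed with exit status $status (see run-$i.log)"
    else
        rm -f "run-$i.log"
    fi
done
```

If it never fires locally, that would itself be a data point suggesting the race depends on something specific to the AWS CI environment.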