Description
We're getting periodic hangs in simple AWS CI tests on master. For example, in PR #7846 -- which only changed the VERSIONS file and didn't change any code at all -- the ring_usempif08 test appeared to have hung (specific run, for however long it remains available in Jenkins: https://jenkins.open-mpi.org/jenkins/job/open-mpi.build.configure_options/6631/CONFIGURE_OPTIONS=--disable-oshmem/console):
```
--> Running example: hello_usempif08
Hello, world, I am 1 of 2: Open MPI v5.0.0a1, package: Open MPI ubuntu@ip-172-31-13-115 Distribution, ident: 5.0.0a1, repo rev: v2.x-dev-7856-gcfad367, Unreleased developer copy
Hello, world, I am 0 of 2: Open MPI v5.0.0a1, package: Open MPI ubuntu@ip-172-31-13-115 Distribution, ident: 5.0.0a1, repo rev: v2.x-dev-7856-gcfad367, Unreleased developer copy
--> Running example: ring_usempif08
[ip-172-31-13-115:00938] *** Process received signal ***
[ip-172-31-13-115:00938] Signal: Segmentation fault (11)
[ip-172-31-13-115:00938] Signal code: Address not mapped (1)
[ip-172-31-13-115:00938] Failing at address: (nil)
[ip-172-31-13-115:00938] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f56cbe63390]
[ip-172-31-13-115:00938] *** End of error message ***
timeout: the monitored command dumped core
Segmentation fault
Example failed: 139
Command was: timeout -s SIGSEGV 4m mpirun --get-stack-traces --timeout 180 --hostfile /home/ubuntu/workspace/open-mpi.build.configure_options/CONFIGURE_OPTIONS/--disable-oshmem/hostfile -np 2 ./examples/ring_usempif08
```
That's a 3-minute timeout (mpirun's --timeout 180) on a trivial ring program.
This is after a bunch of other hello world / ring tests have already passed.
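One hedged reading of the output above: the CI wraps mpirun in GNU coreutils `timeout -s SIGSEGV 4m`, which delivers SIGSEGV to the monitored command when the limit expires, so a run that hangs long enough gets reported as a segmentation fault and a dumped core rather than as an obvious hang. A quick stand-in, using `sleep` as a hypothetical stuck process and a short limit so it can be run by hand:

```sh
# Hypothetical demo: GNU timeout delivers the requested signal (here SIGSEGV)
# to the monitored command once the limit expires, so a merely-stuck process
# is reported the same way a crashing one would be.
timeout -s SIGSEGV 2s sleep 10
echo "exit status: $?"
```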
A bot retest will likely clear the error and let us merge the PR. But it suggests that there's a real error lurking here -- perhaps a race condition of some kind. I think we'll need to find a way to replicate this outside of the AWS CI environment in order to diagnose it further.
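As a starting point for reproducing this outside of CI, something like the following sketch could hammer on the same example with the same wrapper against a local build. The paths, hostfile, and iteration count are placeholders, not the actual CI harness:

```sh
#!/bin/sh
# Sketch of a local stress loop for the hang (not the CI harness).
# EXAMPLES_DIR and HOSTFILE are placeholders -- point them at a local build.
EXAMPLES_DIR=./examples
HOSTFILE=./hostfile

for i in $(seq 1 500); do
    # Same invocation as the CI log, with output captured per iteration.
    timeout -s SIGSEGV 4m \
        mpirun --get-stack-traces --timeout 180 \
               --hostfile "$HOSTFILE" -np 2 \
               "$EXAMPLES_DIR"/ring_usempif08 > "run-$i.log" 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "iteration $i failed with exit status $status (see run-$i.log)"
    else
        rm -f "run-$i.log"
    fi
done
```

If it never fires locally, that would itself be a data point suggesting the race depends on something specific to the AWS CI environment.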