Periodic AWS CI hangs #7847

Closed
@jsquyres

Description

We're getting periodic hangs in simple AWS CI tests on master. For example, in PR #7846 -- which only changed the VERSIONS file, and didn't change any code at all -- the ring_usempif08 test appeared to have hung (specific test, for however long it remains available in Jenkins: https://jenkins.open-mpi.org/jenkins/job/open-mpi.build.configure_options/6631/CONFIGURE_OPTIONS=--disable-oshmem/console):

--> Running example: hello_usempif08
Hello, world, I am  1 of  2: Open MPI v5.0.0a1, package: Open MPI ubuntu@ip-172-31-13-115 Distribution, ident: 5.0.0a1, repo rev: v2.x-dev-7856-gcfad367, Unreleased developer copy
Hello, world, I am  0 of  2: Open MPI v5.0.0a1, package: Open MPI ubuntu@ip-172-31-13-115 Distribution, ident: 5.0.0a1, repo rev: v2.x-dev-7856-gcfad367, Unreleased developer copy
--> Running example: ring_usempif08
[ip-172-31-13-115:00938] *** Process received signal ***
[ip-172-31-13-115:00938] Signal: Segmentation fault (11)
[ip-172-31-13-115:00938] Signal code: Address not mapped (1)
[ip-172-31-13-115:00938] Failing at address: (nil)
[ip-172-31-13-115:00938] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f56cbe63390]
[ip-172-31-13-115:00938] *** End of error message ***
timeout: the monitored command dumped core
Segmentation fault
Example failed: 139
Command was: timeout -s SIGSEGV 4m mpirun --get-stack-traces --timeout 180 --hostfile /home/ubuntu/workspace/open-mpi.build.configure_options/CONFIGURE_OPTIONS/--disable-oshmem/hostfile -np 2  ./examples/ring_usempif08

That's a 3-minute mpirun timeout (--timeout 180), backed by a 4-minute outer timeout command, on a trivial ring program.
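For anyone reading the log: the "Example failed: 139" status follows the usual shell convention of 128 plus a signal number, so 139 corresponds to signal 11 (SIGSEGV), which is the signal the outer "timeout -s SIGSEGV" in the command line was told to send. A minimal Python sketch of that decoding follows; the helper name decode_exit_status is hypothetical and not part of the CI scripts.

```python
import signal


def decode_exit_status(status: int) -> str:
    """Describe a shell-style exit status, where 128 + N means 'killed by signal N'."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {status}"


print(decode_exit_status(139))  # prints "killed by SIGSEGV"
```

This only interprets the status code; it says nothing about which process in the job received the signal.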

This is after a bunch of other hello world / ring tests have already passed.

A bot retest will likely clear the error and enable us to merge the PR. But it suggests that there's a real bug here -- perhaps a race condition of some kind. I think we'll need to replicate this outside of the AWS CI environment in order to diagnose it further.
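One possible way to attempt that replication is to run the example in a loop under a watchdog until an iteration fails to finish. The following is a rough Python sketch, not what the CI does: the helper name run_until_hang, the attempt count, and the assumption that mpirun and ./examples/ring_usempif08 exist in a local build are all illustrative.

```python
import subprocess
import sys


def run_until_hang(cmd, attempts=100, timeout_s=180):
    """Run cmd repeatedly.

    Returns the 1-based attempt number of the first run that timed out
    or exited non-zero, or None if every attempt succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: no exit within {timeout_s}s (possible hang)",
                  file=sys.stderr)
            return attempt
        if result.returncode != 0:
            print(f"attempt {attempt}: exited with status {result.returncode}",
                  file=sys.stderr)
            return attempt
    return None


# Hypothetical usage, assuming a local Open MPI build:
#   run_until_hang(["mpirun", "-np", "2", "./examples/ring_usempif08"])
```

Note that subprocess.run only kills the direct child when the timeout expires, so MPI ranks launched by mpirun may need to be cleaned up by hand. To inspect a hang, it would be better to leave the processes running and attach a debugger than to kill them.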
