Skip to content

Conversation

sloede
Copy link
Member

@sloede sloede commented Apr 17, 2023

In the past weeks/months, we have a very high failure rate for CI tests on Windows with MPI that are intermittent and cannot be traced back reasonably to any real programming issue. Some examples:

https://github.com/trixi-framework/Trixi.jl/actions/runs/4703123828/jobs/8341232910
https://github.com/trixi-framework/Trixi.jl/actions/runs/4710627754/jobs/8362973268
https://github.com/trixi-framework/Trixi.jl/actions/runs/4714298641/jobs/8360570746
https://github.com/trixi-framework/Trixi.jl/actions/runs/4698811234/jobs/8351234657
https://github.com/trixi-framework/Trixi.jl/actions/runs/4694820492/jobs/8323339119

Our working hypothesis is that the GitHub runners run out of memory, which is supported by at least some of the failed tests, where we see the following statement in the test failure overview:
image

This happened, e.g., here:
https://github.com/trixi-framework/Trixi.jl/actions/runs/4702374974

The goal of this PR is to try to disable those tests that only fail on Windows to have regularly passing MPI tests again, and then fix this for real (e.g. by reducing the test size) in a subsequent PR.

@codecov
Copy link

codecov bot commented Apr 18, 2023

Codecov Report

Merging #1398 (8c68ddf) into main (d952106) will increase coverage by 0.54%.
The diff coverage is n/a.

❗ Current head 8c68ddf differs from pull request most recent head 08f3a96. Consider uploading reports for the commit 08f3a96 to get more accurate results

@@            Coverage Diff             @@
##             main    #1398      +/-   ##
==========================================
+ Coverage   95.44%   95.97%   +0.54%     
==========================================
  Files         351      351              
  Lines       29122    29122              
==========================================
+ Hits        27793    27949     +156     
+ Misses       1329     1173     -156     
Flag Coverage Δ
unittests 95.97% <ø> (+0.54%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 6 files with indirect coverage changes

@sloede sloede closed this Apr 18, 2023
@sloede sloede reopened this Apr 18, 2023
@sloede sloede closed this Apr 18, 2023
@sloede sloede reopened this Apr 18, 2023
@sloede sloede closed this Apr 19, 2023
@sloede sloede reopened this Apr 19, 2023
@sloede sloede force-pushed the msl/triage-windows-mpi-tests branch from f793f3a to bb6e9ef Compare April 19, 2023 12:53
@sloede sloede closed this Apr 20, 2023
@sloede sloede reopened this Apr 20, 2023
@sloede
Copy link
Member Author

sloede commented Apr 21, 2023

Alright... after a lot of trial and error I am now relatively confident that this represents a working set of MPI-parallel tests for GitHub Actions on Windows. The good news is that this set has been run successfully for at least 4 times in a row without fail. The bad news? I had to completely disable all parallel tests of the P4estMesh.

Where to go from now? My current assessment is still that there is might be a fundamental problem with memory usage However, this is something which I believe less and less. Alternateively, there might be an issue with p4est on Windows, or there is a more fundamental issue with MicrosoftMPI. Neither cause can fully explain why the failures were intermittent, only affected some tests and not all of them, and why also TreeMesh tests are failing.

My suggestion for how to proceed would be to merge this PR (possibly after cleaning up the p4est tests file, since it can be excluded in toto), and then to create an issue such that this can be investigated more thoroughly in the future. Thoughts, suggestions?

@sloede sloede marked this pull request as ready for review April 26, 2023 04:40
@sloede sloede requested a review from ranocha April 26, 2023 04:42
Copy link
Member

@ranocha ranocha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for looking into this issue! I just have a minor suggestion to discuss.

Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
@sloede sloede requested a review from ranocha April 26, 2023 12:04
Copy link
Member

@ranocha ranocha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@ranocha ranocha enabled auto-merge (squash) April 26, 2023 12:28
@ranocha ranocha merged commit 891fb8d into main Apr 26, 2023
@ranocha ranocha deleted the msl/triage-windows-mpi-tests branch April 26, 2023 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants