Disable test(s) that fail often on GitHub CI for Windows/MPI #1398

sloede · 2023-04-17T06:57:38Z

In the past weeks/months, we have had a very high failure rate for CI tests on Windows with MPI. The failures are intermittent and cannot reasonably be traced back to any real programming issue. Some examples:

https://github.com/trixi-framework/Trixi.jl/actions/runs/4703123828/jobs/8341232910
https://github.com/trixi-framework/Trixi.jl/actions/runs/4710627754/jobs/8362973268
https://github.com/trixi-framework/Trixi.jl/actions/runs/4714298641/jobs/8360570746
https://github.com/trixi-framework/Trixi.jl/actions/runs/4698811234/jobs/8351234657
https://github.com/trixi-framework/Trixi.jl/actions/runs/4694820492/jobs/8323339119

Our working hypothesis is that the GitHub runners run out of memory, which is supported by at least some of the failed tests, where we see the following statement in the test failure overview:

This happened, e.g., here:
https://github.com/trixi-framework/Trixi.jl/actions/runs/4702374974

The goal of this PR is to try to disable those tests that only fail on Windows to have regularly passing MPI tests again, and then fix this for real (e.g. by reducing the test size) in a subsequent PR.
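For illustration, a minimal sketch of how a test can be skipped only on Windows CI runners. The helper name `ci_on_windows` and the testset names are hypothetical and not taken from this PR; the sketch only assumes that GitHub Actions sets the environment variable `GITHUB_ACTIONS` to `"true"` and uses `Sys.iswindows()` from Julia's standard library.

```julia
using Test

# Hypothetical helper (name chosen for this sketch): true only when running
# on a GitHub Actions runner under Windows. GitHub Actions sets the
# environment variable GITHUB_ACTIONS to "true" on its runners.
ci_on_windows() = get(ENV, "GITHUB_ACTIONS", "false") == "true" && Sys.iswindows()

@testset "example suite" begin
    # This testset runs on every platform.
    @testset "always runs" begin
        @test 1 + 1 == 2
    end

    # This testset is skipped on Windows CI runners only; it still runs
    # locally on Windows and on all other platforms.
    if !ci_on_windows()
        @testset "skipped on Windows CI" begin
            @test sum(1:10) == 55
        end
    end
end
```

Guarding with a plain `if` keeps the disabled tests in the file, so they can be re-enabled later by removing the condition.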

codecov · 2023-04-18T10:47:13Z

Codecov Report

Merging #1398 (8c68ddf) into main (d952106) will increase coverage by 0.54%.
The diff coverage is n/a.

❗ Current head 8c68ddf differs from pull request most recent head 08f3a96. Consider uploading reports for the commit 08f3a96 to get more accurate results

@@            Coverage Diff             @@
##             main    #1398      +/-   ##
==========================================
+ Coverage   95.44%   95.97%   +0.54%     
==========================================
  Files         351      351              
  Lines       29122    29122              
==========================================
+ Hits        27793    27949     +156     
+ Misses       1329     1173     -156

Flag	Coverage Δ
unittests	`95.97% <ø> (+0.54%)`	⬆️

Flags with carried forward coverage won't be shown.

see 6 files with indirect coverage changes

sloede · 2023-04-21T12:51:05Z

Alright... after a lot of trial and error I am now relatively confident that this represents a working set of MPI-parallel tests for GitHub Actions on Windows. The good news is that this set has run successfully at least 4 times in a row. The bad news? I had to completely disable all parallel tests of the P4estMesh.

Where to go from here? My current assessment is still that there might be a fundamental problem with memory usage. However, this is something I believe less and less. Alternatively, there might be an issue with p4est on Windows, or a more fundamental issue with MicrosoftMPI. Neither cause can fully explain why the failures were intermittent, why they affected only some tests and not all of them, and why TreeMesh tests are failing as well.

My suggestion for how to proceed would be to merge this PR (possibly after cleaning up the p4est tests file, since it can be excluded in toto), and then to create an issue such that this can be investigated more thoroughly in the future. Thoughts, suggestions?

ranocha

Thanks a lot for looking into this issue! I just have a minor suggestion to discuss.

test/test_mpi_tree.jl

ranocha

Thanks!

sloede added 6 commits April 17, 2023 08:57

Disable test(s) that fail often on GitHub CI for Windows/MPI

5f73c09

Merge branch 'main' into msl/triage-windows-mpi-tests

d71bd95

Import missing variable

0b03005

Disable more tests

cb252ee

Merge branch 'main' into msl/triage-windows-mpi-tests

8fce56c

Disable more tests

6b01e1c

sloede added 2 commits April 18, 2023 14:05

Disable more tests

5ee828c

Disable more tests

49eba68

sloede closed this Apr 18, 2023

sloede reopened this Apr 18, 2023

sloede added 2 commits April 18, 2023 17:09

Disable more tests

9871fa5

Merge branch 'main' into msl/triage-windows-mpi-tests

4444240

sloede closed this Apr 18, 2023

sloede reopened this Apr 18, 2023

Disable more tests

4159ee1

sloede closed this Apr 19, 2023

sloede reopened this Apr 19, 2023

sloede added 4 commits April 19, 2023 08:03

Disable more tests

6f6fce0

Call GC.gc() before each test

e39024f

Disable more tests

f60e4f1

Disable more tests

bb6e9ef

sloede force-pushed the msl/triage-windows-mpi-tests branch from f793f3a to bb6e9ef Compare April 19, 2023 12:53

sloede added 7 commits April 19, 2023 16:05

Merge branch 'main' into msl/triage-windows-mpi-tests

b09f8f8

Disable more tests

511f97f

Merge branch 'msl/triage-windows-mpi-tests' of github.com:trixi-framework/Trixi.jl into msl/triage-windows-mpi-tests

912ddfa

Remove GC.gc() statements since they seem to have no effect

74cbea4

Disable more tests

3859a77

Disable more tests

b0505b6

Disable more tests

2d9bff1

Merge branch 'main' into msl/triage-windows-mpi-tests

6dbe18b

sloede closed this Apr 20, 2023

sloede reopened this Apr 20, 2023

sloede added 2 commits April 25, 2023 19:40

Clean up files

11ac407

Merge branch 'main' into msl/triage-windows-mpi-tests

8c68ddf

sloede marked this pull request as ready for review April 26, 2023 04:40

sloede requested a review from ranocha April 26, 2023 04:42

sloede mentioned this pull request Apr 26, 2023

Investigate why MPI tests on Windows fail and fix them #1410

Open

ranocha requested changes Apr 26, 2023

test/test_mpi_tree.jl (outdated, resolved)

Update test/test_mpi_tree.jl

08f3a96

Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>

sloede requested a review from ranocha April 26, 2023 12:04

ranocha approved these changes Apr 26, 2023

ranocha enabled auto-merge (squash) April 26, 2023 12:28

ranocha merged commit 891fb8d into main Apr 26, 2023

ranocha deleted the msl/triage-windows-mpi-tests branch April 26, 2023 14:49
