Flakiness in test_shuffle.py #8074

Open
crusaderky opened this issue Aug 4, 2023 · 3 comments
Labels
flaky test Intermittent failures on CI. tests Unit tests and/or continuous integration

Comments

@crusaderky
Collaborator

Several tests in test_shuffle.py are very flaky.

If I change .github/workflows/tests.yaml as follows, to rerun the tests 20 times (ci1 + not ci1) per environment:

          pytest distributed/shuffle/tests/test_shuffle.py --count=10 --runslow \
              --leaks=...

I get the following failure rates:

| test | n. failures |
| --- | --- |
| distributed/shuffle/tests/test_shuffle.py::test_clean_after_close | 1 |
| distributed/shuffle/tests/test_shuffle.py::test_closed_input_only_worker_during_transfer | 1 |
| distributed/shuffle/tests/test_shuffle.py::test_closed_worker_during_transfer | 29 |
| distributed/shuffle/tests/test_shuffle.py::test_crashed_worker_during_transfer | 6 |
| distributed/shuffle/tests/test_shuffle.py::test_restarting_during_transfer_raises_killed_worker | 38 |
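
To put these counts in perspective, here is a minimal sketch that ranks the tests by their share of all observed failures. The counts are copied from the table above; the helper name `rank_by_share` is made up for illustration. The total number of runs per test across all environments is not stated here, so this yields shares of failures, not true failure rates.

```python
# Failure counts copied from the table above (test names shortened to the
# function name; all live in distributed/shuffle/tests/test_shuffle.py).
failures = {
    "test_clean_after_close": 1,
    "test_closed_input_only_worker_during_transfer": 1,
    "test_closed_worker_during_transfer": 29,
    "test_crashed_worker_during_transfer": 6,
    "test_restarting_during_transfer_raises_killed_worker": 38,
}


def rank_by_share(counts):
    """Hypothetical helper: return (test, count, share), highest count first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, n, n / total) for name, n in ranked]


ranked = rank_by_share(failures)
for name, n, share in ranked:
    print(f"{name:<55} {n:>3} {share:6.1%}")
```

By this measure, test_restarting_during_transfer_raises_killed_worker and test_closed_worker_during_transfer together account for 67 of the 75 failures.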

Additionally, test_crashed_worker_during_transfer deadlocks irrecoverably on Windows, which causes the whole test suite to be killed by the pytest-timeout settings below:

distributed/pyproject.toml

Lines 155 to 162 in eb297b3

# pytest-timeout settings
# 'thread' kills off the whole test suite. 'signal' only kills the offending test.
# However, 'signal' doesn't work on Windows (due to lack of SIGALRM).
# The CI script modifies this config file on the fly on Linux and MacOS.
timeout_method = "thread"
# This should not be reduced; Windows CI has been observed to be occasionally
# exceptionally slow.
timeout = 300
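
The comment about SIGALRM can be checked from Python directly. The sketch below picks a value for pytest-timeout's `timeout_method` setting based on whether the platform exposes `signal.SIGALRM`. The function name `pick_timeout_method` is hypothetical, and this is not how the CI script does it; per the comment above, the CI script edits the config file on the fly.

```python
import signal


def pick_timeout_method():
    """Hypothetical helper: choose a pytest-timeout method for this platform.

    'signal' relies on SIGALRM, which the signal module does not expose on
    Windows; 'thread' works everywhere but kills the whole test suite.
    """
    return "signal" if hasattr(signal, "SIGALRM") else "thread"


print(pick_timeout_method())
```

On Windows this falls back to "thread", which is why a single deadlocked test there takes the entire run down with it.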

logs: https://github.com/crusaderky/distributed/actions/runs/5761255813

CC @hendrikmakait

@crusaderky crusaderky added flaky test Intermittent failures on CI. tests Unit tests and/or continuous integration and removed needs triage labels Aug 4, 2023
@hendrikmakait
Member

This is a known issue that I forgot to write a dedicated ticket about. (Sorry!)
See #8011 (comment) for a discussion.

#8066 should fix that.

@crusaderky
Collaborator Author

@hendrikmakait this is great!
In the CI of #8066, however, I can still see a flake of test_crashed_worker_during_transfer.

@hendrikmakait
Member

hendrikmakait commented Aug 4, 2023

I noticed that as well. Some tests occasionally flaked even before #7698 was merged; I will have to look into those flakes now that #8066 has been on main. Nonetheless, this should greatly reduce the noise coming from test_shuffle.py.

2 participants