
Prevent infinite transition loops; more aggressive validate_state() #6318

Merged · 4 commits merged into dask:main from transition_counter_max on May 12, 2022

Conversation

crusaderky
Collaborator

@crusaderky crusaderky commented May 10, 2022

Partially closes #6305

#6305 introduces an infinite loop in the worker transitions.
That loop never releases the GIL, so the @gen_cluster timeout never fires;
after 5 minutes pytest_timeout kicks in instead,
and you get no logs and no cluster dump whatsoever.

This PR:

  • Runs validate_state() at the end of every test for all workers, in addition to the scheduler. This excludes workers wrapped by nannies and workers not started by gen_cluster itself.
  • Runs validate_state() on Scheduler and Workers in case of TimeoutError, hoping to get an error about invalid state instead of the much more opaque timeout message
  • Implements a limit on the number of transitions a scheduler or worker may go through; once exceeded, it breaks gracefully within the 30s gen_cluster timeout. When that happens, pytest records all logs and a full cluster dump is generated.
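The transition-counter guard described above can be sketched as follows. Class and exception names here are illustrative, not the actual distributed implementation:

```python
class TransitionCounterMaxExceeded(AssertionError):
    """Raised when a state machine exceeds its transition budget."""


class StateMachine:
    """Minimal sketch of the transition-counter guard (hypothetical names)."""

    def __init__(self, transition_counter_max=False):
        # False (the default) disables the limit, so healthy
        # long-running services are never interrupted.
        self.transition_counter = 0
        self.transition_counter_max = transition_counter_max

    def transition(self, key, new_state):
        self.transition_counter += 1
        if (
            self.transition_counter_max
            and self.transition_counter >= self.transition_counter_max
        ):
            # Break gracefully instead of spinning forever without
            # releasing the GIL; the test harness can then collect
            # logs and a cluster dump.
            raise TransitionCounterMaxExceeded(
                f"{self.transition_counter} transitions reached "
                f"while moving {key!r} to {new_state!r}"
            )
        # ... the actual state change would happen here ...
        return new_state
```

Raising an exception (rather than a bare assert) keeps the failure visible and carries context about which key and state triggered the limit.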

@github-actions
Contributor

github-actions bot commented May 10, 2022

Unit Test Results

     15 files (−1) · 15 suites (−1) · 7h 2m 0s ⏱️ (−27m 11s)
  2 774 tests (+3) · 2 695 passed ✔️ (+4) · 78 skipped 💤 (−1) · 1 failed (±0)
 20 580 runs (−1 550) · 19 671 passed ✔️ (−1 439) · 908 skipped 💤 (−111) · 1 failed (±0)

For more details on these failures, see this check.

Results for commit 35a5568. ± Comparison against base commit 9f02e7a.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky changed the title Prevent infinite transition loops Prevent infinite transition loops; more aggressive validate_state() May 11, 2022
@crusaderky crusaderky self-assigned this May 11, 2022
@crusaderky crusaderky marked this pull request as ready for review May 11, 2022 12:16
@crusaderky
Collaborator Author

https://github.com/dask/distributed/runs/6386832455?check_suite_focus=true now shows the issue in #6305
The other two test failures are unrelated.

Note: this PR is likely to increase flakiness, and I think that's a good thing: it will surface a lot of previously undetected issues that are likely to cause deadlocks elsewhere.

crusaderky added a commit to crusaderky/distributed that referenced this pull request May 11, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request May 11, 2022
Collaborator

@gjoseph92 gjoseph92 left a comment


Though a little odd to expose in top-level config, this seems like a good thing to have and great to have in all tests.

To be clear, this doesn't actually fix #6248 (comment), just causes it to be caught more obviously in tests?

distributed/scheduler.py (outdated)
# catch potential infinite recursions
self.transition_counter += 1
if self.validate and self.transition_counter_max:
assert self.transition_counter < self.transition_counter_max
Collaborator


Might be nice if this raised something like a TransitionCounterMaxExceeded error, to be consistent with workers

Collaborator Author


Yeah, but the scheduler doesn't have anything like the worker's InvalidTransitionError.

@crusaderky
Collaborator Author

> To be clear, this doesn't actually fix #6248 (comment), just causes it to be caught more obviously in tests?

Correct

@crusaderky crusaderky merged commit d0fbba6 into dask:main May 12, 2022
@crusaderky crusaderky deleted the transition_counter_max branch May 12, 2022 10:53
@gjoseph92
Collaborator

@crusaderky can we make the 1-line fix for #6305 now, or do you want to see this take effect in CI first to confirm it's working?

@crusaderky
Collaborator Author

> @crusaderky can we make the 1-line fix for #6305 now, or do you want to see this take effect in CI first to confirm it's working?

See #6327

Member

@fjetter fjetter left a comment


I would like to have a conversation about this. I think this limit should be removed again.

I don't see huge value in it and would prefer having fewer configuration options, even for tests alone. Whenever we write stress tests, etc., we'd need to adjust this limit (we can already see a few cases where the default limit is not sufficient).

Comment on lines +280 to +283
# Cause scheduler and workers to break if they reach this many transitions.
# Used to debug infinite transition loops.
# Note: setting this will cause healthy long-running services to eventually break.
transition-counter-max: False
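Code consulting this option might look like the sketch below. A plain dict stands in for the dask config mapping, and the bare key name is an assumption taken from the snippet above (the real key lives under a namespace in distributed.yaml):

```python
def get_transition_counter_max(cfg):
    """Return the transition limit, or None when disabled.

    Hypothetical helper for illustration; ``cfg`` is any mapping
    standing in for the dask configuration.
    """
    value = cfg.get("transition-counter-max", False)
    # False/0 means "no limit": healthy long-running services must
    # never trip the counter outside of tests.
    return int(value) if value else None
```

Keeping `False` as the default means production deployments are unaffected unless a test harness explicitly sets a limit.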
Member


I think we should try not to clutter this file with stuff we only use for internal tests. Apart from tests, I don't see the usefulness of a global limit on the number of transitions.

Collaborator Author


Indeed, this is for tests only.


Successfully merging this pull request may close these issues.

test_stress_scatter_death
3 participants