-
-
Notifications
You must be signed in to change notification settings - Fork 717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent infinite transition loops; more aggressive validate_state() #6318
Conversation
Unit Test Results 15 files - 1 15 suites - 1 7h 2m 0s ⏱️ - 27m 11s For more details on these failures, see this check. Results for commit 35a5568. ± Comparison against base commit 9f02e7a. ♻️ This comment has been updated with latest results. |
be9bbe2
to
3cececc
Compare
https://github.com/dask/distributed/runs/6386832455?check_suite_focus=true now shows the issue in #6305 Note: this PR is likely to increase flakiness, and I think it's a good thing, as it will crop up a lot of previously undetected issues which are likely to cause deadlocks somewhere else. |
d84b201
to
35a5568
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though a little odd to expose in top-level config, this seems like a good thing to have and great to have in all tests.
To be clear, this doesn't actually fix #6248 (comment), just causes it to be caught more obviously in tests?
# catch potential infinite recursions | ||
self.transition_counter += 1 | ||
if self.validate and self.transition_counter_max: | ||
assert self.transition_counter < self.transition_counter_max |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be nice if this raised something like a TransitionCounterMaxExceeded
error, to be consistent with workers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, but scheduler doesn't have anything like InvalidTransitionError on the worker
Correct |
@crusaderky can we make the 1-line fix for #6305 now, or do you want to see this take effect in CI first to confirm it's working? |
See #6327 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to have a conversation about this. I think this limit should be removed again.
I don't see a huge value in it and would prefer having fewer configuration options, even for tests alone. Whenever we may write stress tests, etc. we'd need to adjust this limit (we can already see a few cases where the "default" limit appears to not be sufficient).
# Cause scheduler and workers to break if they reach this many transitions. | ||
# Used to debug infinite transition loops. | ||
# Note: setting this will cause healthy long-running services to eventually break. | ||
transition-counter-max: False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should try to not clutter this file with stuff we only use for internal tests. Apart from tests I don't see the usefulness of having a global limit on the number of transitions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, this is for tests only.
Partially closes #6305
#6305 introduces an infinite loop in the worker transitions.
Such loop never releases the GIL,
which means that the
@gen_cluster
timeout doesn't work,which means that after 5 minutes pytest_timeout kicks in,
which means that you get no logs and no cluster dump whatsoever.
This PR: