test_stress_scatter_death #6305
test_stress_scatter_death has suddenly started hanging very frequently on Windows and MacOSX. According to https://dask.org/distributed/test_report.html, the potential culprit is either #6217 or #6248.

CC @fjetter

Note: in the test report above, the failures on Windows are marked as a white box instead of red due to #6304.

Comments
Update: it hangs on Ubuntu too: https://github.com/dask/distributed/runs/6349119823?check_suite_focus=true

It looks like an infinite transition loop (but I may be wrong).
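For intuition only, a toy state machine, not distributed's actual transitions code, showing how a single mis-typed target state (the kind of one-liner typo identified later in this thread) can make a transition engine revisit the same states forever. All names here are illustrative:

```python
# Toy state machine, NOT distributed's real transitions code: one mis-typed
# target state closes a cycle, so the engine revisits states forever.
RECOMMENDATIONS = {
    "released": "fetch",
    "fetch": "flight",
    "flight": "released",  # BUG (the typo): a terminal state like "memory" would
                           # have no outgoing recommendation, stopping the walk.
}

state, seen = "released", set()
while state in RECOMMENDATIONS and state not in seen:
    seen.add(state)
    state = RECOMMENDATIONS[state]
if state in seen:
    print(f"infinite transition loop: revisited {state!r} after {len(seen)} transitions")
```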
Also seen in this CI run.
Replicated on a high-powered desktop after 51 runs.
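For anyone else trying to replicate locally, a rough stress loop along these lines works; the test file path is an assumption about where test_stress_scatter_death lives, so adjust it to your checkout:

```python
# Rough reproduction loop: rerun the flaky test many times and count failures.
# The test path below is an assumption; adjust it to your checkout.
import subprocess

failures = 0
for i in range(100):
    proc = subprocess.run(
        ["pytest", "-q", "distributed/tests/test_stress.py::test_stress_scatter_death"],
        capture_output=True,
    )
    if proc.returncode != 0:
        failures += 1
        print(f"run {i}: FAILED")
print(f"{failures} failures out of 100")
```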
Downstream of #6318, I found three separate issues:
I have no evidence to suggest that the last two are recent regressions; they may have been there for a long time, but we were not invoking Worker.validate_state before.
@fjetter @jrbourbeau @jsignell I think this should be treated as a blocker for the next release.
Dumping my understanding of the story so far:
Let's break it down:

The key was in memory on the worker. This is saying that the scheduler believes that ...

```python
Worker.handle_compute_task(
    key="slowadd-2-17",
    who_has={"slowadd-1-17": [w1, w2, ...], "slowadd-1-18": [w1, w2, ...]},
    ...
)
```

The same method, at the end, calls ...

This is what baffles me. This should not be possible when ... Here we just executed ... However, this is in the same stimulus_id as in ... So either ...
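To sanity-check the stimulus_id claim above, a helper like this can group a worker story by stimulus_id. This is a hypothetical debugging aid, not a distributed API, and it assumes story events are tuples whose last two fields are the stimulus_id and a timestamp:

```python
# Hypothetical debugging helper, not part of distributed: bucket worker-story
# events by stimulus_id so you can see which transitions share one stimulus.
from collections import defaultdict

def group_by_stimulus(story):
    # Assumes each event is a tuple ending in (stimulus_id, timestamp).
    groups = defaultdict(list)
    for *body, stimulus_id, ts in story:
        groups[stimulus_id].append((ts, tuple(body)))
    return dict(groups)
```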
Updates and corrections:
Found it. The infinite transition loop is a one-liner typo introduced in #6248: https://github.com/dask/distributed/pull/6248/files#r870780929

Downstream of fixing the one-liner, I'm now getting:

7 failures out of 1000: this failure is (deliberately) introduced by #6318; I don't think it's a recent regression.

2 failures out of 1000: this again is caused by handle_compute_task with an empty who_has. We should either crash loudly in handle_compute_task or handle it gracefully and keep the task in missing state.
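A minimal sketch of the two options, using simplified names rather than the real Worker.handle_compute_task signature:

```python
# Simplified sketch, NOT the real Worker.handle_compute_task: shows the two
# proposed behaviours for a dependency that arrives with an empty who_has.
def handle_compute_task(key, who_has, task_states, strict=False):
    for dep, workers in who_has.items():
        if not workers:
            if strict:
                # Option 1: crash loudly so the scheduler-side bug surfaces in CI
                raise AssertionError(f"empty who_has for {dep!r} (needed by {key!r})")
            # Option 2: handle gracefully; keep the dependency in "missing" state
            task_states[dep] = "missing"
        else:
            task_states[dep] = "fetch"
```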
Downstream of #6327, I have only these failures left: 4 out of 1000.

This is a new assertion, which detects both of the issues listed in my post above at an earlier point.
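The value of asserting eagerly is that corruption is caught at the transition that introduces it rather than many transitions later. Roughly in the spirit of, though not copied from, Worker.validate_state:

```python
# Illustrative invariant check in the spirit of Worker.validate_state (not the
# actual code): fail at the transition that breaks the invariant, not later.
def validate_task(key, state, who_has):
    if state == "fetch":
        assert who_has.get(key), f"{key!r} is in 'fetch' state with an empty who_has"
```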
I now have a better understanding of the error above:

'slowadd-1-38' is a dependent of 'scatter-0' and 'scatter-3', and a dependency of 'slowadd-2-37' and 'slowadd-2-38'. The transitions system calls ... There has been a recent change to this specific mechanism: #6217 @fjetter
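For reference, a small sketch of the graph shape being described and how losing one scattered key cascades. The key names are copied from the comment; the traversal helper is hypothetical, not the transitions system itself:

```python
# Toy dependents map matching the shape above; not distributed internals.
dependents = {
    "scatter-0": ["slowadd-1-38"],
    "scatter-3": ["slowadd-1-38"],
    "slowadd-1-38": ["slowadd-2-37", "slowadd-2-38"],
}

def impacted(lost_key):
    """Transitively collect dependents that can no longer run as-is."""
    stack, out = [lost_key], set()
    while stack:
        for dep in dependents.get(stack.pop(), ()):
            if dep not in out:
                out.add(dep)
                stack.append(dep)
    return out

print(impacted("scatter-0"))  # {'slowadd-1-38', 'slowadd-2-37', 'slowadd-2-38'}
```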
Given that PR #6327 says it partially closes this issue, should we reopen, or did the PR end up fixing this completely?
As I understand it, #6305 (comment) is catching a real (but rarer) problem that still needs to be fixed. Reopening.
Should be fixed by #6370 |
Not fixed: it's still failing 0.4% of the time on my desktop host and much more frequently on CI. Stack trace: