Scheduler stops itself due to idle timeout, even though workers should still be working #5675
Comments
Just want to note that with such a long-running scheduler, #5667 could be an issue too. Didn't see anything in the logs related to client disconnect, though.
Sorry for the late reply. What version was this running on?
@sevberg I don't think we were ever able to reproduce it, right? Have you still been having problems with it?
Sorry to add a "me too", but we're seeing the same thing: the scheduler is shutting down even though there's work pending.

It looks like an issue with running on EC2 instances. I'll try to get some more logging information from CloudWatch and at least put together a sample that allows us to recreate the issue.
@dthompson-maystreet please also provide the version you are using. Since your logs do not contain any timestamps but otherwise use the default formatting, I'm wondering if you are on a relatively old version (we added timestamps to the default log format in March, #5897). If this issue still persists on newer versions, it would be helpful if you could provide more context about the multiple clients, the computations you submitted, etc.

For everyone who might be affected by this: please confirm that you are running a reasonably new version. There was a fix to idle detection in June (#6563), which has since been released.
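If you're not sure which versions you're on, something like this is enough to report them (just the packages' own version attributes, nothing issue-specific):

```python
# Quick way to report the versions in use when following up here.
import dask
import distributed

print("dask:", dask.__version__)
print("distributed:", distributed.__version__)
```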
Thank you very much for replying; we did some investigation, and here's what we found: we think we're using Dask in a way it wasn't designed for.

As part of a test we were firing single short-running jobs at a Dask cluster on a timer; these would run so quickly that when the […] If we enqueue work on a backlog, or we're constantly firing work at the scheduler, then everything works fine.

I think there could be an argument that the […] Would be interested in your thoughts here too. Thank you for your time!
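For what it's worth, one workaround we're considering for this usage pattern is raising (or disabling) the scheduler's idle timeout so that gaps between short jobs don't look like idleness. A minimal sketch, assuming the `distributed.scheduler.idle-timeout` configuration key is the right knob:

```python
# Sketch of a possible workaround (not a confirmed fix): relax or disable the
# scheduler's idle timeout so short, widely spaced jobs don't look like idleness.
import dask
from distributed import Client, LocalCluster

# None means "never shut down due to idleness"; a duration string such as "1h" also works.
dask.config.set({"distributed.scheduler.idle-timeout": None})

cluster = LocalCluster(n_workers=2)  # the scheduler reads the config when it starts
client = Client(cluster)
```

On a real deployment the same key could presumably be set through the normal dask configuration mechanisms (a YAML config file or a DASK_* environment variable) rather than in code.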
The patch in #6563 / […]
Original issue description:

A user has reported that a long-running scheduler (up for ~18h) appears to have shut itself down because it thought it was idle. However, the scheduler had plenty of work left to do.

The scheduler logs looked like: […]

Without timestamps (#4762), we don't know if those "event loop unresponsive" pauses were related and happened immediately prior to the idle shutdown, or if they were long before.
The confusing thing is that the idle shutdown only happens if no workers are processing any tasks (and there are no unrunnable tasks). Meaning that, during every check for the full 300 s, either no workers were connected at all, or none of the connected workers was processing any tasks.

`check_idle` code, for reference: `distributed/scheduler.py`, lines 7865 to 7882 at 2a3ee56.
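Roughly, that logic looks like this (a simplified paraphrase, not the exact code at that commit):

```python
from time import time  # wall clock; not time.monotonic()

def check_idle(scheduler, idle_timeout=300):
    """Simplified paraphrase of the scheduler's periodic idle check."""
    # Any worker processing tasks, or any unrunnable tasks, means "not idle":
    # forget any previously recorded idle start time and bail out.
    if any(ws.processing for ws in scheduler.workers.values()) or scheduler.unrunnable:
        scheduler.idle_since = None
        return

    # Otherwise, start the idle clock if it isn't running yet...
    if scheduler.idle_since is None:
        scheduler.idle_since = time()
    # ...and shut down once we've (apparently) been idle longer than the timeout.
    elif time() > scheduler.idle_since + idle_timeout:
        scheduler.loop.add_callback(scheduler.close)
```

In particular, note that the elapsed-time comparison uses the wall clock, which is relevant to the `time()` point below.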
Yet in the logs immediately following from `close`, we can see workers with tasks processing (`processing: 2`, `processing: 12`, etc.).

One thing I do notice is that `time()` is not monotonic (we should use `time.monotonic()` instead, xref #4528). So in theory, if the system clock changed (possible on a scheduler that long-running?), we might not be waiting the full 300 s. That still doesn't explain how we got into a situation where the scheduler thought there were no workers processing. But if that situation happened to overlap with a forward jump of the system clock, the shutdown could be triggered immediately instead of actually requiring 300 s to pass.
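To make the clock-jump concern concrete, here's a small illustration (hypothetical numbers, just showing why `time.monotonic()` is the safer choice for measuring elapsed time):

```python
import time

# Record "idle since" with both clocks.
idle_since_wall = time.time()       # wall clock: subject to NTP steps / manual changes
idle_since_mono = time.monotonic()  # monotonic: unaffected by system clock changes

# ... later, in the periodic check ...
elapsed_wall = time.time() - idle_since_wall       # can be wildly wrong if the clock jumped
elapsed_mono = time.monotonic() - idle_since_mono  # always reflects real elapsed time

# If the system clock jumped forward by ten minutes, elapsed_wall would exceed a
# 300 s idle timeout immediately, even though almost no real time had passed.
print(elapsed_wall, elapsed_mono)
```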
One other thing to note is that (according to the logs), over the lifetime of the cluster, workers connected 114 times and disconnected 79 times, so there was a lot of worker churn. When a worker disconnects, you'd see logs like: […]

I mention this also because there are known issues around BatchedSend reconnecting (#5481; see the review of #5457 for more issues). I'm not sure whether it matters in this particular case, though.
Note that these numbers leave 35 workers unaccounted for. That is, after the `Scheduler closing after being idle for 300.00 s` message, we see 12 `Remove worker` messages. However, by my count of how many times the scheduler registered a worker, we should have seen 47 `Remove worker` messages at the end. (When the cluster first started, 47 workers connected, so this number lines up.)

cc @sevberg