Scheduler shuts down after 20 minutes of inactivity - tasks not executed #5921
Attachment: forman-scheduler-121956-logs.log
Woo, more problems with BatchedSend! Not exactly #5480, but I'm not sure what else to reference. Also xref #4239. In the "better scheduler logs" I also see lines which make me think the client disconnected and reconnected, i.e. #5667. I don't remember what the default timeouts are, but maybe the 49s of unresponsiveness on the scheduler would be enough for the client to think the connection was broken? The disconnection would have caused all of that client's tasks to be dropped, which could explain why the scheduler thought it was idle. I'm also very curious why there's a full hour's delay (according to the timestamps) before the client reconnects, though. The timeline doesn't add up: the user reports the cluster shutting down after 20 min of inactivity, but the logs keep going for more than an hour after the client disconnects. @lojohnson these logs don't seem to include the scheduler starting up or shutting down; I'd be curious to see full logs for the entire runtime if it's possible to get them. cc @fjetter
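As a side note on the timeouts mentioned above (a sketch, not verified against this cluster's configuration): the connection-related timeouts live under distributed.comm.timeouts in the dask config and can be inspected like this:

```python
import dask

# Config keys from distributed's default configuration; the printed values
# depend on the local setup (defaults are on the order of tens of seconds).
print(dask.config.get("distributed.comm.timeouts.connect"))  # time allowed to establish a connection
print(dask.config.get("distributed.comm.timeouts.tcp"))      # quiet time on an open comm before it is treated as broken
```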
I was able to get logging from the start and end of that scheduler from CloudWatch logging (exported as .csv).
There is no such timeout. Once the connection is established we keep it open forever (distributed/client.py, lines 1378–1395 at 30f0b60). The client only disconnects if an OSError is raised; a bunch of network things can happen to tear this connection, of course (distributed/worker.py, lines 1050–1052 at 30f0b60).
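(For readers without the source open: the referenced snippets describe a read loop of roughly this shape. This is a simplified illustration under that assumption, not the actual distributed code; process_messages is a hypothetical placeholder.)

```python
# Simplified illustration (not the actual distributed source): the client keeps
# reading from the established comm forever and only tears the connection down
# when the network layer surfaces an OSError.
async def handle_scheduler_stream(comm, process_messages):
    while True:
        try:
            msgs = await comm.read()   # waits until the scheduler sends a batch
        except OSError:
            # connection reset, broken pipe, etc. -> treat the comm as gone
            break
        process_messages(msgs)
```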
I'd be surprised if this were configured to 20 minutes anywhere, though.
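For completeness, the scheduler's idle timeout is exposed as a config key (distributed.scheduler.idle-timeout); a quick way to check whether anything set it to roughly 20 minutes, assuming access to the same environment the scheduler ran with:

```python
import dask

# None (the default) means the scheduler never shuts down due to inactivity;
# a value such as "20 minutes" would explain an automatic shutdown after that
# long without work.
print(dask.config.get("distributed.scheduler.idle-timeout"))
```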
Reported by a Coiled user (forman) for his recent clusters, ids 121956 and 121459. The user worked around the issue by setting his next cluster to use 4 workers and including scheduler_options={"idle_timeout": "2 hours"} (see the sketch below).
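A minimal sketch of that workaround, assuming a coiled.Cluster constructor that forwards scheduler_options to the dask scheduler (the exact keyword names on the user's Coiled version are an assumption):

```python
import coiled
from dask.distributed import Client

# Hypothetical reconstruction of the user's workaround: raise the scheduler's
# idle_timeout so it waits 2 hours of inactivity before shutting itself down.
cluster = coiled.Cluster(
    n_workers=4,
    scheduler_options={"idle_timeout": "2 hours"},
)
client = Client(cluster)
```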
Odd behavior was observed in the worker and scheduler logs that may have led to the bad state of the cluster. Full scheduler logs are attached. For cluster 121956, the logs of worker coiled-dask-forman-121956-worker-35c5f23472 show:

Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: distributed.batched - ERROR - Error in batched write
Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: Traceback (most recent call last):
Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: BufferError: Existing exports of data: object cannot be re-sized
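For context (an aside, not taken from the logs): this BufferError is what Python raises when a bytearray is resized while a memoryview over it is still exported, which is presumably the kind of situation the batched write ran into while its payload buffer was still referenced. A standalone reproduction of the error class:

```python
# Resizing a bytearray while a memoryview still references its buffer raises
# BufferError, the same error class seen in the worker log above.
buf = bytearray(b"payload")
view = memoryview(buf)        # exports the underlying buffer
try:
    buf.extend(b" more")      # resize attempt while the export is alive
except BufferError as exc:
    print(exc)                # Existing exports of data: object cannot be re-sized
del view                      # releasing the export makes resizing legal again
buf.extend(b" more")
```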
The scheduler for the same cluster shows that it removed this worker and then ran into a stream of "Unexpected worker completed task" errors referencing the same removed worker:

Mar 09 13:12:05 ip-10-13-11-186 cloud-init[1528]: distributed.core - INFO - Event loop was unresponsive in Scheduler for 30.75s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Mar 09 13:12:10 ip-10-13-11-186 cloud-init[1528]: distributed.scheduler - INFO - Remove worker <WorkerState 'tls://10.13.8.140:44147', name: coiled-dask-forman-121956-worker-35c5f23472, status: running, memory: 952, processing: 7056>
Mar 09 13:12:12 ip-10-13-11-186 cloud-init[1528]: distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tls://10.13.15.152:42935', name: coiled-dask-forman-121956-worker-1f11f7b175, status: running, memory: 806, processing: 8479>, Got: <WorkerState 'tls://10.13.8.140:44147', name: coiled-dask-forman-121956-worker-35c5f23472, status: running, memory: 0, processing: 0>, Key: ('block-info-_upscale_numpy_array-2ba5457e4d03cf22addd23421859e823', 44, 25)
Possibly related to #5675
Scheduler logs:
forman-scheduler-121956-logs.zip
Task graph of the stuck cluster (screenshot attached)