Labels: deadlock (the cluster appears to not make any progress)
Description
A user experienced an issue where their cluster appeared to deadlock and was no longer doing any work. The dashboard showed 6 tasks processing, all on one worker.
All call stacks showed "Task not actively running. It may be finished or not yet started", and there was no meaningful memory/CPU usage on the worker.
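For reference, the same symptoms can be inspected from the Client API. The sketch below is illustrative only (the scheduler address is taken from the worker logs further down; the rest is an assumption about how one would poke at the cluster, not a record of what was actually run):

from distributed import Client

# Attach to the already-running scheduler (address from the worker logs below).
client = Client("tcp://172.23.192.48:43379")

# Task keys the scheduler believes each worker is currently processing;
# in this case, 6 keys, all assigned to the single stuck worker.
print(client.processing())

# Call stacks of currently executing tasks. For the stuck worker this comes
# back empty, matching the dashboard's "Task not actively running" message.
print(client.call_stack())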
Checking the logs, we did see a temporary disconnect from the scheduler:
2022-04-27T16:13:45+0000 cook-init> Started user process: 14
distributed.nanny - INFO - Start Nanny at: 'tcp://10.57.124.149:31030'
distributed.worker - INFO - Start worker at: tcp://10.57.124.149:30030
distributed.worker - INFO - Listening to: tcp://10.57.124.149:30030
distributed.worker - INFO - dashboard at: 10.57.124.149:31300
distributed.worker - INFO - Waiting to connect to: tcp://172.23.192.48:43379
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 93.13 GiB
distributed.worker - INFO - Local Directory: /mnt/sandbox/dask-worker-space/worker-mmwmkoqd
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Starting Worker plugin _WorkerSetupPlugin-c7eaed7e-0d11-418a-a53d-5cca92eef9a9
distributed.worker - INFO - Registered to: tcp://172.23.192.48:43379
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 62.27s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.core - INFO - Event loop was unresponsive in Worker for 9.03s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://10.57.124.149:35132 remote=tcp://172.23.192.48:43379>
Traceback (most recent call last):
File "/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/distributed/2022/1/0/dist/lib/python3.9/distributed/batched.py", line 93, in _background_send
nbytes = yield self.comm.write(
File "/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/tornado/6/1/dist/lib/python3.9/tornado/gen.py", line 762, in run
value = future.result()
File "/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/distributed/2022/1/0/dist/lib/python3.9/distributed/comm/tcp.py", line 247, in write
raise CommClosedError()
distributed.comm.core.CommClosedError
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Removing Worker plugin _WorkerSetupPlugin-c7eaed7e-0d11-418a-a53d-5cca92eef9a9
distributed.worker - INFO - Starting Worker plugin _WorkerSetupPlugin-c7eaed7e-0d11-418a-a53d-5cca92eef9a9
distributed.worker - INFO - Registered to: tcp://172.23.192.48:43379
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/pandas/1/2/5/dist/lib/python3.9/pandas/core/arraylike.py:358: RuntimeWarning: invalid value encountered in sqrt
result = getattr(ufunc, method)(*inputs, **kwargs)
/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/pandas/1/2/5/dist/lib/python3.9/pandas/core/arraylike.py:358: RuntimeWarning: invalid value encountered in sqrt
result = getattr(ufunc, method)(*inputs, **kwargs)
The user was able to connect to the cluster and dump its state (attached).
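For completeness, the attached dump was produced along these lines (a sketch, assuming the dump was taken with Client.dump_cluster_state; the filename is illustrative):

from distributed import Client

# Reconnect to the affected cluster and write a scheduler/worker state dump to
# disk so it can be attached to this issue. By default a msgpack archive
# ("cluster-dump.msgpack.gz") is written.
client = Client("tcp://172.23.192.48:43379")
client.dump_cluster_state("cluster-dump")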
We’re trying to understand:
- What causes this? My understanding is that Dask should be robust to this kind of network blip.
- How to avoid it in the future, or recover in other ways besides manually killing the worker (a sketch of the kind of intervention we mean follows this list)?
- Any other diagnostics we should have done?
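For context, "manually killing the worker" meant killing the worker process from outside Dask. A hedged sketch of the kind of programmatic alternative we have in mind, not verified to actually unwedge a worker in this deadlocked state (the worker address comes from the logs above):

from distributed import Client

client = Client("tcp://172.23.192.48:43379")

# Gracefully retire the stuck worker: its data is copied elsewhere, the worker
# is closed, and its in-processing tasks should be rescheduled on other workers.
client.retire_workers(workers=["tcp://10.57.124.149:30030"], close_workers=True)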
Dask/Distributed: 2022.01.0
Python: 3.9