A user experienced an issue where their cluster appeared to deadlock and was no longer doing any work. The overall dashboard showed 6 tasks processing, all on one worker:
All call stacks showed "Task not actively running. It may be finished or not yet started", and there was no meaningful memory/CPU usage by the worker.
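For reference, the same information can also be pulled from a client session. This is a minimal sketch, not what the user necessarily ran; the `Client` connection and the scheduler address (taken from the logs below) are illustrative:

```python
from distributed import Client

# Connect to the running scheduler (address taken from the worker logs below).
client = Client("tcp://172.23.192.48:43379")

# Tasks each worker believes it is processing (matches the dashboard view).
print(client.processing())

# Call stacks of currently executing tasks; stuck tasks report
# "Task not actively running. It may be finished or not yet started".
print(client.call_stack())
```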
Checking the logs, we saw a temporary disconnect from the scheduler:
2022-04-27T16:13:45+0000 cook-init> Started user process: 14
distributed.nanny - INFO - Start Nanny at: 'tcp://10.57.124.149:31030'
distributed.worker - INFO - Start worker at: tcp://10.57.124.149:30030
distributed.worker - INFO - Listening to: tcp://10.57.124.149:30030
distributed.worker - INFO - dashboard at: 10.57.124.149:31300
distributed.worker - INFO - Waiting to connect to: tcp://172.23.192.48:43379
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 93.13 GiB
distributed.worker - INFO - Local Directory: /mnt/sandbox/dask-worker-space/worker-mmwmkoqd
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Starting Worker plugin _WorkerSetupPlugin-c7eaed7e-0d11-418a-a53d-5cca92eef9a9
distributed.worker - INFO - Registered to: tcp://172.23.192.48:43379
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 62.27s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.core - INFO - Event loop was unresponsive in Worker for 9.03s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://10.57.124.149:35132 remote=tcp://172.23.192.48:43379>
Traceback (most recent call last):
File "/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/distributed/2022/1/0/dist/lib/python3.9/distributed/batched.py", line 93, in _background_send
nbytes = yield self.comm.write(
File "/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/tornado/6/1/dist/lib/python3.9/tornado/gen.py", line 762, in run
value = future.result()
File "/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/distributed/2022/1/0/dist/lib/python3.9/distributed/comm/tcp.py", line 247, in write
raise CommClosedError()
distributed.comm.core.CommClosedError
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Removing Worker plugin _WorkerSetupPlugin-c7eaed7e-0d11-418a-a53d-5cca92eef9a9
distributed.worker - INFO - Starting Worker plugin _WorkerSetupPlugin-c7eaed7e-0d11-418a-a53d-5cca92eef9a9
distributed.worker - INFO - Registered to: tcp://172.23.192.48:43379
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/pandas/1/2/5/dist/lib/python3.9/pandas/core/arraylike.py:358: RuntimeWarning: invalid value encountered in sqrt
result = getattr(ufunc, method)(*inputs, **kwargs)
/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/pandas/1/2/5/dist/lib/python3.9/pandas/core/arraylike.py:358: RuntimeWarning: invalid value encountered in sqrt
result = getattr(ufunc, method)(*inputs, **kwargs)
The user was able to connect to the cluster and dump its state (attached).
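A minimal sketch of how such a dump can be produced, assuming a distributed version that provides `Client.dump_cluster_state` and the same illustrative scheduler address as above:

```python
from distributed import Client

client = Client("tcp://172.23.192.48:43379")

# Write scheduler and worker state to a file for offline inspection.
client.dump_cluster_state(filename="dask-cluster-dump")
```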
We’re trying to understand:
What causes this? My understanding is that Dask should be robust to this kind of network blip.
How can we avoid it in the future, or recover in ways other than manually killing the worker? (A possible recovery approach is sketched below these questions.)
Are there any other diagnostics we should have run?
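One possible manual recovery, sketched under the assumption that it is acceptable to lose any in-memory results held by the stuck worker: ask the scheduler to retire that worker so its tasks are rescheduled elsewhere, instead of killing the process by hand. The worker address is taken from the logs above for illustration.

```python
from distributed import Client

client = Client("tcp://172.23.192.48:43379")

# Retire the stuck worker; the scheduler reschedules its pending tasks elsewhere.
stuck = "tcp://10.57.124.149:30030"  # worker address from the logs above
client.retire_workers(workers=[stuck])

# Heavier-handed alternative: restart all worker processes (clears all tasks and data).
# client.restart()
```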
Dask/Distributed: 2022.01.0
Python: 3.9