Dask Cluster Deadlocked #6228

Closed
lojohnson opened this issue Apr 27, 2022 · 2 comments
Labels: deadlock (The cluster appears to not make any progress)

Comments


lojohnson commented Apr 27, 2022

A user experienced an issue where his cluster seemed to deadlock and was no longer doing any work. The overall dashboard showed 6 tasks processing, all on one worker:

[Screenshot: dashboard1]

[Screenshot: dashboard2]

All callstacks showed “Task not actively running. It may be finished or not yet started” and there was no meaningful memory/cpu usage by the worker.
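
For context, the call stacks were pulled from the client side with roughly the following (a sketch; the exact calls aren't shown here, and the scheduler address is taken from the logs below):

from dask.distributed import Client

client = Client("tcp://172.23.192.48:43379")  # scheduler address from the logs below
print(client.call_stack())   # call stacks of tasks currently processing on each worker
print(client.processing())   # keys each worker reports as currently running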
Checking the logs, we did see a temporary disconnect from the scheduler:

2022-04-27T16:13:45+0000 cook-init> Started user process: 14
distributed.nanny - INFO -         Start Nanny at: 'tcp://10.57.124.149:31030'
distributed.worker - INFO -       Start worker at:  tcp://10.57.124.149:30030
distributed.worker - INFO -          Listening to:  tcp://10.57.124.149:30030
distributed.worker - INFO -          dashboard at:        10.57.124.149:31300
distributed.worker - INFO - Waiting to connect to:  tcp://172.23.192.48:43379
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                  93.13 GiB
distributed.worker - INFO -       Local Directory: /mnt/sandbox/dask-worker-space/worker-mmwmkoqd
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Starting Worker plugin _WorkerSetupPlugin-c7eaed7e-0d11-418a-a53d-5cca92eef9a9
distributed.worker - INFO -         Registered to:  tcp://172.23.192.48:43379
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 62.27s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.worker - INFO - Connection to scheduler broken.  Reconnecting...
distributed.core - INFO - Event loop was unresponsive in Worker for 9.03s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://10.57.124.149:35132 remote=tcp://172.23.192.48:43379>
Traceback (most recent call last):
  File "/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/distributed/2022/1/0/dist/lib/python3.9/distributed/batched.py", line 93, in _background_send
    nbytes = yield self.comm.write(
  File "/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/tornado/6/1/dist/lib/python3.9/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/distributed/2022/1/0/dist/lib/python3.9/distributed/comm/tcp.py", line 247, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Removing Worker plugin _WorkerSetupPlugin-c7eaed7e-0d11-418a-a53d-5cca92eef9a9
distributed.worker - INFO - Starting Worker plugin _WorkerSetupPlugin-c7eaed7e-0d11-418a-a53d-5cca92eef9a9
distributed.worker - INFO -         Registered to:  tcp://172.23.192.48:43379
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/pandas/1/2/5/dist/lib/python3.9/pandas/core/arraylike.py:358: RuntimeWarning: invalid value encountered in sqrt
  result = getattr(ufunc, method)(*inputs, **kwargs)
/home/var/ts-fuse/artfs-fuse-mount-0/700f4b763772b690495b140bf00486aa5f458325_20220420_141812_856/glibc-2.24-x86_64/ts/modeling/environment/interactive/../../../../ext/public/python/pandas/1/2/5/dist/lib/python3.9/pandas/core/arraylike.py:358: RuntimeWarning: invalid value encountered in sqrt
  result = getattr(ufunc, method)(*inputs, **kwargs)

User was able to connect to the cluster and dump the state (attached).
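
Presumably the dump was produced with something along these lines (a sketch; the exact invocation isn't shown). Client.dump_cluster_state writes the scheduler's and workers' internal state to disk for offline debugging:

from dask.distributed import Client

client = Client("tcp://172.23.192.48:43379")    # scheduler address from the logs above
client.dump_cluster_state("dask-cluster-dump")  # serializes scheduler and worker state to a file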

We’re trying to understand:

1. What causes this? My understanding is dask should be robust to this kind of network blip.
2. How to avoid it in the future? Or recover in other ways besides manually killing the worker (see the sketch at the end of this comment).
3. Any other diagnostics we should have done.

Dask/Distributed: 2022.01.0

Python: 3.9
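
On question 2, besides killing the worker process by hand, the client-side options we're aware of look roughly like this (a sketch; we don't know whether either call actually unwedges a deadlocked worker, hence the question):

from dask.distributed import Client

client = Client("tcp://172.23.192.48:43379")  # scheduler address from the logs above
stuck = "tcp://10.57.124.149:30030"           # the worker holding the stalled tasks
client.retire_workers([stuck], close_workers=True)  # ask the scheduler to retire and close it
# or, more drastically, restart all worker processes and clear every task:
# client.restart()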


gjoseph92 (Collaborator) commented Apr 27, 2022

@lojohnson I don't see a cluster state dump attached?

"My understanding is dask should be robust to this kind of network blip."

Yes, it should be, and in many cases, it is. But this may be a buggy case we've seen before: #5480, #5457

gjoseph92 (Collaborator) commented

With worker reconnection removed in #6361, this should be resolved in the latest version (2022.5.1).
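
A quick way to confirm a deployment has picked up that change (a suggested check, not from this thread):

import distributed
from packaging.version import Version

if Version(distributed.__version__) < Version("2022.5.1"):
    print(f"distributed {distributed.__version__} predates the worker-reconnection removal; "
          "upgrade to 2022.5.1 or later")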
