Description
I saw this in #6829 (comment) as well. Good news: it's not caused by that PR. Bad news: it seems to actually be flaky. I thought I might have fixed this in #6823, but apparently not. This one is a genuine, ordinary restart failure: `Waited for 2 worker(s) to reconnect after restarting, but after 20s, only 1 have returned.`
The `Worker process still alive after 15.999998664855958 seconds, killing` message seems a little concerning. It's possible this test needs to be rewritten after #6504, since before that change the test probably didn't actually care whether the workers shut down on time.
cc @hendrikmakait since I'm curious if #6427 would help here—that cleans up the Nanny.kill implementation a bit.
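For reference, here is a minimal sketch of the kind of pattern that trips this failure, plus the mitigations the error message itself suggests. This is not the actual `test_repeated_restarts` body; the test name and structure below are illustrative, and the `wait_for_workers` keyword is assumed to be exposed on `Client.restart` as in current releases (the scheduler-side error message explicitly names it).

```python
# Illustrative sketch only; not the real test_repeated_restarts.
from distributed import Nanny
from distributed.utils_test import gen_cluster


@gen_cluster(client=True, Worker=Nanny)  # nanny-backed workers, so restart() respawns processes
async def test_repeated_restarts_sketch(c, s, a, b):
    for _ in range(3):
        # Scheduler.restart asks each nanny to kill and respawn its worker,
        # then waits up to `timeout` seconds for the same number of workers
        # to reconnect. The flake above is that wait expiring with only
        # 1 of 2 workers back.
        await c.restart(timeout=20)

    # Mitigations suggested by the error message itself (keywords assumed):
    await c.restart(timeout=60)              # allow slower worker respawns
    await c.restart(wait_for_workers=False)  # don't block on reconnection
```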
____________________________ test_repeated_restarts ____________________________
args = (), kwds = {}
@wraps(func)
def inner(*args, **kwds):
with self._recreate_cm():
> return func(*args, **kwds)
/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/contextlib.py:79:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
distributed/utils_test.py:1111: in test_func
return _run_and_close_tornado(async_fn_outer)
distributed/utils_test.py:376: in _run_and_close_tornado
return asyncio.run(inner_fn())
/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/runners.py:44: in run
return loop.run_until_complete(main)
/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py:634: in run_until_complete
self.run_forever()
/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py:601: in run_forever
self._run_once()
/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py:1869: in _run_once
event_list = self._selector.select(timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <selectors.EpollSelector object at 0x7f3a87ff0f40>, timeout = 0.065
def select(self, timeout=None):
if timeout is None:
timeout = -1
elif timeout <= 0:
timeout = 0
else:
# epoll_wait() has a resolution of 1 millisecond, round away
# from zero to wait *at least* timeout seconds.
timeout = math.ceil(timeout * 1e3) * 1e-3
# epoll_wait() expects `maxevents` to be greater than zero;
# we want to make sure that `select()` can be called when no
# FD is registered.
max_ev = max(len(self._fd_to_key), 1)
ready = []
try:
> fd_event_list = self._selector.poll(timeout, max_ev)
E Failed: Timeout >300.0s
/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/selectors.py:469: Failed
----------------------------- Captured stdout call -----------------------------
Dumped cluster state to test_cluster_dump/test_repeated_restarts.yaml
----------------------------- Captured stderr call -----------------------------
2022-08-05 06:16:59,283 - distributed.scheduler - INFO - State start
2022-08-05 06:16:59,285 - distributed.scheduler - INFO - Clear task state
2022-08-05 06:16:59,286 - distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:45653
2022-08-05 06:16:59,286 - distributed.scheduler - INFO - dashboard at: 127.0.0.1:42243
2022-08-05 06:16:59,298 - distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:35423'
2022-08-05 06:16:59,299 - distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:34089'
2022-08-05 06:17:00,590 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:42031
2022-08-05 06:17:00,590 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:42031
2022-08-05 06:17:00,590 - distributed.worker - INFO - dashboard at: 127.0.0.1:44619
2022-08-05 06:17:00,590 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:45653
2022-08-05 06:17:00,590 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:00,590 - distributed.worker - INFO - Threads: 2
2022-08-05 06:17:00,591 - distributed.worker - INFO - Memory: 6.78 GiB
2022-08-05 06:17:00,591 - distributed.worker - INFO - Local Directory: /tmp/dask-worker-space/worker-16dathge
2022-08-05 06:17:00,591 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:00,610 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:37995
2022-08-05 06:17:00,610 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:37995
2022-08-05 06:17:00,611 - distributed.worker - INFO - dashboard at: 127.0.0.1:37315
2022-08-05 06:17:00,611 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:45653
2022-08-05 06:17:00,611 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:00,611 - distributed.worker - INFO - Threads: 1
2022-08-05 06:17:00,611 - distributed.worker - INFO - Memory: 6.78 GiB
2022-08-05 06:17:00,611 - distributed.worker - INFO - Local Directory: /tmp/dask-worker-space/worker-z6_9djz4
2022-08-05 06:17:00,611 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:01,027 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:42031', name: 1, status: init, memory: 0, processing: 0>
2022-08-05 06:17:01,028 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:42031
2022-08-05 06:17:01,028 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:01,029 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:45653
2022-08-05 06:17:01,030 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:01,030 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:01,059 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:37995', name: 0, status: init, memory: 0, processing: 0>
2022-08-05 06:17:01,059 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:37995
2022-08-05 06:17:01,060 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:01,060 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:45653
2022-08-05 06:17:01,060 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:01,061 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:01,090 - distributed.scheduler - INFO - Receive client connection: Client-37aff111-1486-11ed-8acf-000d3aec3edc
2022-08-05 06:17:01,091 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:01,095 - distributed.scheduler - INFO - Releasing all requested keys
2022-08-05 06:17:01,095 - distributed.scheduler - INFO - Clear task state
2022-08-05 06:17:01,108 - distributed.nanny - INFO - Nanny asking worker to close
2022-08-05 06:17:01,109 - distributed.nanny - INFO - Nanny asking worker to close
2022-08-05 06:17:01,111 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:37995
2022-08-05 06:17:01,112 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:42031
2022-08-05 06:17:01,113 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-c2d1f8e0-49c4-4e4e-bb40-2c755ee8617d Address tcp://127.0.0.1:37995 Status: Status.closing
2022-08-05 06:17:01,114 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-17093a9c-4d9b-472f-89df-0c14cfb9fe80 Address tcp://127.0.0.1:42031 Status: Status.closing
2022-08-05 06:17:01,128 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:37995', name: 0, status: closing, memory: 0, processing: 0>
2022-08-05 06:17:01,128 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:37995
2022-08-05 06:17:01,129 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:42031', name: 1, status: closing, memory: 0, processing: 0>
2022-08-05 06:17:01,129 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:42031
2022-08-05 06:17:01,129 - distributed.scheduler - INFO - Lost all workers
2022-08-05 06:17:01,284 - distributed.nanny - WARNING - Restarting worker
2022-08-05 06:17:02,547 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:38789
2022-08-05 06:17:02,547 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:38789
2022-08-05 06:17:02,547 - distributed.worker - INFO - dashboard at: 127.0.0.1:42551
2022-08-05 06:17:02,547 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:45653
2022-08-05 06:17:02,547 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:02,547 - distributed.worker - INFO - Threads: 1
2022-08-05 06:17:02,547 - distributed.worker - INFO - Memory: 6.78 GiB
2022-08-05 06:17:02,548 - distributed.worker - INFO - Local Directory: /tmp/dask-worker-space/worker-gdowzjl6
2022-08-05 06:17:02,548 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:02,983 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:38789', name: 0, status: init, memory: 0, processing: 0>
2022-08-05 06:17:02,984 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:38789
2022-08-05 06:17:02,985 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:02,985 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:45653
2022-08-05 06:17:02,985 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:02,986 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:17,132 - distributed.nanny - WARNING - Worker process still alive after 15.999998664855958 seconds, killing
2022-08-05 06:17:21,152 - distributed.core - ERROR - Waited for 2 worker(s) to reconnect after restarting, but after 20s, only 1 have returned. Consider a longer timeout, or `wait_for_workers=False`.
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/utils.py", line 799, in wrapper
return await func(*args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 5288, in restart
raise TimeoutError(msg) from None
asyncio.exceptions.TimeoutError: Waited for 2 worker(s) to reconnect after restarting, but after 20s, only 1 have returned. Consider a longer timeout, or `wait_for_workers=False`.
2022-08-05 06:17:21,153 - distributed.core - ERROR - Exception while handling op restart
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/core.py", line 769, in _handle_comm
result = await result
File "/home/runner/work/distributed/distributed/distributed/utils.py", line 799, in wrapper
return await func(*args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 5288, in restart
raise TimeoutError(msg) from None
asyncio.exceptions.TimeoutError: Waited for 2 worker(s) to reconnect after restarting, but after 20s, only 1 have returned. Consider a longer timeout, or `wait_for_workers=False`.
2022-08-05 06:17:59,283 - distributed.scheduler - INFO - Remove client Client-37aff111-1486-11ed-8acf-000d3aec3edc
2022-08-05 06:17:59,283 - distributed.scheduler - INFO - Remove client Client-37aff111-1486-11ed-8acf-000d3aec3edc
2022-08-05 06:17:59,284 - distributed.scheduler - INFO - Close client connection: Client-37aff111-1486-11ed-8acf-000d3aec3edc
2022-08-05 06:17:59,285 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:35423'.
2022-08-05 06:17:59,285 - distributed.nanny - INFO - Nanny asking worker to close
2022-08-05 06:17:59,286 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:34089'.
2022-08-05 06:17:59,287 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:38789
2022-08-05 06:17:59,289 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-18510263-0ed9-4644-89db-405e9a768c10 Address tcp://127.0.0.1:38789 Status: Status.closing
2022-08-05 06:17:59,289 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:38789', name: 0, status: closing, memory: 0, processing: 0>
2022-08-05 06:17:59,289 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:38789
2022-08-05 06:17:59,289 - distributed.scheduler - INFO - Lost all workers
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
Stack of AsyncProcess Dask Worker process (from Nanny) watch message queue (139887856572160)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
self._bootstrap_inner()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/runner/work/distributed/distributed/distributed/process.py", line 216, in _watch_message_queue
msg = q.get()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/queue.py", line 171, in get
self.not_empty.wait()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 312, in wait
waiter.acquire()
~~~~~~~~~~~~~~~~~~~~~ Stack of asyncio_1 (139888427005696) ~~~~~~~~~~~~~~~~~~~~~
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
self._bootstrap_inner()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 81, in _worker
work_item = work_queue.get(block=True)
~~~~~~~~~~~~~~~~~~~~~ Stack of asyncio_0 (139888158570240) ~~~~~~~~~~~~~~~~~~~~~
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
self._bootstrap_inner()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 81, in _worker
work_item = work_queue.get(block=True)
~~~~~~~~~~~~~~ Stack of Dask-Callback-Thread_0 (139887873353472) ~~~~~~~~~~~~~~~
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
self._bootstrap_inner()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 81, in _worker
work_item = work_queue.get(block=True)
~~~~~~~~~~~~~~~~~~ Stack of Dask-Offload_0 (139889176532736) ~~~~~~~~~~~~~~~~~~~
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
self._bootstrap_inner()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 81, in _worker
work_item = work_queue.get(block=True)
Full CI log: https://github.com/dask/distributed/runs/7685981380?check_suite_focus=true#step:11:1327