Flaky distributed/tests/test_nanny.py::test_repeated_restarts #6838

Open · gjoseph92 opened this issue Aug 5, 2022 · 0 comments
Labels: flaky test (Intermittent failures on CI.)

@gjoseph92 (Collaborator) commented:

I saw this in #6829 (comment) as well. Good news: it's not caused by that PR. Bad news: it seems to be genuinely flaky. I thought I might have fixed this in #6823, but apparently not. This one is an ordinary restart failure: `Waited for 2 worker(s) to reconnect after restarting, but after 20s, only 1 have returned.`
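
For context, here is a minimal sketch of the client-side call that exercises this path. The scheduler address is made up, and the keyword argument is inferred from the error message rather than copied from the test body (which presumably runs through the `gen_cluster` machinery visible in the traceback below):

```python
from distributed import Client

# Hypothetical address; the real test builds its cluster with test fixtures.
client = Client("tcp://127.0.0.1:45653")

# restart() kills every worker process (via its nanny) and then, by default,
# waits for the same number of workers to reconnect. If they don't all come
# back within the timeout, the scheduler raises the TimeoutError captured in
# the log below; the error text suggests `wait_for_workers=False` to skip the
# wait entirely.
client.restart(timeout=20)
```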

The `Worker process still alive after 15.999998664855958 seconds, killing` warning seems a little concerning. It's possible this test needs to be rewritten after #6504, since before that it probably didn't actually care whether the workers shut down on time.

cc @hendrikmakait since I'm curious if #6427 would help here—that cleans up the Nanny.kill implementation a bit.
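
For reference, the escalation behind that "still alive ... killing" warning is the usual graceful-terminate-then-hard-kill pattern: ask the child process to stop, and only kill it outright if it is still running after a grace period. Below is a generic sketch of that pattern using `asyncio.subprocess`; it is an illustration of the idea only, not the actual `Nanny.kill` code, and the 16-second default simply mirrors the grace period visible in the log:

```python
import asyncio


async def kill_with_grace(proc: asyncio.subprocess.Process, grace: float = 16.0) -> None:
    """Ask a child process to exit, escalating to a hard kill after `grace` seconds."""
    proc.terminate()  # polite request first (SIGTERM)
    try:
        await asyncio.wait_for(proc.wait(), timeout=grace)
    except asyncio.TimeoutError:
        # Corresponds to "Worker process still alive after ... seconds, killing"
        proc.kill()  # escalate to SIGKILL
        await proc.wait()


async def main() -> None:
    proc = await asyncio.create_subprocess_exec("sleep", "60")
    await kill_with_grace(proc, grace=1.0)


asyncio.run(main())
```

In this log, one worker appears to have ignored the graceful close request, so its nanny sat out the full ~16s grace period before killing it, which leaves very little of the 20s reconnect window for a replacement to come back in.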

____________________________ test_repeated_restarts ____________________________

args = (), kwds = {}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)

/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/contextlib.py:79: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
distributed/utils_test.py:1111: in test_func
    return _run_and_close_tornado(async_fn_outer)
distributed/utils_test.py:376: in _run_and_close_tornado
    return asyncio.run(inner_fn())
/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/runners.py:44: in run
    return loop.run_until_complete(main)
/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py:634: in run_until_complete
    self.run_forever()
/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py:601: in run_forever
    self._run_once()
/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py:1869: in _run_once
    event_list = self._selector.select(timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <selectors.EpollSelector object at 0x7f3a87ff0f40>, timeout = 0.065

    def select(self, timeout=None):
        if timeout is None:
            timeout = -1
        elif timeout <= 0:
            timeout = 0
        else:
            # epoll_wait() has a resolution of 1 millisecond, round away
            # from zero to wait *at least* timeout seconds.
            timeout = math.ceil(timeout * 1e3) * 1e-3
    
        # epoll_wait() expects `maxevents` to be greater than zero;
        # we want to make sure that `select()` can be called when no
        # FD is registered.
        max_ev = max(len(self._fd_to_key), 1)
    
        ready = []
        try:
>           fd_event_list = self._selector.poll(timeout, max_ev)
E           Failed: Timeout >300.0s

/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/selectors.py:469: Failed
----------------------------- Captured stdout call -----------------------------
Dumped cluster state to test_cluster_dump/test_repeated_restarts.yaml
----------------------------- Captured stderr call -----------------------------
2022-08-05 06:16:59,283 - distributed.scheduler - INFO - State start
2022-08-05 06:16:59,285 - distributed.scheduler - INFO - Clear task state
2022-08-05 06:16:59,286 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:45653
2022-08-05 06:16:59,286 - distributed.scheduler - INFO -   dashboard at:           127.0.0.1:42243
2022-08-05 06:16:59,298 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:35423'
2022-08-05 06:16:59,299 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:34089'
2022-08-05 06:17:00,590 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:42031
2022-08-05 06:17:00,590 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:42031
2022-08-05 06:17:00,590 - distributed.worker - INFO -          dashboard at:            127.0.0.1:44619
2022-08-05 06:17:00,590 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45653
2022-08-05 06:17:00,590 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:00,590 - distributed.worker - INFO -               Threads:                          2
2022-08-05 06:17:00,591 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-08-05 06:17:00,591 - distributed.worker - INFO -       Local Directory: /tmp/dask-worker-space/worker-16dathge
2022-08-05 06:17:00,591 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:00,610 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:37995
2022-08-05 06:17:00,610 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:37995
2022-08-05 06:17:00,611 - distributed.worker - INFO -          dashboard at:            127.0.0.1:37315
2022-08-05 06:17:00,611 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45653
2022-08-05 06:17:00,611 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:00,611 - distributed.worker - INFO -               Threads:                          1
2022-08-05 06:17:00,611 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-08-05 06:17:00,611 - distributed.worker - INFO -       Local Directory: /tmp/dask-worker-space/worker-z6_9djz4
2022-08-05 06:17:00,611 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:01,027 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:42031', name: 1, status: init, memory: 0, processing: 0>
2022-08-05 06:17:01,028 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:42031
2022-08-05 06:17:01,028 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:01,029 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45653
2022-08-05 06:17:01,030 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:01,030 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:01,059 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:37995', name: 0, status: init, memory: 0, processing: 0>
2022-08-05 06:17:01,059 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:37995
2022-08-05 06:17:01,060 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:01,060 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45653
2022-08-05 06:17:01,060 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:01,061 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:01,090 - distributed.scheduler - INFO - Receive client connection: Client-37aff111-1486-11ed-8acf-000d3aec3edc
2022-08-05 06:17:01,091 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:01,095 - distributed.scheduler - INFO - Releasing all requested keys
2022-08-05 06:17:01,095 - distributed.scheduler - INFO - Clear task state
2022-08-05 06:17:01,108 - distributed.nanny - INFO - Nanny asking worker to close
2022-08-05 06:17:01,109 - distributed.nanny - INFO - Nanny asking worker to close
2022-08-05 06:17:01,111 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:37995
2022-08-05 06:17:01,112 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:42031
2022-08-05 06:17:01,113 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-c2d1f8e0-49c4-4e4e-bb40-2c755ee8617d Address tcp://127.0.0.1:37995 Status: Status.closing
2022-08-05 06:17:01,114 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-17093a9c-4d9b-472f-89df-0c14cfb9fe80 Address tcp://127.0.0.1:42031 Status: Status.closing
2022-08-05 06:17:01,128 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:37995', name: 0, status: closing, memory: 0, processing: 0>
2022-08-05 06:17:01,128 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:37995
2022-08-05 06:17:01,129 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:42031', name: 1, status: closing, memory: 0, processing: 0>
2022-08-05 06:17:01,129 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:42031
2022-08-05 06:17:01,129 - distributed.scheduler - INFO - Lost all workers
2022-08-05 06:17:01,284 - distributed.nanny - WARNING - Restarting worker
2022-08-05 06:17:02,547 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:38789
2022-08-05 06:17:02,547 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:38789
2022-08-05 06:17:02,547 - distributed.worker - INFO -          dashboard at:            127.0.0.1:42551
2022-08-05 06:17:02,547 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45653
2022-08-05 06:17:02,547 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:02,547 - distributed.worker - INFO -               Threads:                          1
2022-08-05 06:17:02,547 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-08-05 06:17:02,548 - distributed.worker - INFO -       Local Directory: /tmp/dask-worker-space/worker-gdowzjl6
2022-08-05 06:17:02,548 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:02,983 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:38789', name: 0, status: init, memory: 0, processing: 0>
2022-08-05 06:17:02,984 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:38789
2022-08-05 06:17:02,985 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:02,985 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45653
2022-08-05 06:17:02,985 - distributed.worker - INFO - -------------------------------------------------
2022-08-05 06:17:02,986 - distributed.core - INFO - Starting established connection
2022-08-05 06:17:17,132 - distributed.nanny - WARNING - Worker process still alive after 15.999998664855958 seconds, killing
2022-08-05 06:17:21,152 - distributed.core - ERROR - Waited for 2 worker(s) to reconnect after restarting, but after 20s, only 1 have returned. Consider a longer timeout, or `wait_for_workers=False`.
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/utils.py", line 799, in wrapper
    return await func(*args, **kwargs)
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 5288, in restart
    raise TimeoutError(msg) from None
asyncio.exceptions.TimeoutError: Waited for 2 worker(s) to reconnect after restarting, but after 20s, only 1 have returned. Consider a longer timeout, or `wait_for_workers=False`.
2022-08-05 06:17:21,153 - distributed.core - ERROR - Exception while handling op restart
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 769, in _handle_comm
    result = await result
  File "/home/runner/work/distributed/distributed/distributed/utils.py", line 799, in wrapper
    return await func(*args, **kwargs)
  File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 5288, in restart
    raise TimeoutError(msg) from None
asyncio.exceptions.TimeoutError: Waited for 2 worker(s) to reconnect after restarting, but after 20s, only 1 have returned. Consider a longer timeout, or `wait_for_workers=False`.
2022-08-05 06:17:59,283 - distributed.scheduler - INFO - Remove client Client-37aff111-1486-11ed-8acf-000d3aec3edc
2022-08-05 06:17:59,283 - distributed.scheduler - INFO - Remove client Client-37aff111-1486-11ed-8acf-000d3aec3edc
2022-08-05 06:17:59,284 - distributed.scheduler - INFO - Close client connection: Client-37aff111-1486-11ed-8acf-000d3aec3edc
2022-08-05 06:17:59,285 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:35423'.
2022-08-05 06:17:59,285 - distributed.nanny - INFO - Nanny asking worker to close
2022-08-05 06:17:59,286 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:34089'.
2022-08-05 06:17:59,287 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:38789
2022-08-05 06:17:59,289 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-18510263-0ed9-4644-89db-405e9a768c10 Address tcp://127.0.0.1:38789 Status: Status.closing
2022-08-05 06:17:59,289 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:38789', name: 0, status: closing, memory: 0, processing: 0>
2022-08-05 06:17:59,289 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:38789
2022-08-05 06:17:59,289 - distributed.scheduler - INFO - Lost all workers

+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++

 Stack of AsyncProcess Dask Worker process (from Nanny) watch message queue (139887856572160) 
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/runner/work/distributed/distributed/distributed/process.py", line 216, in _watch_message_queue
    msg = q.get()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/queue.py", line 171, in get
    self.not_empty.wait()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()

~~~~~~~~~~~~~~~~~~~~~ Stack of asyncio_1 (139888427005696) ~~~~~~~~~~~~~~~~~~~~~
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 81, in _worker
    work_item = work_queue.get(block=True)

~~~~~~~~~~~~~~~~~~~~~ Stack of asyncio_0 (139888158570240) ~~~~~~~~~~~~~~~~~~~~~
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 81, in _worker
    work_item = work_queue.get(block=True)

~~~~~~~~~~~~~~ Stack of Dask-Callback-Thread_0 (139887873353472) ~~~~~~~~~~~~~~~
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 81, in _worker
    work_item = work_queue.get(block=True)

~~~~~~~~~~~~~~~~~~ Stack of Dask-Offload_0 (139889176532736) ~~~~~~~~~~~~~~~~~~~
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 81, in _worker
    work_item = work_queue.get(block=True)

https://github.com/dask/distributed/runs/7685981380?check_suite_focus=true#step:11:1327
