Labels: bug (Something is broken), flaky test (Intermittent failures on CI)
Description
It looks like test_chaos_rechunk started failing for the first time today: https://dask.github.io/distributed/test_report.html, https://github.com/dask/distributed/actions/runs/2461256827. The failure is an assertion in the transition_flight_missing validation:
    def transition_flight_missing(
        self, ts: TaskState, *, stimulus_id: str
    ) -> RecsInstrs:
>       assert ts.done
E       AssertionError

I also had this fail in CI for my PR in the same way: https://github.com/dask/distributed/runs/6798289115
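For context, here is a minimal sketch of the invariant this assertion guards (the `TaskState` and transition function below are simplified stand-ins, not the real `distributed.worker` code): a task in `flight` is only expected to move to `missing` after its fetch coroutine has finished and set `ts.done`.

```python
# Simplified illustration of the worker state-machine invariant; not the
# actual distributed.worker implementation.
from dataclasses import dataclass


@dataclass
class TaskState:
    key: str
    state: str = "flight"  # being fetched from a peer worker
    done: bool = False     # set once the gather coroutine has finished


def transition_flight_missing(ts: TaskState) -> None:
    # The transition is only valid after the in-flight fetch has completed
    # (successfully or not); hitting this with done=False is the CI failure.
    assert ts.done
    ts.state = "missing"
    ts.done = False


ts = TaskState("x", done=True)   # normal path: gather_dep finished first
transition_flight_missing(ts)

bad = TaskState("y")             # the path seen in CI: fetch still running
transition_flight_missing(bad)   # raises AssertionError
```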
Here's stderr from one of the tests:
    def transition_flight_missing(
        self, ts: TaskState, *, stimulus_id: str
    ) -> RecsInstrs:
>       assert ts.done
E       AssertionError
distributed/worker.py:2098: AssertionError
----------------------------- Captured stdout call -----------------------------
Failed worker tcp://127.0.0.1:43415
----------------------------- Captured stderr call -----------------------------
2022-06-08 12:07:20,441 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:37429
2022-06-08 12:07:20,442 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:37429
2022-06-08 12:07:20,442 - distributed.worker - INFO - dashboard at: 127.0.0.1:41491
2022-06-08 12:07:20,442 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,442 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,442 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,442 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,442 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-gz003weh
2022-06-08 12:07:20,442 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,555 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:36149
2022-06-08 12:07:20,555 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:36149
2022-06-08 12:07:20,555 - distributed.worker - INFO - dashboard at: 127.0.0.1:46843
2022-06-08 12:07:20,555 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,555 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,555 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,555 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,555 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-b5f1xbmq
2022-06-08 12:07:20,555 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,588 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:40519
2022-06-08 12:07:20,588 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:40519
2022-06-08 12:07:20,588 - distributed.worker - INFO - dashboard at: 127.0.0.1:39555
2022-06-08 12:07:20,588 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,588 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,588 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,588 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,588 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-pjuqkoxn
2022-06-08 12:07:20,588 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,607 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:35611
2022-06-08 12:07:20,607 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:35611
2022-06-08 12:07:20,607 - distributed.worker - INFO - dashboard at: 127.0.0.1:34351
2022-06-08 12:07:20,607 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,607 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,607 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,608 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,608 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-6f46yon2
2022-06-08 12:07:20,608 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,644 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:33785
2022-06-08 12:07:20,645 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:33785
2022-06-08 12:07:20,645 - distributed.worker - INFO - dashboard at: 127.0.0.1:45091
2022-06-08 12:07:20,645 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,645 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,645 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,645 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,645 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-5tt6nwgi
2022-06-08 12:07:20,645 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,652 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:43415
2022-06-08 12:07:20,652 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:43415
2022-06-08 12:07:20,652 - distributed.worker - INFO - dashboard at: 127.0.0.1:45945
2022-06-08 12:07:20,652 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,652 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,652 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,652 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,652 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-szyisuy3
2022-06-08 12:07:20,652 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,585 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,585 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,586 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,593 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,594 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,594 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,614 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,615 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,615 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,642 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,642 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,643 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,685 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,686 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,687 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,689 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,690 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,690 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,716 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,716 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,718 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,718 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,720 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,720 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:22,608 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:23,195 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:23,837 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:24,763 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:26,070 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:26,145 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:26,811 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:33785 -> tcp://127.0.0.1:35611
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:33785 remote=tcp://127.0.0.1:51330>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,813 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:35611
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 236, in read
n = await stream.read_into(chunk)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 150, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:57702 remote=tcp://127.0.0.1:35611>: Stream is closed
2022-06-08 12:07:26,810 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:43415 -> tcp://127.0.0.1:35611
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53242>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,816 - distributed.core - INFO - Lost connection to 'tcp://127.0.0.1:53242'
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/core.py", line 597, in handle_comm
result = await result
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53242>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,817 - distributed.core - INFO - Lost connection to 'tcp://127.0.0.1:51330'
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/core.py", line 597, in handle_comm
result = await result
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:33785 remote=tcp://127.0.0.1:51330>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,828 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:35611
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:57706 remote=tcp://127.0.0.1:35611>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,870 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36149
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:43080 remote=tcp://127.0.0.1:36149>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,870 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36149
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:43078 remote=tcp://127.0.0.1:36149>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:27,229 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:27,262 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:27,364 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:41963
2022-06-08 12:07:27,364 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:41963
2022-06-08 12:07:27,364 - distributed.worker - INFO - dashboard at: 127.0.0.1:39987
2022-06-08 12:07:27,364 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:27,364 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:27,364 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:27,364 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:27,364 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-tgv53yl3
2022-06-08 12:07:27,365 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:28,775 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:44643
2022-06-08 12:07:28,778 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:44643
2022-06-08 12:07:28,778 - distributed.worker - INFO - dashboard at: 127.0.0.1:38389
2022-06-08 12:07:28,778 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:28,778 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:28,778 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:28,778 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:28,778 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-msra9yft
2022-06-08 12:07:28,778 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:29,003 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:29,003 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:29,004 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:29,004 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:30,115 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:30,115 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:30,115 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:30,116 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:30,865 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:31,085 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:42643
2022-06-08 12:07:31,085 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:42643
2022-06-08 12:07:31,085 - distributed.worker - INFO - dashboard at: 127.0.0.1:35729
2022-06-08 12:07:31,085 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:31,086 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,086 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:31,086 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:31,086 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-u_1l8ay9
2022-06-08 12:07:31,086 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,094 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:43767
2022-06-08 12:07:31,094 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:43767
2022-06-08 12:07:31,095 - distributed.worker - INFO - dashboard at: 127.0.0.1:34993
2022-06-08 12:07:31,095 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:31,095 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,095 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:31,095 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:31,095 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-kh1fk7cc
2022-06-08 12:07:31,095 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,287 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:43415 -> tcp://127.0.0.1:33785
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53240>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,287 - distributed.core - INFO - Lost connection to 'tcp://127.0.0.1:53240'
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/core.py", line 597, in handle_comm
result = await result
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53240>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,288 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33785
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:51332 remote=tcp://127.0.0.1:33785>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,289 - distributed.worker - ERROR -
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
return await func(*args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3393, in gather_dep
self.transitions(recommendations, stimulus_id=stimulus_id)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2840, in transitions
a_recs, a_instructions = self._transition(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2773, in _transition
recs, instructions = func(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2098, in transition_flight_missing
assert ts.done
AssertionError
2022-06-08 12:07:31,292 - distributed.worker - ERROR -
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 193, in wrapper
return await method(self, *args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
return await func(*args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3393, in gather_dep
self.transitions(recommendations, stimulus_id=stimulus_id)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2840, in transitions
a_recs, a_instructions = self._transition(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2773, in _transition
recs, instructions = func(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2098, in transition_flight_missing
assert ts.done
AssertionError
2022-06-08 12:07:31,292 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:43415
2022-06-08 12:07:31,292 - distributed.worker - INFO - Not waiting on executor to close
2022-06-08 12:07:31,295 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33785
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:51338 remote=tcp://127.0.0.1:33785>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,297 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33785
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:51336 remote=tcp://127.0.0.1:33785>: ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
return await func(*args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1532, in close
_, pending = await asyncio.wait(self._async_instructions, timeout=timeout)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError
2022-06-08 12:07:31,410 - distributed.worker - ERROR -
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
return await func(*args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4588, in _get_data
comm = await rpc.connect(worker)
File "/home/runner/work/distributed/distributed/distributed/core.py", line 1193, in connect
await done.wait()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError
2022-06-08 12:07:31,410 - distributed.worker - CRITICAL - Error trying close worker in response to broken internal state. Forcibly exiting worker NOW
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 225, in _force_close
await asyncio.wait_for(self.close(nanny=False, executor_wait=False), 30)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/tasks.py", line 432, in wait_for
await waiter
asyncio.exceptions.CancelledError
2022-06-08 12:07:31,433 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:31,497 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:31,878 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:32,147 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:32,429 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:32,430 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:32,430 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:32,430 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:33,546 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:33,546 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:33,546 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:33,546 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:33,610 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:43767
2022-06-08 12:07:33,611 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-dd7d2c45-b24a-4c24-befd-357a020e0609 Address tcp://127.0.0.1:43767 Status: Status.closing
2022-06-08 12:07:33,650 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:42643
2022-06-08 12:07:33,652 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-45028c9a-7a2f-4bd0-8bfb-f735f60d8d1c Address tcp://127.0.0.1:42643 Status: Status.closing
2022-06-08 12:07:33,906 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:41963
2022-06-08 12:07:33,934 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-b0e2dbe7-7ea9-4457-be3e-5eaf0ad02333 Address tcp://127.0.0.1:41963 Status: Status.closing
2022-06-08 12:07:34,555 - distributed.worker - INFO - Stopping worker
2022-06-08 12:07:34,556 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-06-08 12:07:34,569 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:46495
2022-06-08 12:07:34,569 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:46495
2022-06-08 12:07:34,569 - distributed.worker - INFO - dashboard at: 127.0.0.1:45525
2022-06-08 12:07:34,569 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:34,569 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,569 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:34,569 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:34,569 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-9em9f50m
2022-06-08 12:07:34,569 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,630 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:43227
2022-06-08 12:07:34,630 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:43227
2022-06-08 12:07:34,630 - distributed.worker - INFO - dashboard at: 127.0.0.1:38507
2022-06-08 12:07:34,630 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:34,630 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,630 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:34,630 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:34,630 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-vptuzvhr
2022-06-08 12:07:34,630 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,633 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:43227
2022-06-08 12:07:34,634 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-06-08 12:07:34,704 - distributed.worker - INFO - Stopping worker
2022-06-08 12:07:34,704 - distributed.worker - INFO - Closed worker has not yet started: Status.init
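In case it helps with bisecting, a hedged local repro loop (assumptions: the test lives at `distributed/tests/test_chaos.py`, and the `pytest-repeat` plugin is installed to provide `--count`):

```python
# Hypothetical repro harness; adjust the test path if it differs.
import sys

import pytest

sys.exit(
    pytest.main(
        [
            "distributed/tests/test_chaos.py::test_chaos_rechunk",
            "--count=25",  # repeat to tease out the intermittent failure
            "-x",          # stop at the first failing repetition
        ]
    )
)
```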
@crusaderky @fjetter what might have landed recently that could have affected this?