test_bad_disk flaky #7208

Closed
@jrbourbeau

distributed/shuffle/tests/test_shuffle.py::test_bad_disk has started failing intermittently on main with the traceback below. See this CI run for an example.

________________________________ test_bad_disk _________________________________
1 thread(s) were leaked from test

------ Call stack of leaked thread 1/1: <Thread(ThreadPoolExecutor-69_0, started 140170015799040)> ------
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
	self._bootstrap_inner()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
	self.run()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
	self._target(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 81, in _worker
	work_item = work_queue.get(block=True)
----------------------------- Captured stderr call -----------------------------
2022-10-27 14:34:53,950 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 7)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 7, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,951 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 1)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 1, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,955 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 0)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 0, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,987 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 2)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 2, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,987 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 5)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 5, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,987 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 3)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 3, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,990 - distributed.worker - ERROR - Exception during execution of task ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 4).
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2341, in _prepare_args_for_execution
    data[k] = self.data[k]
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/zict/buffer.py", line 108, in __getitem__
    raise KeyError(key)
KeyError: 'shuffle-barrier-3110a8a90a5b642409b0a20f83b03722'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2239, in execute
    args2, kwargs2 = self._prepare_args_for_execution(ts, args, kwargs)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2345, in _prepare_args_for_execution
    data[k] = Actor(type(self.state.actors[k]), self.address, k, self)
KeyError: 'shuffle-barrier-3110a8a90a5b642409b0a20f83b03722'
2022-10-27 14:34:53,996 - distributed.diskutils - ERROR - Failed to remove '/tmp/dask-worker-space/worker-dt9g6wgx' (failed in <built-in function lstat>): [Errno 2] No such file or directory: '/tmp/dask-worker-space/worker-dt9g6wgx'
2022-10-27 14:34:53,996 - distributed.diskutils - ERROR - Failed to remove '/tmp/dask-worker-space/worker-ihvksve8' (failed in <built-in function lstat>): [Errno 2] No such file or directory: '/tmp/dask-worker-space/worker-ihvksve8'
------------------------------ Captured log call -------------------------------
ERROR    asyncio:base_events.py:1753 Task exception was never retrieved
future: <Task finished name='Task-65302' coro=<Shuffle.receive() done, defined at /home/runner/work/distributed/distributed/distributed/shuffle/_shuffle_extension.py:142> exception=FileNotFoundError(2, 'No such file or directory')>
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/shuffle/_shuffle_extension.py", line 148, in receive
    raise self._exception
  File "/home/runner/work/distributed/distributed/distributed/shuffle/_shuffle_extension.py", line 172, in receive
    await self.multi_file.put(groups)
  File "/home/runner/work/distributed/distributed/distributed/shuffle/_multi_file.py", line 124, in put
    raise self._exception
  File "/home/runner/work/distributed/distributed/distributed/shuffle/_multi_file.py", line 202, in process
    with open(
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/dask-worker-space/worker-ihvksve8/shuffle-3110a8a90a5b642409b0a20f83b03722/1'
- generated xml file: /home/runner/work/distributed/distributed/reports/pytest.xml -

cc @fjetter as I know you've made some shuffle-related changes recently (not sure if they're related though)
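
For reference, the leaked thread above is parked in `work_queue.get(block=True)`, which is where an idle `ThreadPoolExecutor` worker waits for its next task. Below is a minimal standard-library sketch (not distributed code) of how such a thread stays alive until the executor is shut down. The prefix `demo-pool` and the helper `pool_threads` are hypothetical names used only for illustration, and whether a missing shutdown is what actually happens in test_bad_disk is an assumption that the log does not confirm.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical prefix, chosen only so the worker thread is easy to find.
PREFIX = "demo-pool"


def pool_threads():
    """Return the live threads that belong to the demo executor."""
    return [t for t in threading.enumerate() if t.name.startswith(PREFIX)]


executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix=PREFIX)
executor.submit(sum, [1, 2, 3]).result()

# The task has finished, but the worker thread is still alive and
# blocked waiting for more work.
leaked_before_shutdown = len(pool_threads())

# shutdown(wait=True) wakes the idle worker and joins it.
executor.shutdown(wait=True)
leaked_after_shutdown = len(pool_threads())

print(leaked_before_shutdown, leaked_after_shutdown)
```

If the executor is never shut down on the error path, a thread-leak check that runs right after the test would still see the idle worker.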

Labels: flaky test (Intermittent failures on CI.), tests (Unit tests and/or continuous integration)
