
Flaky test_upload_directory: assert self._state.pid is not None, "can only join a started process" #6973

Open

gjoseph92 opened this issue Aug 30, 2022 · 0 comments
Labels: flaky test (Intermittent failures on CI)

@gjoseph92 (Collaborator)
Closely related to #6972 (though that won't fix this one).

In this test, the UploadDirectory plugin (with restart=True) triggers a worker restart. The restart fails for some reason, which then leads to a further failure in WorkerProcess.kill. Somehow the AsyncProcess instance seems to get swapped out for a fresh one that has no PID?
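For context, the assertion comes from AsyncProcess.join in distributed/process.py, which refuses to join a process that was never started. Here is a minimal, hypothetical sketch of the condition the Nanny appears to end up in; it just calls join() on an AsyncProcess before start(), assuming the AsyncProcess(target=...) constructor and the async join(timeout) signature, and it only demonstrates what the assertion guards against, not the race that produces it:

```python
# Minimal sketch (not distributed's actual code path): joining an AsyncProcess
# that was never started trips the same assertion shown in the traceback below.
import asyncio

from distributed.process import AsyncProcess


async def main():
    proc = AsyncProcess(target=print)  # note: proc.start() is never called
    # self._state.pid is still None here, so this raises:
    # AssertionError: can only join a started process
    await proc.join(timeout=1)


asyncio.run(main())
```

If the concurrent restart really does replace the Nanny's process handle with a fresh, not-yet-started AsyncProcess while kill() is still in flight, kill() would end up joining exactly such an un-started process, matching the traceback below.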

____________________________ test_upload_directory _____________________________

c = <Client: 'tcp://127.0.0.1:55556' processes=0 threads=0, memory=0 B>
s = <Scheduler 'tcp://127.0.0.1:55556', workers: 0, cores: 0, tasks: 0>
a = <Nanny: None, threads: 1>, b = <Nanny: None, threads: 2>
tmp_path = PosixPath('/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pytest-of-runner/pytest-0/test_upload_directory0')

    @gen_cluster(client=True, Worker=Nanny)
    async def test_upload_directory(c, s, a, b, tmp_path):
        from dask.distributed import UploadDirectory
    
        # Be sure to exclude code coverage reports
        files_start = {f for f in os.listdir() if not f.startswith(".coverage")}
    
        with open(tmp_path / "foo.py", "w") as f:
            f.write("x = 123")
        with open(tmp_path / "bar.py", "w") as f:
            f.write("from foo import x")
    
        plugin = UploadDirectory(tmp_path, restart=True, update_path=True)
        await c.register_worker_plugin(plugin)
    
        [name] = a.plugins
        assert os.path.split(tmp_path)[-1] in name
    
        def f():
            import bar
    
            return bar.x
    
        results = await c.run(f)
>       assert results[a.worker_address] == 123
E       KeyError: 'tcp://127.0.0.1:55563'

distributed/tests/test_client.py:7046: KeyError
----------------------------- Captured stdout call -----------------------------
Dumped cluster state to test_cluster_dump/test_upload_directory.yaml
----------------------------- Captured stderr call -----------------------------
2022-08-29 22:49:40,038 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:55563
2022-08-29 22:49:40,039 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:55563
2022-08-29 22:49:40,039 - distributed.worker - INFO -           Worker name:                          0
2022-08-29 22:49:40,039 - distributed.worker - INFO -          dashboard at:            127.0.0.1:55565
2022-08-29 22:49:40,039 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:55556
2022-08-29 22:49:40,039 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,040 - distributed.worker - INFO -               Threads:                          1
2022-08-29 22:49:40,040 - distributed.worker - INFO -                Memory:                  14.00 GiB
2022-08-29 22:49:40,040 - distributed.worker - INFO -       Local Directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/dask-worker-space/worker-7tby39oh
2022-08-29 22:49:40,040 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,048 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:55564
2022-08-29 22:49:40,049 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:55564
2022-08-29 22:49:40,049 - distributed.worker - INFO -           Worker name:                          1
2022-08-29 22:49:40,049 - distributed.worker - INFO -          dashboard at:            127.0.0.1:55567
2022-08-29 22:49:40,049 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:55556
2022-08-29 22:49:40,049 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,049 - distributed.worker - INFO -               Threads:                          2
2022-08-29 22:49:40,049 - distributed.worker - INFO -                Memory:                  14.00 GiB
2022-08-29 22:49:40,050 - distributed.worker - INFO -       Local Directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/dask-worker-space/worker-xvsbrqvh
2022-08-29 22:49:40,050 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,937 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:55556
2022-08-29 22:49:40,938 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,939 - distributed.core - INFO - Starting established connection
2022-08-29 22:49:40,944 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:55556
2022-08-29 22:49:40,945 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,947 - distributed.core - INFO - Starting established connection
2022-08-29 22:49:41,020 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:55563
2022-08-29 22:49:41,021 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:55564
2022-08-29 22:49:41,024 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-682279b9-b660-4e16-9224-900996085633 Address tcp://127.0.0.1:55564 Status: Status.closing
2022-08-29 22:49:41,024 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-afb76a22-adab-416a-9920-40dd76207d8d Address tcp://127.0.0.1:55563 Status: Status.closing
2022-08-29 22:49:42,301 - distributed.nanny - WARNING - Worker process still alive after 1.2799989318847658 seconds, killing
2022-08-29 22:49:44,265 - distributed.nanny - WARNING - Worker process still alive after 1.279999237060547 seconds, killing
2022-08-29 22:49:44,272 - distributed.nanny - ERROR - Restart timed out after 30s; returning before finished
2022-08-29 22:49:44,273 - distributed.nanny - ERROR - Restart timed out after 30s; returning before finished
2022-08-29 22:49:44,365 - distributed.nanny - WARNING - Restarting worker
2022-08-29 22:49:44,368 - distributed.nanny - WARNING - Restarting worker
2022-08-29 22:49:44,453 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 796, in kill
    await process.join(wait_timeout)
  File "/Users/runner/work/distributed/distributed/distributed/process.py", line 311, in join
    assert self._state.pid is not None, "can only join a started process"
AssertionError: can only join a started process
2022-08-29 22:49:44,455 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 796, in kill
    await process.join(wait_timeout)
  File "/Users/runner/work/distributed/distributed/distributed/process.py", line 311, in join
    assert self._state.pid is not None, "can only join a started process"
AssertionError: can only join a started process
gjoseph92 added the flaky test label on Aug 30, 2022