Flaky distributed/tests/test_client.py::test_threadsafe_compute #7336

Open
gjoseph92 opened this issue Nov 18, 2022 · 0 comments
Labels
flaky test Intermittent failures on CI.

Comments

@gjoseph92
Collaborator

2022-11-18 05:04:47,523 - distributed.scheduler - ERROR - Error transitioning "('arange-sum-1960f5e08fa9c9689f5e9ef3be470377', 8)" from 'processing' to 'memory'

Traceback (most recent call last):

  File "D:\a\distributed\distributed\distributed\scheduler.py", line 1842, in _transition

    recommendations, client_msgs, worker_msgs = func(

  File "D:\a\distributed\distributed\distributed\scheduler.py", line 2359, in transition_processing_memory

    self._exit_processing_common(ts, recommendations)

  File "D:\a\distributed\distributed\distributed\scheduler.py", line 3201, in _exit_processing_common

    for qts in self._next_queued_tasks_for_worker(ws):

  File "D:\a\distributed\distributed\distributed\scheduler.py", line 3218, in _next_queued_tasks_for_worker

    assert qts.state == "queued", qts.state

AssertionError: forgotten

2022-11-18 05:04:47,523 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:52864', status: running, memory: 7, processing: 1>

2022-11-18 05:04:47,523 - distributed.core - INFO - Connection to tcp://127.0.0.1:52859 has been closed.

2022-11-18 05:04:47,523 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:52864

2022-11-18 05:04:47,523 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:52864. Reason: worker-handle-scheduler-connection-broken

2022-11-18 05:04:47,820 - distributed.core - ERROR - forgotten

https://github.com/dask/distributed/actions/runs/3494272750/jobs/5849912273#step:18:352

This is weird, and may indicate an actual bug. I'm not sure how a task that is still in the queue could be in state forgotten: the queued->released transition removes the task from scheduler.queue, so any queued->forgotten transition would go through there first and the task should no longer be in the queue by the time it is forgotten.

But given that the whole process dies, perhaps this is a red herring: something much deeper may be broken, and this failed transition just happens to be one manifestation of it.
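To make the invariant above concrete, here is a minimal toy model of it. This is not the actual `distributed` scheduler code: `ToyTask`, `ToyScheduler`, and all of their methods are hypothetical names invented for illustration, and the real scheduler's queue and transition machinery are considerably more involved. The sketch only shows why, if queued->released always removes the task from the queue, the `assert qts.state == "queued"` check should never see a forgotten task, and what it looks like when something changes a task's state without going through that transition.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare tasks by identity, not by field values
class ToyTask:
    """Hypothetical stand-in for a scheduler task state object."""
    key: str
    state: str = "released"


@dataclass
class ToyScheduler:
    """Hypothetical, heavily simplified scheduler with a task queue."""
    queue: list = field(default_factory=list)

    def enqueue(self, ts):
        ts.state = "queued"
        self.queue.append(ts)

    def transition_queued_released(self, ts):
        # Leaving the "queued" state always removes the task from the queue.
        self.queue.remove(ts)
        ts.state = "released"

    def transition_released_forgotten(self, ts):
        ts.state = "forgotten"

    def next_queued_tasks(self, n):
        # Mirrors the shape of the failing assertion in the traceback.
        for qts in self.queue[:n]:
            assert qts.state == "queued", qts.state
            yield qts


def demo():
    s = ToyScheduler()
    a, b = ToyTask("a"), ToyTask("b")
    s.enqueue(a)
    s.enqueue(b)

    # Normal path: "a" is released (and dequeued) before being forgotten.
    s.transition_queued_released(a)
    s.transition_released_forgotten(a)
    healthy = [ts.key for ts in s.next_queued_tasks(2)]

    # Broken path: state changes without going through queued->released,
    # so a forgotten task is still sitting in the queue.
    b.state = "forgotten"
    msg = None
    try:
        list(s.next_queued_tasks(2))
    except AssertionError as e:
        msg = str(e)
    return healthy, msg


if __name__ == "__main__":
    print(demo())
```

In this toy, the only way to trigger the assertion is to bypass the transition that dequeues the task, which is consistent with the suspicion that either a transition path is missing the dequeue step or some deeper state corruption is involved.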

@gjoseph92 added the "flaky test" (Intermittent failures on CI.) label on Nov 18, 2022