Closed
Labels
deadlock: The cluster appears to not make any progress
Description
- A task transitions to fetch and is added to data_needed
- _ensure_communicating runs, but the network is saturated so the task is not popped from data_needed
- Task is forgotten
- Task is recreated from scratch and transitioned to fetch again
- BUG: data_needed.push silently does nothing, because it still contains the forgotten task, which is a different TaskState instance
- _ensure_communicating runs. It pops the forgotten task and discards it.
- We now have a task stuck forever in fetch state. If validate=True, validate_state() fails.
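The sequence above can be illustrated with a minimal, self-contained sketch. `Task` and `KeyDedupHeap` are hypothetical stand-ins, not the real `TaskState` or `UniqueTaskHeap`; the sketch assumes only that the heap deduplicates by task key, which is enough to make a push of a new instance a silent no-op.

```python
import heapq


class Task:
    """Hypothetical stand-in for TaskState: two instances may share a key."""

    def __init__(self, key, priority=0):
        self.key = key
        self.priority = priority
        self.state = "fetch"


class KeyDedupHeap:
    """Hypothetical heap that deduplicates by task key (an assumption made
    for illustration; not the actual distributed code)."""

    def __init__(self):
        self._known = set()
        self._heap = []

    def push(self, task):
        if task.key in self._known:
            return  # silently ignored, even if `task` is a new instance
        self._known.add(task.key)
        # id(task) breaks priority ties so Task objects are never compared
        heapq.heappush(self._heap, (task.priority, id(task), task))

    def pop(self):
        _, _, task = heapq.heappop(self._heap)
        self._known.discard(task.key)
        return task

    def __len__(self):
        return len(self._heap)


def reproduce():
    data_needed = KeyDedupHeap()
    old = Task("x")
    data_needed.push(old)        # task transitions to fetch
    old.state = "forgotten"      # forgotten, but still inside data_needed
    new = Task("x")              # recreated from scratch, same key
    data_needed.push(new)        # BUG: no-op, the key is already known
    popped = data_needed.pop()   # the stale instance comes out
    return popped is old, popped.state, len(data_needed), new.state
```

After `reproduce()` runs, the heap is empty while the new instance is still in `fetch`, which mirrors the stuck task described above.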
Log
2022-05-30 16:35:50,293 - distributed.core - ERROR - Invalid TaskState encountered for <TaskState 'inc-60d23875ec5497d5404aad1ce8fcd252' fetch>.
Story:
[
('inc-60d23875ec5497d5404aad1ce8fcd252', 'ensure-task-exists', 'released', 'compute-task-1653924950.2773008', 1653924950.279128),
('inc-60d23875ec5497d5404aad1ce8fcd252', 'released', 'fetch', 'fetch', {}, 'compute-task-1653924950.2773008', 1653924950.2794898),
('inc-60d23875ec5497d5404aad1ce8fcd252', 'release-key', 'worker-connect-1653924950.127911', 1653924950.2856364),
('inc-60d23875ec5497d5404aad1ce8fcd252', 'fetch', 'released', 'released', {'inc-60d23875ec5497d5404aad1ce8fcd252': 'forgotten'}, 'worker-connect-1653924950.127911', 1653924950.2856448),
('inc-60d23875ec5497d5404aad1ce8fcd252', 'released', 'forgotten', 'forgotten', {}, 'worker-connect-1653924950.127911', 1653924950.2856493),
('inc-60d23875ec5497d5404aad1ce8fcd252', 'ensure-task-exists', 'released', 'compute-task-1653924950.2900314', 1653924950.292284),
('inc-60d23875ec5497d5404aad1ce8fcd252', 'released', 'fetch', 'fetch', {}, 'compute-task-1653924950.2900314', 1653924950.2924364),
]
Traceback (most recent call last):
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4214, in validate_task
    self.validate_task_fetch(ts)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4155, in validate_task_fetch
    assert ts in self.data_needed
AssertionError
Reproducer
import asyncio

# freeze_data_fetching is introduced by #6342
from distributed.utils_test import freeze_data_fetching, gen_cluster, inc, wait_for_state


@gen_cluster(client=True)
async def test_forget_data_needed(c, s, a, b):
    """
    1. A task transitions to fetch and is added to data_needed
    2. _ensure_communicating runs, but the network is saturated so the task is not
       popped from data_needed
    3. Task is forgotten
    4. Task is recreated from scratch and transitioned to fetch again
    5. BUG: data_needed.push silently does nothing, because it still contains the
       forgotten task, which is a different TaskState instance
    6. _ensure_communicating runs. It pops the forgotten task and discards it.
    7. We now have a task stuck in fetch state.
    """
    x = c.submit(inc, 1, key="x", workers=[a.address])
    with freeze_data_fetching(b):
        y = c.submit(inc, x, key="y", workers=[b.address])
        await wait_for_state("x", "fetch", b)
        x.release()
        y.release()
        # Wait until the tasks have been forgotten everywhere
        while s.tasks or a.tasks or b.tasks:
            await asyncio.sleep(0.01)

    x = c.submit(inc, 2, key="x", workers=[a.address])
    y = c.submit(inc, x, key="y", workers=[b.address])
    assert await y == 4

The test deadlocks on the last line.
freeze_data_fetching is from #6342.
Proposed design
- Replace UniqueTaskHeap with a generic HeapSet, which is a MutableSet where pop() returns the first element according to an arbitrary key function. You can discard keys from the middle of the heap, like in any other set. Internally it's implemented using weakrefs.
- Proactively discard tasks from data_needed whenever they transition away from fetch.
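A rough sketch of what such a structure could look like follows. `HeapSetSketch` and `Task` are hypothetical names, and the lazy-deletion strategy (stale heap entries are skipped at pop time) is an assumption made for illustration, not necessarily how the proposed HeapSet is implemented.

```python
import heapq
import itertools
import weakref
from collections.abc import MutableSet


class HeapSetSketch(MutableSet):
    """Set whose pop() returns the element with the smallest key(element)."""

    def __init__(self, key):
        self._key = key
        self._data = set()  # source of truth for membership
        self._heap = []     # (key value, insertion counter, weakref to element)
        self._counter = itertools.count()

    def __contains__(self, item):
        return item in self._data

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def add(self, item):
        if item in self._data:
            return
        self._data.add(item)
        # The counter breaks ties, so weakrefs are never compared
        entry = (self._key(item), next(self._counter), weakref.ref(item))
        heapq.heappush(self._heap, entry)

    def discard(self, item):
        # The heap entry is left behind and skipped lazily in pop()
        self._data.discard(item)

    def pop(self):
        while self._heap:
            _, _, ref = heapq.heappop(self._heap)
            item = ref()
            if item is not None and item in self._data:
                self._data.remove(item)
                return item
        raise KeyError("pop from an empty HeapSetSketch")


class Task:
    """Hypothetical stand-in for TaskState (hashable by identity)."""

    def __init__(self, key, priority):
        self.key = key
        self.priority = priority


def demo():
    data_needed = HeapSetSketch(key=lambda ts: ts.priority)
    old, other = Task("x", 1), Task("z", 2)
    data_needed.add(old)
    data_needed.add(other)
    data_needed.discard(old)  # task transitions away from fetch
    new = Task("x", 1)        # recreated from scratch, same key
    data_needed.add(new)      # accepted: membership is per instance
    return data_needed.pop() is new, data_needed.pop() is other, len(data_needed)
```

Because membership is tracked per instance rather than per key, a recreated task is accepted even though an older instance with the same key was once in the set, and discarding on every transition away from fetch keeps forgotten instances from being popped later.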