Skip to content

DEADLOCK: fetch->forgotten->fetch #6480

@crusaderky

Description

@crusaderky
  1. A task transitions to fetch and is added to data_needed
  2. _ensure_communicating runs, but the network is saturated so the task is not popped from data_needed
  3. Task is forgotten
  4. Task is recreated from scratch and transitioned to fetch again
  5. BUG: data_needed.push silently does nothing, because it still contains the forgotten task, which is a different TaskState instance
  6. _ensure_communicating runs. It pops the forgotten task and discards it.
  7. We now have a task stuck forever in fetch state. If validate=True, validate_state() fails.

Log

2022-05-30 16:35:50,293 - distributed.core - ERROR - Invalid TaskState encountered for <TaskState 'inc-60d23875ec5497d5404aad1ce8fcd252' fetch>.
Story:
[
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'ensure-task-exists', 'released', 'compute-task-1653924950.2773008', 1653924950.279128),
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'released', 'fetch', 'fetch', {}, 'compute-task-1653924950.2773008', 1653924950.2794898),
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'release-key', 'worker-connect-1653924950.127911', 1653924950.2856364), 
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'fetch', 'released', 'released', {'inc-60d23875ec5497d5404aad1ce8fcd252': 'forgotten'}, 'worker-connect-1653924950.127911', 1653924950.2856448),
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'released', 'forgotten', 'forgotten', {}, 'worker-connect-1653924950.127911', 1653924950.2856493),
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'ensure-task-exists', 'released', 'compute-task-1653924950.2900314', 1653924950.292284),
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'released', 'fetch', 'fetch', {}, 'compute-task-1653924950.2900314', 1653924950.2924364),
]
Traceback (most recent call last):
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4214, in validate_task
    self.validate_task_fetch(ts)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4155, in validate_task_fetch
    assert ts in self.data_needed
AssertionError

Reproducer

@gen_cluster(client=True)
async def test_forget_data_needed(c, s, a, b):
    """
    1. A task transitions to fetch and is added to data_needed
    2. _ensure_communicating runs, but the network is saturated so the task is not
       popped from data_needed
    3. Task is forgotten
    4. Task is recreated from scratch and transitioned to fetch again
    5. BUG: data_needed.push silently does nothing, because it still contains the
       forgotten task, which is a different TaskState instance
    6. _ensure_communicating runs. It pops the forgotten task and discards it.
    7. We now have a task stuck in fetch state.
    """
    x = c.submit(inc, 1, key="x", workers=[a.address])
    with freeze_data_fetching(b):
        y = c.submit(inc, x, key="y", workers=[b.address])
        await wait_for_state("x", "fetch", b)
        x.release()
        y.release()
        while s.tasks or a.tasks or b.tasks:
            await asyncio.sleep(0.01)

    x = c.submit(inc, 2, key="x", workers=[a.address])
    y = c.submit(inc, x, key="y", workers=[b.address])
    assert await y == 4

The test deadlock on the last line.
freeze_data_fetching is from #6342.

Proposed design

  • Replace UniqueTaskHeap with a generic HeapSet, which is a MutableSet where pop() returns the first element according to an arbitrary key function. You can discard keys from the middle of the heap, like in any other set. Internally it's implemented using weakrefs.
  • proactively discard tasks from data_needed whenever they transition away from fetch.

Metadata

Metadata

Assignees

Labels

deadlockThe cluster appears to not make any progress

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions