
Automatically restart P2P shuffles when output worker leaves #7970

Merged
88 commits merged into dask:main on Jul 24, 2023

Conversation

@hendrikmakait (Member) commented Jul 5, 2023

Closes #7353

Blocked by and includes #7967
Blocked by and includes #7979
Blocked by and includes #7981
Blocked by and includes #7974

  • Tests added / passed
  • Passes pre-commit run --all-files

github-actions bot (Contributor) commented Jul 5, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

20 files ±0   20 suites ±0   11h 35m 10s ⏱️ +10h 14m 44s
3 738 tests +434   3 623 ✔️ +553   106 💤 −126   9 ❌ +8
36 154 runs +29 467   34 397 ✔️ +28 504   1 748 💤 +971   9 ❌ +8

For more details on these failures, see this check.

Results for commit 5764f45. ± Comparison against base commit efc7eeb.

This pull request skips 1 and un-skips 138 tests.
distributed.shuffle.tests.test_merge
distributed.cli.tests.test_dask_worker ‑ test_dashboard_non_standard_ports
distributed.cli.tests.test_dask_worker ‑ test_listen_address_ipv6[tcp://:---nanny]
distributed.cli.tests.test_dask_worker ‑ test_listen_address_ipv6[tcp://:---no-nanny]
distributed.cli.tests.test_dask_worker ‑ test_listen_address_ipv6[tcp://[::1]:---nanny]
distributed.cli.tests.test_dask_worker ‑ test_listen_address_ipv6[tcp://[::1]:---no-nanny]
distributed.comm.tests.test_comms ‑ test_default_client_server_ipv6[asyncio]
distributed.comm.tests.test_comms ‑ test_default_client_server_ipv6[tornado]
distributed.comm.tests.test_comms ‑ test_tcp_client_server_ipv6[asyncio]
distributed.comm.tests.test_comms ‑ test_tcp_client_server_ipv6[tornado]
distributed.comm.tests.test_comms ‑ test_tls_client_server_ipv6[asyncio]
…

♻️ This comment has been updated with latest results.

distributed/scheduler.py — two outdated review threads (resolved)
@hendrikmakait (Member Author) commented:

@wence-: Your feedback should be addressed and blocking PRs are merged, so this should be good for another round.

@phofl (Collaborator) left a comment:

Generally looks good to me, but not familiar enough with the code-base to be the final reviewer

@fjetter (Member) left a comment:

Most of my comments are nits and you can ignore or use them.

The one thing I'm concerned about is that the coverage report indicates some rather critical code sections are not covered. We should look into that before merging

Comment on lines 398 to 404
 try:
-    shuffle = self.states[shuffle_id]
+    shuffle = self.active_shuffles[shuffle_id]
 except KeyError:
-    return
-self._fail_on_workers(shuffle, message=f"{shuffle} forgotten")
-self._clean_on_scheduler(shuffle_id)
+    pass
+else:
+    self._fail_on_workers(shuffle, message=f"{shuffle} forgotten")
+    self._clean_on_scheduler(shuffle_id, stimulus_id=stimulus_id)
Member:

This is more a style note and I typically try to avoid style questions in a PR review. Still, this feels a bit convoluted. I believe something like

if shuffle := self.active_shuffles.get(shuffle_id):
    self._fail_on_workers(shuffle, message=f"{shuffle} forgotten")
    self._clean_on_scheduler(shuffle_id, stimulus_id=stimulus_id)
elif finish == "forgotten":
    ...

is easier to read than try/except/pass/else, with or without the walrus.

Member:

Feel free to ignore. Logic is the same in the end

Member Author:

Fair point, this bit has gotten out of hand.

Comment on lines 54 to 55
def __eq__(self, other: Any) -> bool:
    return type(other) == type(self) and other.run_id == self.run_id
Member:

This appears not to be covered by tests. Why is it necessary then?

Member Author:

It had some use in an earlier iteration. Removed.

Contributor:

My guess is because __hash__ is now not the default and this addition of __eq__ ensures that __eq__ and the newly defined __hash__ are consistent.

Contributor:

I suppose this is because the run_id is a unique token that defines the shuffle state object.

Member Author:

I suppose this is because the run_id is a unique token that defines the shuffle state object.

That's why I initially had a new __eq__. Then __hash__ had to match it. Now I'm only using __hash__, so I think there's no need for a custom __eq__ that could potentially get outdated.
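For illustration, a minimal sketch of the remaining piece, assuming __hash__ is keyed on run_id (the exact attribute choice is made by the PR, not by this sketch):

def __hash__(self) -> int:
    # Hash on the unique run_id; with the default identity-based __eq__,
    # equal objects are the same object, so hash/eq consistency holds trivially.
    return hash(self.run_id)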

Comment on lines 97 to 98
except ShuffleClosedError:
    raise Reschedule()
Member:

This lack of coverage makes me nervous. I think around the barrier there are various races we should test.

@hendrikmakait (Member Author) commented Jul 21, 2023:

I haven't been able to come up with a scenario where this would be triggered (and relevant), so I've removed it for now. If this ever pops up for somebody, I hope they'll send a bug report our way.

self._clean_on_scheduler(shuffle_id, stimulus_id=stimulus_id)

if finish == "forgotten":
    shuffles = self._shuffles.pop(shuffle_id)
Member:

IIUC this entire logic is just there to clean up. Behavior would not be impacted if we didn't do any of this, correct?

Member Author:

Yes, this is state cleanup on the scheduler plugin.

recs.update({dt.key: "released"})

if barrier_task.state == "erred":
    return {}  # pragma: no cover
Member:

why don't you want coverage to detect this? Seems like an important case

Member Author:

Added a comment to explain

Contributor:

This seems like an ideal case for an assert False, "Invariant broken" ?

Member Author:

That would also work. I'm wondering whether assert False is the right thing to add here, given that PYTHONOPTIMIZE strips asserts. It would work as an addition, though.

Member Author:

Raising a RuntimeError now.
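A hedged sketch of what that replacement might look like (not the exact diff; the message text is illustrative):

if barrier_task.state == "erred":
    # Instead of silently bailing out, surface the broken invariant loudly.
    # Unlike a bare assert, this survives PYTHONOPTIMIZE.
    raise RuntimeError(
        f"Expected barrier task {barrier_task.key!r} not to be in state 'erred'."
    )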


for dt in barrier_task.dependencies:
    if dt.state == "erred":
        return {}  # pragma: no cover
Member:

same here

Member Author:

Added a comment to explain

Contributor:

Similarly here.

Comment on lines 879 to 880
while self._runs:
    await asyncio.sleep(0.1)
Member:

I don't insist on this but I don't like these sleep patterns.

def __init__(...):
    self._runs_condition = asyncio.Condition()

async def _close_shuffle_run(self, shuffle: ShuffleRun) -> None:
    await shuffle.close()
    async with self._runs_condition:
        self._runs.remove(shuffle)
        self._runs_condition.notify_all()

async def teardown(self, worker: Worker) -> None:
    ...
    async with self._runs_condition:
        await self._runs_condition.wait_for(lambda: not self._runs)

would be a clean alternative. Many people consider Conditions too complex but what I like about them is that they make this relationship very clear (and they unblock immediately which is nice for testing and such things).

As I said, I don't insist on this

Member:

A more serious question: is it possible for _runs to be repopulated at this point, or are we locking everything up properly so that this cannot happen once we reach this point?

Member Author:

Makes sense! I've added the condition. At this point the plugin is closed which will raise a ShuffleClosedError before a new run can be added.

Contributor:

_runs is added to in _refresh_shuffle which doesn't have a lock associated with it. But I am not sure if that can be running simultaneously with teardown.

Member Author:

teardown sets

async def teardown(self, worker: Worker) -> None:
    assert not self.closed
    self.closed = True

Once that is done,

if self.closed:
    raise ShuffleClosedError(f"{self} has already been closed")

will raise. We don't yield the event loop in between that raise and self._runs.add(shuffle).
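An illustrative sketch of that ordering argument (simplified; _fetch_shuffle_metadata is a hypothetical stand-in for the scheduler round-trip, not the actual method name):

async def _refresh_shuffle(self, shuffle_id: ShuffleId) -> ShuffleRun:
    result = await self._fetch_shuffle_metadata(shuffle_id)  # may yield to the event loop
    if self.closed:
        raise ShuffleClosedError(f"{self} has already been closed")
    # _create_shuffle_run mutates self._runs synchronously, so teardown()
    # cannot interleave between the check above and the add.
    return self._create_shuffle_run(shuffle_id, result)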

@gen_cluster(
    client=True,
    nthreads=[("", 1)] * 2,
    config={"distributed.scheduler.allowed-failures": 0},
Member:

does this mean that P2P is now retried allowed-failures times if a worker goes OOM?

Member:

Wouldn't be a dealbreaker but I also don't think this is useful. It's very unlikely that another P2P run attempt would be more successful.

However, there are of course also cases like spot interruption where this matters... Never mind!

@hendrikmakait (Member Author) commented Jul 20, 2023:

does this mean that P2P is now retried allowed-failures times if a worker goes OOM?

Yes, as there might be other causes apart from an output partition being too large.
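For reference, a sketch of how that retry budget is controlled; the value here is illustrative, not a recommendation:

import dask

# allowed-failures bounds how often a task may be implicated in worker deaths
# before being marked erred; with restartable P2P, a failed shuffle run can
# therefore be retried up to that many times.
with dask.config.set({"distributed.scheduler.allowed-failures": 2}):
    ...  # build and compute the shuffle here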

@@ -578,12 +699,39 @@ async def test_closed_worker_during_unpack(c, s, a, b):
        freq="10 s",
    )
    out = dd.shuffle.shuffle(df, "x", shuffle="p2p")
    out = out.persist()
    x, y = c.compute([df.x.size, out.x.size])
    await wait_for_tasks_in_state("shuffle-p2p", "memory", 1, b)
Member:

I guess this is out of scope for this PR, but I think it would make sense to have an API to easily get access to the actual shuffle instances held by the plugins, plus a stage attribute that indicates whether we're in the transfer, barrier, or unpack stage.
I would find this kind of verification nicer than waiting for task states.
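A purely hypothetical sketch of such a test helper (nothing like this exists in the PR; the plugin lookup and the stage attribute are assumptions):

def get_shuffle_run(worker: Worker, shuffle_id: ShuffleId) -> ShuffleRun:
    # Assumed lookup; the worker-side plugin keeps its active runs in .shuffles.
    plugin = worker.plugins["shuffle"]  # registration name is an assumption
    return plugin.shuffles[shuffle_id]

# A test could then wait on e.g. run.stage == "barrier" instead of task states.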


Comment on lines +835 to +889
 def _create_shuffle_run(
     self, shuffle_id: ShuffleId, result: dict[str, Any]
 ) -> ShuffleRun:
     shuffle: ShuffleRun
     if result["type"] == ShuffleType.DATAFRAME:
-        shuffle = DataFrameShuffleRun(
-            column=result["column"],
-            worker_for=result["worker_for"],
-            output_workers=result["output_workers"],
-            id=shuffle_id,
-            run_id=result["run_id"],
-            directory=os.path.join(
-                self.worker.local_directory,
-                f"shuffle-{shuffle_id}-{result['run_id']}",
-            ),
-            executor=self._executor,
-            local_address=self.worker.address,
-            rpc=self.worker.rpc,
-            scheduler=self.worker.scheduler,
-            memory_limiter_disk=self.memory_limiter_disk,
-            memory_limiter_comms=self.memory_limiter_comms,
-        )
+        shuffle = self._create_dataframe_shuffle_run(shuffle_id, result)
     elif result["type"] == ShuffleType.ARRAY_RECHUNK:
-        shuffle = ArrayRechunkRun(
-            worker_for=result["worker_for"],
-            output_workers=result["output_workers"],
-            old=result["old"],
-            new=result["new"],
-            id=shuffle_id,
-            run_id=result["run_id"],
-            directory=os.path.join(
-                self.worker.local_directory,
-                f"shuffle-{shuffle_id}-{result['run_id']}",
-            ),
-            executor=self._executor,
-            local_address=self.worker.address,
-            rpc=self.worker.rpc,
-            scheduler=self.worker.scheduler,
-            memory_limiter_disk=self.memory_limiter_disk,
-            memory_limiter_comms=self.memory_limiter_comms,
-        )
+        shuffle = self._create_array_rechunk_run(shuffle_id, result)
     else:  # pragma: no cover
         raise TypeError(result["type"])
     self.shuffles[shuffle_id] = shuffle
     self._runs.add(shuffle)
     return shuffle

 def _create_dataframe_shuffle_run(
     self, shuffle_id: ShuffleId, result: dict[str, Any]
 ) -> DataFrameShuffleRun:
     return DataFrameShuffleRun(
         column=result["column"],
         worker_for=result["worker_for"],
         output_workers=result["output_workers"],
         id=shuffle_id,
         run_id=result["run_id"],
         directory=os.path.join(
             self.worker.local_directory,
             f"shuffle-{shuffle_id}-{result['run_id']}",
         ),
         executor=self._executor,
         local_address=self.worker.address,
         rpc=self.worker.rpc,
         scheduler=self.worker.scheduler,
         memory_limiter_disk=self.memory_limiter_disk,
         memory_limiter_comms=self.memory_limiter_comms,
     )

 def _create_array_rechunk_run(
     self, shuffle_id: ShuffleId, result: dict[str, Any]
 ) -> ArrayRechunkRun:
     return ArrayRechunkRun(
         worker_for=result["worker_for"],
         output_workers=result["output_workers"],
         old=result["old"],
         new=result["new"],
         id=shuffle_id,
         run_id=result["run_id"],
         directory=os.path.join(
             self.worker.local_directory,
             f"shuffle-{shuffle_id}-{result['run_id']}",
         ),
         executor=self._executor,
         local_address=self.worker.address,
         rpc=self.worker.rpc,
         scheduler=self.worker.scheduler,
         memory_limiter_disk=self.memory_limiter_disk,
         memory_limiter_comms=self.memory_limiter_comms,
     )

Member Author:

Cosmetic refactoring to make it easier to understand whether we could potentially encounter races.

@wence- (Contributor) left a comment:

Not sure about the maintenance of invariants in remove_worker; also a request similar to @fjetter's for some coverage of edge cases.


Comment on lines 169 to 170
if worker not in self.scheduler.workers:
    raise RuntimeError(f"Scheduler is unaware of this worker {worker!r}")
Contributor:

Can this be tested by retiring a worker during a shuffle in a test?

Member Author:

I haven't been able to come up with a scenario where this would happen, but given how messy worker shutdown can be, I'm not 100% certain this would never happen. Left it in with a note for now.

Comment on lines 367 to 368
if worker not in shuffle.participating_workers:
    continue
Contributor:

Test by adding a worker to the cluster and then restarting a shuffle?

Member Author:

refactored.


stimulus_id = f"shuffle-failed-worker-left-{time()}"
self._restart_shuffle(shuffle.id, scheduler, stimulus_id=stimulus_id)
Contributor:

OK, so first we restart all shuffles that were interrupted.

Contributor:

I think that restarting this shuffle should remove it from _archived_by, but I do not see that happening. Do I have that right? Or does this somehow create a new shuffle object that has _archived_by = None? Otherwise it seems like it might get lost in _clean_on_scheduler.

Member Author:

Restarting a shuffle removes the ShuffleState from active states. The first shuffle_transfer task to call shuffle_get_or_create will cause the SchedulerPlugin to create a new ShuffleState with an incremented run_id and _archived_by = None.
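A hedged sketch of the flow described here (simplified; _run_id_counter, spec, and the ShuffleState signature are illustrative, not the actual plugin code):

def shuffle_get_or_create(self, shuffle_id: ShuffleId, spec: dict) -> ShuffleState:
    try:
        return self.active_shuffles[shuffle_id]
    except KeyError:
        pass
    # After a restart, the failed state is gone from active_shuffles, so the
    # first shuffle_transfer task to land here builds a fresh state with a
    # bumped run_id and no archive marker (_archived_by is None).
    self._run_id_counter += 1
    state = ShuffleState(id=shuffle_id, run_id=self._run_id_counter, spec=spec)
    self.active_shuffles[shuffle_id] = state
    return state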

# If processing the transactions causes a task to get released, this
# removes the shuffle from self.active_shuffles. Therefore, we must iterate
# over a copy.
for shuffle_id, shuffle in self.active_shuffles.copy().items():
Contributor:

Then we iterate over all active shuffles, remove and restart?

Why do we not unconditionally restart the archived shuffles after this loop over active shuffles?

Member Author:

I'm not 100% sure I'm following, but what I think you're saying is a very good point.


Comment on lines 901 to 902
async with self._runs_condition:
    await self._runs_condition.wait_for(lambda: not self._runs)
Contributor:

This lock protects _runs wrt _close_shuffle_run but not wrt _refresh_shuffle I think.

Member Author:

I've renamed it to _runs_cleanup_condition to highlight that it's only concerned with cleanup. There's a different mechanism in place for adding to self._runs. (Feel free to refactor in a follow-up if you see a good way of doing so.)

@hendrikmakait (Member Author) commented:

I've added another test; all feedback should now be addressed. For the # pragma: no cover code, I haven't been able to create a reproducer, but I think it doesn't hurt to leave those few places in the codebase just in case. This should be ready for another review.

@wence- (Contributor) left a comment:

To the best of my understanding, this looks right!

@hendrikmakait hendrikmakait merged commit f0303aa into dask:main Jul 24, 2023