Skip to content

Shuffle service resilience #6105

@mrocklin

Description

@mrocklin

In #5520 and #5976 and #6007 we've started a shuffle service. This has better memory characteristics, but is not resilient. In particular, it can break in a few ways:

  1. A worker holding shuffle outputs can die mid-shuffle
  2. New outputs of a shuffle can be requested by a client after the shuffle has started
  3. Output futures of a shuffle can be unrequested by a client

There are a few ways to solve this problem. One way I'd like to discuss here is opening up scheduler events to extensions, and letting them trigger transitions. In particular both scenarios 1 and 2 can be handled by letting the extension track remove_worker and update_graph events and restart all shuffle tasks if an output-holding worker dies, or if any of the existing shuffles occur in a new graph. Scenario 3 can be handled by letting the extension track transition events, and clean things up when the barrier task transitions out of memory.

So far, I think that this can solve all of the resilience issues in shuffling (at least everything I've come across). However, it introduces two possible concerns:

1 - Scheduler performance

Maybe it doesn't make sense for every transition to cycle through every extension to see if they care about transitions.

This doesn't seem to be that expensive in reality

In [1]: extensions = [object() for _ in range(10)]

In [2]: %%timeit
   ...: for i in range(1000):
   ...:     for extension in extensions:
   ...:         if hasattr(extension, "transition"):
   ...:             extension.transiiton()
   ...:
383 µs ± 7.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So around 40ns per extension per transition. It's well under a microsecond for all of our extensions.

2 - Complexity

Now any extension can inject transitions. Horrible horrible freedom!

My guess is that this is ok as long as we maintain some hygiene around, for example, always using the trasitions system, rather than mucking about with state directly

This is also a somewhat systemic change for what is, today, a single feature.

cc @gjoseph92 @fjetter for feedback

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions