
Automatically restart P2P shuffles that were aborted due to leaving workers #7353

Closed
fjetter opened this issue Nov 25, 2022 · 8 comments · Fixed by #7970

Comments

@fjetter
Member

fjetter commented Nov 25, 2022

With #7326 we guarantee that a shuffle consistently fails if a participating worker dies (input-only workers will also trigger this since we currently do not provide exactly-once guarantees, see #7324). This is a hard requirement since input tasks split their data by output partition and push these shards to the designated output workers.
If one of the output workers dies, we lose the data that was already pushed to it. The tasks assigned to the dead worker at that time are not representative of the lost data, so it is not sufficient to reschedule only those tasks; we need to reschedule all of them.

Whenever a shuffle is rescheduled, we will generate new metadata on the scheduler side with a newly calculated output worker mapping. We will increase the attempt or generation counter on this metadata to distinguish this run from the earlier one. All transfer tasks need to be rescheduled and will need to execute at least once using the new metadata.
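
A minimal sketch of what such a restart could look like, assuming a hypothetical ShuffleState dataclass that bundles the output-worker mapping with the attempt counter (names are illustrative, not the actual extension API):

```python
# Minimal sketch, not the actual extension code: bump the attempt counter and
# recompute the output-worker mapping over the surviving workers.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ShuffleState:
    id: str
    attempt: int                    # generation counter
    output_workers: dict[int, str]  # output partition -> worker address


def restart_shuffle(state: ShuffleState, remaining_workers: list[str]) -> ShuffleState:
    new_mapping = {
        partition: remaining_workers[partition % len(remaining_workers)]
        for partition in state.output_workers
    }
    # The incremented attempt lets workers tell this run apart from the aborted one.
    return replace(state, attempt=state.attempt + 1, output_workers=new_mapping)
```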

Rescheduling is not an atomic operation, and we have to assume that old-generation tasks are still running, or are about to run, while we reset the metadata on the scheduler, broadcast it to all workers, and transition the tasks.
Neither the scheduler nor the workers have a notion of 'attempts' or 'generations'. Therefore, any finished task from an earlier generation, whether it failed (any task type) or succeeded (transfer and barrier tasks), will corrupt an ongoing shuffle unless we can detect old-generation task-finished responses and handle them accordingly.

In summary, we need to:

  1. Ensure that every task of a given generation is executed at least once
  2. Suppress any finished tasks from an earlier generation

As long as we do not guarantee deduplication on the receiving end (#7324), we further need to ensure that shards of an earlier generation are not accepted by Shuffle instances of a newer generation. Otherwise, if the output-worker mappings of the old and new generation assign some output partitions to the same worker, old-generation transfer tasks may already have sent some shards to the new-generation receivers. Once such a task finishes and is rescheduled, this would effectively duplicate data.
Unlike #7324, weaker guarantees suffice to avoid this: it is sufficient to reject transfers that do not belong to the same generation.

  3. Ensure transfers of an earlier generation do not contribute to the new generation, i.e. new-generation receivers must reject old-generation transfers

1.) is trivially achieved by releasing all transfer tasks and the barrier task.
Note that not all unpack tasks will necessarily be rescheduled automatically. If the unpack tasks were already released beforehand and their dependents are still in memory, they will not be recomputed. This affects the internal logic that counts how many output tasks were already processed to determine whether a shuffle is "done".
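
To illustrate the bookkeeping concern, a hypothetical tracker (names made up) that determines "done" via the set of outstanding partitions rather than a plain counter, so outputs that will never re-run can be accounted for up front:

```python
# Hypothetical output-partition tracker, not the actual extension code.
class OutputTracker:
    def __init__(self, all_outputs: set[int], already_materialized: set[int]):
        # Partitions whose unpack task was released while its dependents are
        # still in memory will never report again, so subtract them up front.
        self.remaining = set(all_outputs) - already_materialized

    def mark_finished(self, partition: int) -> None:
        self.remaining.discard(partition)

    def done(self) -> bool:
        return not self.remaining
```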

2.) The only way to entirely ignore old-generation results is to extend the scheduler transition engine with this concept. Every Scheduler->Worker compute-task message would sign the request with a unique identifier (a simple counter, the stimulus id, etc.), and the scheduler remembers the most recent identifier per task. The worker stores this identifier in a TaskState attribute and returns it in its task-finished message. The scheduler handlers can then deduplicate by ignoring stale responses.
This concept has already been implemented successfully for work stealing.
This would be the first modification of the actual scheduler, as opposed to the extension that enables P2P. I believe this additional guarantee would not be harmful and may even avoid some fringe races that are currently dealt with by sophisticated transitions.
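
A rough sketch of this signing/deduplication idea, written outside the real Scheduler and Worker classes with illustrative names only:

```python
# Illustrative sketch only: the scheduler signs each compute-task request and
# ignores task-finished responses carrying a stale identifier.
import itertools

_run_ids = itertools.count()
latest_run_id: dict[str, int] = {}  # task key -> most recently issued run id


def send_compute_task(key: str) -> dict:
    run_id = next(_run_ids)
    latest_run_id[key] = run_id
    # The worker would store run_id on its TaskState and echo it back verbatim.
    return {"op": "compute-task", "key": key, "run_id": run_id}


def handle_task_finished(key: str, run_id: int) -> bool:
    # Responses from an earlier generation of the task are dropped silently
    # instead of triggering a transition.
    return latest_run_id.get(key) == run_id
```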

Note: A mere transition hook is not sufficient since a task failure would already generate a client message indicating a failure. The transition hooks do not allow us to intercept messages to clients (or workers), and I don't think that would be a suitable API.

3.) If the replicated shuffle metadata includes an attempt/generation counter that is also included in every shard submission, the receiving end will ignore all pushes that do not match its own counter. If the sender is of the old generation, the sender task can simply err or finish (pending 1.)). If the sender is of the new generation, it needs to retry until the receiving end is updated.
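
A sketch of the receiving side of this check, assuming the shard submission carries the sender's attempt counter (class and method names are hypothetical):

```python
# Hypothetical receiver-side check; class and exception names are made up.
class StaleShuffleAttempt(Exception):
    pass


class ShuffleRun:
    def __init__(self, shuffle_id: str, attempt: int):
        self.shuffle_id = shuffle_id
        self.attempt = attempt
        self.shards: list[bytes] = []

    def receive(self, sender_attempt: int, shard: bytes) -> None:
        if sender_attempt < self.attempt:
            # Old-generation sender: reject; its transfer task will be
            # rescheduled anyway (point 1).
            raise StaleShuffleAttempt("shard from an earlier shuffle attempt")
        if sender_attempt > self.attempt:
            # New-generation sender reached a receiver that has not been
            # updated yet; the sender should retry until the metadata arrives.
            raise StaleShuffleAttempt("receiver not yet updated, retry")
        self.shards.append(shard)
```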

cc @hendrikmakait @mrocklin

@fjetter
Member Author

fjetter commented Dec 6, 2022

FWIW I believe 3.) would also simplify implementation of #7324

@fjetter
Copy link
Member Author

fjetter commented Dec 6, 2022

A proposal for 2): #7372 (possibly incomplete)

@fjetter
Copy link
Member Author

fjetter commented Dec 6, 2022

For the sake of completeness, I believe 2.) and 3.) would also be achieved if the remove_worker hook in the scheduler extension could somehow wait for all workers to confirm that their threadpools are idle, i.e. that all transfer tasks were released.
In practice this is difficult, and I am a bit concerned about "blocking" in remove_worker on non-trivial logic, particularly logic that may be called multiple times in short succession (e.g. 5 workers failing simultaneously because a host failed).
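
Purely for illustration, the rejected alternative could look roughly like this; confirm() stands in for a per-worker acknowledgement RPC that does not exist in this form:

```python
# Hypothetical sketch of the rejected alternative; nothing here is real API.
import asyncio


async def confirm(worker_address: str, shuffle_id: str) -> None:
    """Placeholder for a per-worker 'all transfer tasks released' round-trip."""
    await asyncio.sleep(0)


async def remove_worker_blocking(
    surviving_workers: list[str], aborted_shuffles: list[str]
) -> None:
    # Blocking here on a round-trip per worker is the concern above, especially
    # when several workers disappear at once.
    for shuffle_id in aborted_shuffles:
        await asyncio.gather(
            *(confirm(addr, shuffle_id) for addr in surviving_workers)
        )
```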

@hendrikmakait
Member

Regarding 1.):

When implementing, we need to be aware of the lag between Scheduler.remove_worker transitioning tasks previously running on the leaving worker and the execution of the ShuffleSchedulerExtension.remove_worker hook.

Scheduler.remove_worker transitions tasks to released or erred before we reach the plugin hook. Rescheduling tasks without updating the metadata may cause them to fail. If the tasks transition to erred before we release them in the plugin hook, we will send client messages. (A possible solution may be #5403.)

FWIW I believe 3.) would also simplify implementation of #7324

I don't think I agree, but it would demote the problem to a fringe edge case that should not occur in the wild given the current implementation of the system.

@fjetter
Member Author

fjetter commented Dec 8, 2022

Scheduler.remove_worker transitions tasks to released or erred before we reach the plugin hook. Rescheduling tasks without updating the metadata may cause them to fail. If the tasks transition to erred before we release them in the plugin hook, we will send client messages. (A possible solution may be #5403.)

IIUC you are talking about the case where a failure is reported before the scheduler is aware of the failing worker.

I don't think I agree, but it would demote the problem to a fringe edge case that should not occur in the wild given the current implementation of the system.

We could deduplicate by (SourceID, attempt). Right now, we'd need to deduplicate by (SourceID, TargetID), but all TargetIDs are currently mixed together in the same bytestring.
Off topic, of course.

@hendrikmakait
Member

IIUC you are talking about the case where a failure is reported before the scheduler is aware of the failing worker.

The case I am talking about is when the scheduler starts taking action in Scheduler.remove_worker, i.e., transitioning all tasks in

```python
ts: TaskState
for ts in list(ws.processing):
    k = ts.key
    recommendations[k] = "released"
    if not safe:
        ts.suspicious += 1
        ts.prefix.suspicious += 1
        if ts.suspicious > self.allowed_failures:
            del recommendations[k]
            e = pickle.dumps(
                KilledWorker(
                    task=k,
                    last_worker=ws.clean(),
                    allowed_failures=self.allowed_failures,
                ),
            )
            r = self.transition(
                k,
                "erred",
                exception=e,
                cause=k,
                stimulus_id=stimulus_id,
                worker=address,
            )
            recommendations.update(r)
            logger.info(
                "Task %s marked as failed because %d workers died"
                " while trying to run it",
                ts.key,
                self.allowed_failures,
            )

for ts in list(ws.has_what):
    self.remove_replica(ts, ws)
    if not ts.who_has:
        if ts.run_spec:
            recommendations[ts.key] = "released"
        else:  # pure data
            recommendations[ts.key] = "forgotten"

self.transitions(recommendations, stimulus_id=stimulus_id)
```

but the ShuffleSchedulerExtension has not yet been called in

```python
for plugin in list(self.plugins.values()):
    try:
        result = plugin.remove_worker(scheduler=self, worker=address)
        if inspect.isawaitable(result):
            await result
    except Exception as e:
        logger.exception(e)
```

A similar case holds for relying on transition hooks as you already describe for 2.):

Note: A mere transition hook is not sufficient since a task failure would already generate a client message indicating a failure. The transition hooks do not allow us to intercept messages to clients (or workers), and I don't think that would be a suitable API.

We could deduplicate by (SourceID, attempt). Right now, we'd need to deduplicate by (SourceID, TargetID), but all TargetIDs are currently mixed together in the same bytestring.

Fair point.

@fjetter
Member Author

fjetter commented Mar 1, 2023

@hendrikmakait I believe this issue can be closed, can't it? Is there something missing?

@hendrikmakait
Member

We still lack the entire mechanism for automatic restarts. For now, we merely fail reliably.
