There are situations where we have a set of tasks that need to run together. If any of them fails, or if a worker holding intermediate data fails, then we need to retry the entire set. This comes up in a few cases:
- Distributed XGBoost
- Distributed deep learning
- Shuffling
These are especially tricky because there is out-of-band communication and state that Dask's normal mechanisms may not be able to track. What options do we have to make these systems robust?
I had a chat with Jim a couple of nights ago. He brought up some ideas around virtual tasks. I'll let him or Gabe speak to that in more depth.
My plan to resolve this was to link a set of tasks together so that a failure in one can trigger a retry of the others. There are a couple of ways to do this:
1. Add a `Retry` or `Rerun` exception that can be raised from within user tasks. Every final task might then check that the set of workers hasn't shrunk; if it has, it would call for a rerun of an initial set of tasks, which would retrigger the entire computation (a rough sketch follows the list below).
2. Handle this on the scheduler as special functionality. This would make consistency easier (there are some interesting failure cases for option 1) but would require adding functionality deeper into Dask (see the second sketch below).
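
To make option 1 concrete, here is a minimal, client-driven sketch. Everything specific to the idea is an assumption: `Rerun`, `load_partition`, `train`, and the retry loop are hypothetical illustrations, not existing Dask APIs; only `Client`, `submit`, `get_client`, and `scheduler_info` are real distributed calls.

```python
from distributed import Client, get_client


class Rerun(Exception):
    """Hypothetical exception: a final task raises it to request a full rerun."""


def load_partition(i):
    # Stand-in for a task whose output lives on a worker as intermediate state.
    return list(range(i * 10, (i + 1) * 10))


def train(parts, workers_at_submit):
    # Final task: if the worker set shrank since submission, intermediate
    # state may have been lost, so ask the client to rerun the whole group.
    current = set(get_client().scheduler_info()["workers"])
    if not workers_at_submit <= current:
        raise Rerun("worker set shrank; rerun the whole group")
    return sum(len(p) for p in parts)


def run_group(client, n_parts=4, max_attempts=3):
    for _ in range(max_attempts):
        workers = frozenset(client.scheduler_info()["workers"])
        parts = [client.submit(load_partition, i, pure=False) for i in range(n_parts)]
        final = client.submit(train, parts, workers, pure=False)
        try:
            return final.result()
        except Rerun:
            continue  # resubmit the entire set of tasks from scratch
    raise RuntimeError("group failed after repeated worker loss")


if __name__ == "__main__":
    client = Client(processes=False)
    print(run_group(client))
```

One failure mode worth noting for this client-side version: if a worker dies mid-computation, `final.result()` may surface a worker-death error rather than `Rerun`, so a real implementation would also catch those and resubmit. That consistency question is part of why the scheduler-side option may be easier to get right.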
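For option 2, short of adding new functionality deep inside Dask, one way to prototype the idea today is a `SchedulerPlugin` that watches for departing workers and emits an event the client can react to. This is only a rough approximation of proper scheduler support: the plugin class and the `"group-retry"` topic name are made up for illustration, while `SchedulerPlugin.remove_worker`, `Scheduler.log_event`, `Client.register_scheduler_plugin`, and `Client.get_events` are the existing hooks it leans on.

```python
from distributed import Client
from distributed.diagnostics.plugin import SchedulerPlugin


class GroupRetrySentinel(SchedulerPlugin):
    """Hypothetical plugin: report lost workers so a client can rerun a task group."""

    def remove_worker(self, scheduler=None, worker=None, **kwargs):
        # Runs inside the scheduler whenever a worker departs.
        scheduler.log_event("group-retry", {"lost_worker": worker})


client = Client(processes=False)
client.register_scheduler_plugin(GroupRetrySentinel())

# ... submit the linked group of tasks here ...

# On the client side, check whether any worker was lost while the group ran;
# if so, resubmit the entire group (as in the option 1 sketch above).
if client.get_events("group-retry"):
    print("a worker was lost; the whole group should be rerun")
```

Real scheduler-level support would go further, for example tracking which keys belong to the group and retriggering them directly, which is the "functionality added deeper into Dask" that option 2 calls for.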
Each of the two options above has its challenges, but my guess is that they are solvable. Hopefully other folks can come up with simpler solutions.
cc @gjoseph92 @jcrist