There are situations where we have a set of tasks that need to run together. If any of them fails, or if a worker holding intermediate data fails, then we need to retry the entire set of tasks. This comes up in a few cases:
- Distributed XGBoost
- Distributed deep learning
- Shuffling
These are especially tricky because there is out-of-band communication and state that Dask's normal mechanisms may not be able to track. What options do we have to make these systems robust?
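To make the "retry the entire set" semantics concrete, here is a minimal sketch of a gang-style wrapper using only the standard library. It is not a Dask API; `run_gang` and its behavior are hypothetical, and a real solution would need to coordinate with the scheduler rather than a local thread pool.

```python
import concurrent.futures


def run_gang(tasks, retries=3):
    """Run a set of zero-argument callables together.

    If any task fails, discard all results and rerun the whole set,
    since out-of-band state shared by the group may now be invalid.
    Hypothetical helper for illustration only.
    """
    for attempt in range(retries):
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [pool.submit(task) for task in tasks]
            try:
                return [f.result() for f in futures]
            except Exception:
                # One failure invalidates the group's intermediate
                # state, so we retry everything, not just the failure.
                if attempt == retries - 1:
                    raise
```

The key design point is the all-or-nothing retry: partial results are never reused, which is exactly the property Dask's per-task retry mechanism does not provide for groups with shared out-of-band state.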