Some workloads consist of a set of tasks that need to run together. If any task in the set fails, or if a worker holding intermediate data fails, then we need to retry the entire set of tasks. This comes up in a few situations:
- Distributed XGBoost
- Distributed deep learning
- Shuffling
These are especially tricky because there is out-of-band communication and state that Dask's normal mechanisms may not be able to track. What options do we have to make these systems robust?
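To make the "retry the entire set" semantics concrete, here is a minimal sketch using only the Python standard library. It is not a Dask API: `run_group` and `max_attempts` are hypothetical names, threads stand in for workers, and it assumes a non-empty list of tasks that are safe to rerun from scratch. It does not deal with out-of-band state, which would also need to be torn down and rebuilt between attempts.

```python
from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait


def run_group(tasks, max_attempts=3):
    """Run every callable in `tasks` together, all-or-nothing.

    Hypothetical helper for illustration only (not part of Dask).
    Returns (results, attempts_used). If any task raises, the results of
    the tasks that succeeded are discarded and the whole group is rerun.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
            futures = [pool.submit(task) for task in tasks]
            # Returns as soon as one task raises, or when all have finished.
            done, not_done = wait(futures, return_when=FIRST_EXCEPTION)
            failed = [f for f in done if f.exception() is not None]
            if not failed:
                return [f.result() for f in futures], attempt
            # Best effort: tasks that have not started yet are cancelled;
            # tasks already running finish before the pool shuts down.
            for f in not_done:
                f.cancel()
            last_error = failed[0].exception()
    raise RuntimeError(
        f"task group failed after {max_attempts} attempts"
    ) from last_error
```

Calling `run_group([task_a, task_b])` reruns `task_a` as well whenever `task_b` fails, which is the behavior described above: a partial success is never kept.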