There are situations where we have a set of tasks that need to run together. If any of them fails, or if a worker holding intermediate data fails, then we need to retry the entire set. This comes up in a few cases:
- Distributed XGBoost
- Distributed deep learning
- Shuffling
These are especially tricky because there is out-of-band communication and state that Dask's normal mechanisms may not be able to track. What options do we have to make these systems robust?
I had a chat with Jim a couple of nights ago. He brought up some ideas around virtual tasks. I'll let him or Gabe speak to that in more depth.
My plan to resolve this was to link a set of tasks together so that a failure in one can trigger a retry of the others. There are a couple of ways to do this:
1. Add a `Retry` or `Rerun` exception that can be raised from within user tasks. Every final task might then check that the set of workers hasn't shrunk; if it has, it would call for a rerun of an initial set of tasks, which would retrigger the entire computation (a rough sketch follows the list below).
2. Handle this on the scheduler as special functionality. This would make consistency easier (there are some interesting failure cases for option 1) but would require adding functionality deeper into Dask (see the second sketch below).
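
To make option 1 concrete, here is a minimal, client-driven sketch. Everything specific to the idea is an assumption: `Rerun`, `load_partition`, `train`, and the retry loop are hypothetical illustrations, not existing Dask APIs; only `Client`, `submit`, `get_client`, and `scheduler_info` are real distributed calls.

```python
from distributed import Client, get_client


class Rerun(Exception):
    """Hypothetical exception: a final task raises it to request a full rerun."""


def load_partition(i):
    # Stand-in for a task whose output lives on a worker as intermediate state.
    return list(range(i * 10, (i + 1) * 10))


def train(parts, workers_at_submit):
    # Final task: if the worker set shrank since submission, intermediate
    # state may have been lost, so ask the client to rerun the whole group.
    current = set(get_client().scheduler_info()["workers"])
    if not workers_at_submit <= current:
        raise Rerun("worker set shrank; rerun the whole group")
    return sum(len(p) for p in parts)


def run_group(client, n_parts=4, max_attempts=3):
    for _ in range(max_attempts):
        workers = frozenset(client.scheduler_info()["workers"])
        parts = [client.submit(load_partition, i, pure=False) for i in range(n_parts)]
        final = client.submit(train, parts, workers, pure=False)
        try:
            return final.result()
        except Rerun:
            continue  # resubmit the entire set of tasks from scratch
    raise RuntimeError("group failed after repeated worker loss")


if __name__ == "__main__":
    client = Client(processes=False)
    print(run_group(client))
```

One failure mode worth noting for this client-side version: if a worker dies mid-computation, `final.result()` may surface a worker-death error rather than `Rerun`, so a real implementation would also catch those and resubmit. That consistency question is part of why the scheduler-side option may be easier to get right.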
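For option 2, short of adding new functionality deep inside Dask, one way to prototype the idea today is a `SchedulerPlugin` that watches for departing workers and emits an event the client can react to. This is only a rough approximation of proper scheduler support: the plugin class and the `"group-retry"` topic name are made up for illustration, while `SchedulerPlugin.remove_worker`, `Scheduler.log_event`, `Client.register_scheduler_plugin`, and `Client.get_events` are the existing hooks it leans on.

```python
from distributed import Client
from distributed.diagnostics.plugin import SchedulerPlugin


class GroupRetrySentinel(SchedulerPlugin):
    """Hypothetical plugin: report lost workers so a client can rerun a task group."""

    def remove_worker(self, scheduler=None, worker=None, **kwargs):
        # Runs inside the scheduler whenever a worker departs.
        scheduler.log_event("group-retry", {"lost_worker": worker})


client = Client(processes=False)
client.register_scheduler_plugin(GroupRetrySentinel())

# ... submit the linked group of tasks here ...

# On the client side, check whether any worker was lost while the group ran;
# if so, resubmit the entire group (as in the option 1 sketch above).
if client.get_events("group-retry"):
    print("a worker was lost; the whole group should be rerun")
```

Real scheduler-level support would go further, for example tracking which keys belong to the group and retriggering them directly, which is the "functionality added deeper into Dask" that option 2 calls for.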
Each of the two options above has its challenges, but my guess is that they are solvable. Hopefully other folks can come up with simpler solutions.
cc @gjoseph92 @jcrist