Retry collection of tasks on failed worker #5403

Open
mrocklin opened this issue Oct 8, 2021 · 2 comments

Comments

@mrocklin
Member

mrocklin commented Oct 8, 2021

Sometimes we have a set of tasks that need to run together. If any of them fails, or if a worker holding intermediate data fails, then we need to retry the entire set of tasks. This comes up in a few situations:

  1. Distributed XGBoost
  2. Distributed deep learning
  3. Shuffling

These are especially tricky because there is out-of-band communication and state that Dask's normal mechanisms may not be able to track. What options do we have to make these systems robust?

cc @gjoseph92 @jcrist

@mrocklin
Member Author

mrocklin commented Oct 8, 2021

I had a chat with Jim a couple of nights ago. He brought up some ideas around virtual tasks. I'll let him or Gabe speak to that in more depth.

My plan to resolve this was to link sets of tasks, so that one set can trigger a retry of another. There are a couple of ways to do this:

  1. Add a Retry or Rerun exception that can be raised from within user tasks. Every final task might then check that the set of workers hasn't shrunk, and if it has, request a rerun of an initial set of tasks, which would retrigger the entire computation.
  2. Handle this on the scheduler as special functionality. This would make consistency easier (there are some interesting failure cases for option 1) but would require adding functionality deeper in Dask.
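A minimal sketch of option 1 in plain Python, only to make the control flow concrete. Nothing here is an existing Dask API: the `Rerun` exception, `final_task`, `run_group`, and the `get_workers` callback are all hypothetical names, and the "tasks" are ordinary callables run in-process.

```python
class Rerun(Exception):
    """Hypothetical signal: the whole linked set of tasks must be rerun."""

    def __init__(self, keys):
        super().__init__(f"rerun requested for {sorted(keys)}")
        self.keys = set(keys)


def final_task(partials, workers_at_start, workers_now):
    # The final task checks that the set of workers hasn't shrunk.
    if not workers_at_start <= workers_now:
        raise Rerun(keys=partials.keys())
    return sum(partials.values())


def run_group(initial_tasks, get_workers, max_reruns=3):
    """Run the initial tasks, then the final task; restart everything on Rerun."""
    for _ in range(max_reruns + 1):
        workers_at_start = set(get_workers())
        partials = {key: fn() for key, fn in initial_tasks.items()}
        try:
            return final_task(partials, workers_at_start, set(get_workers()))
        except Rerun:
            continue  # discard partial results and recompute the whole set
    raise RuntimeError("group did not complete within max_reruns")
```

In this sketch the rerun is driven by a loop on the caller's side. In a real cluster the exception would have to be interpreted by the scheduler or client, which is where the consistency questions mentioned in option 2 come in.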

There are challenges to each of the two options above, but my guess is that they are solvable. There are hopefully simpler solutions that other folks can come up with.
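For option 2, the scheduler-side bookkeeping could look roughly like the toy below. This is an assumption-laden sketch, not Dask's scheduler: `GroupTracker` and all of its methods and attributes are hypothetical, and it only models which keys would need to be recomputed when a worker disappears.

```python
from collections import defaultdict


class GroupTracker:
    """Toy stand-in for scheduler-side state; not Dask's Scheduler API."""

    def __init__(self):
        self.group_of = {}                       # task key -> group name
        self.members = defaultdict(set)          # group name -> task keys
        self.held_by_worker = defaultdict(set)   # worker address -> keys it holds

    def add_group(self, name, keys):
        for key in keys:
            self.group_of[key] = name
            self.members[name].add(key)

    def task_finished(self, key, worker):
        self.held_by_worker[worker].add(key)

    def remove_worker(self, worker):
        """Return every key that must be recomputed after `worker` is lost."""
        lost = self.held_by_worker.pop(worker, set())
        to_rerun = set()
        for key in lost:
            group = self.group_of.get(key)
            # A grouped task drags its whole group along; others rerun alone.
            to_rerun |= self.members[group] if group is not None else {key}
        # Forget stale results for those keys on the surviving workers.
        for held in self.held_by_worker.values():
            held -= to_rerun
        return to_rerun
```

Losing one worker that held a single member of a group marks the whole group for recomputation, including members whose results still sit on healthy workers. Out-of-band state (for example, an XGBoost or shuffle communicator) is not modelled here at all.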

@fjetter
Member

fjetter commented Oct 14, 2021

@jcrist, @gjoseph92, @crusaderky and I had a constructive synchronous discussion about this topic. I'll provide a summary tomorrow, but raw notes are available at https://docs.google.com/document/d/1Td5Yg1d96xSgLM-c8RnjLN0UAVBg6pDfONTsPDq0Fps/edit
