Describe the problem
It would be great if Tune could recover from interrupts or crashes of the entire cluster, including the driver. An initial version could simply avoid re-executing completed trials.
Must have:
- Recover the state of all completed trials so they aren't re-executed.
Nice to have:
- Recover the state of all running trials from their last checkpoint.
- Recover scheduler / suggestion state (perhaps by replaying trial results?)
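To make the "must have" concrete, here is a minimal sketch of skipping completed trials after a driver restart. It is not Tune's API: `ExperimentState`, `run_experiment`, the JSON file layout, and the `"RUNNING"` / `"TERMINATED"` status strings are all hypothetical names chosen for illustration. The idea is only that trial status is persisted atomically after every transition, so a fresh driver can reload it and skip anything already finished.

```python
import json
import os
import tempfile


class ExperimentState:
    """Hypothetical on-disk record of trial status (not a Tune class)."""

    def __init__(self, path):
        self.path = path
        self.trials = {}
        # On restart, reload whatever the previous driver persisted.
        if os.path.exists(path):
            with open(path) as f:
                self.trials = json.load(f)

    def record(self, trial_id, status, result=None):
        self.trials[trial_id] = {"status": status, "result": result}
        # Write to a temp file and rename, so a crash mid-write
        # never leaves a truncated state file behind.
        dir_name = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=dir_name)
        with os.fdopen(fd, "w") as f:
            json.dump(self.trials, f)
        os.replace(tmp_path, self.path)

    def is_completed(self, trial_id):
        return self.trials.get(trial_id, {}).get("status") == "TERMINATED"


def run_experiment(configs, train_fn, state):
    """Run every trial in `configs` that is not already completed.

    Returns the list of trial ids actually executed in this call.
    """
    executed = []
    for trial_id, config in configs.items():
        if state.is_completed(trial_id):
            continue  # finished before the crash; do not re-execute
        state.record(trial_id, "RUNNING")
        result = train_fn(config)
        state.record(trial_id, "TERMINATED", result=result)
        executed.append(trial_id)
    return executed
```

In this sketch a trial that was still `"RUNNING"` at crash time is simply re-run from scratch. Resuming it from its last checkpoint, or replaying the stored `result` entries into a scheduler or suggestion algorithm, would build on the same persisted record but is left out here.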
cc @richardliaw @old-bear @hartikainen any thoughts here?
Tracking a list of issues related to fully functional fault tolerance in Tune:
- #3235 Tune currently doesn't respect the event loop iteration invariant
- #3242 + #3246 (#3239 optional but related)
- #3238 Tune (Node) Fault Tolerance Tests (this also assumes node fault tolerance works)
- #3264 ray.wait hangs when nodes fail
Then, we can introduce cluster fault tolerance with ballistic tests and such.