[tune] Application level fault tolerance for large hyperparam searches #2840

@ericl

Describe the problem

It would be great if Tune could recover from interrupts or crashes of the entire cluster, including the driver. An initial version could simply avoid re-executing completed trials.

Must have:

  • Recover the state of all completed trials so they aren't re-executed.
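To make the "must have" concrete, here is a minimal sketch of one way a driver could persist trial state and skip completed trials after a restart. It is not Tune's implementation: `save_state`, `load_state`, `run_experiment`, the JSON layout, and the `"COMPLETED"` status string are all hypothetical names chosen for illustration, and trials are run sequentially for simplicity.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the state atomically so a crash mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the finished temp file into place in one step.
    os.replace(tmp_path, path)


def load_state(path):
    """Return the saved state, or an empty dict on a fresh start."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def run_experiment(configs, run_trial, state_path):
    """Run every trial in `configs`, skipping those already marked completed.

    `configs` maps a trial id (str) to its config; `run_trial` is any
    callable that takes a config and returns a JSON-serializable result.
    """
    state = load_state(state_path)
    for trial_id, config in configs.items():
        if state.get(trial_id, {}).get("status") == "COMPLETED":
            continue  # finished before the crash; do not re-execute
        result = run_trial(config)
        state[trial_id] = {"status": "COMPLETED", "result": result}
        save_state(state_path, state)  # persist after every completed trial
    return state
```

Calling `run_experiment` a second time with the same `state_path` re-runs only the trials that had not finished. A trial that was running when the driver died is simply started again from scratch in this sketch.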

Nice to have:

  • Recover the state of all running trials from their last checkpoint.
  • Recover scheduler / suggestion state (perhaps by replaying trial results?)
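The "replaying trial results" idea could look roughly like the sketch below: every reported result is appended to a log, and after a restart the log is fed back into a fresh scheduler to rebuild its internal state. `ToyMedianScheduler`, `log_result`, and `replay` are hypothetical stand-ins, not Tune classes or functions, and the approach assumes the scheduler's state is a deterministic function of the result sequence.

```python
import json
from statistics import median


class ToyMedianScheduler:
    """Hypothetical scheduler: stop a trial whose new score falls below
    the median of the other trials' best scores seen so far."""

    def __init__(self):
        self.best = {}  # trial_id -> best score reported so far

    def on_result(self, trial_id, score):
        others = [s for t, s in self.best.items() if t != trial_id]
        self.best[trial_id] = max(score, self.best.get(trial_id, score))
        if others and score < median(others):
            return "STOP"
        return "CONTINUE"


def log_result(log_path, trial_id, score):
    """Append one result as a JSON line; existing lines are never rewritten."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"trial_id": trial_id, "score": score}) + "\n")


def replay(log_path, scheduler):
    """Feed every logged result back into `scheduler`, in the original order."""
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            scheduler.on_result(record["trial_id"], record["score"])
    return scheduler
```

During a normal run the driver would call `log_result` alongside each `on_result`. After a crash, `replay(log_path, ToyMedianScheduler())` yields a scheduler with the same `best` table as the one that was lost. A scheduler or suggestion algorithm that uses randomness would also need its random state saved for this to hold.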

cc @richardliaw @old-bear @hartikainen any thoughts here?


Tracking a list of issues related to fully functional fault tolerance in Tune:

#3235 Tune currently doesn't respect the event loop iteration invariant
#3242 + #3246 (#3239 optional but related)
#3238 Tune (Node) Fault Tolerance Tests (this also assumes node fault tolerance works)
#3264 ray.wait hangs when nodes fail

Then, we can introduce cluster fault tolerance with ballistic tests and such.
