Use checkpointing for actor fault tolerance. #605

Closed
@robertnishihara

Description

We may wish to allow users to define custom save and restore methods for checkpointing the state of an actor. That way, if a node dies, we can recreate the actors that were running on that node by restoring them from the most recent checkpoint and potentially replaying the tasks that happened after that checkpoint.

This is underspecified and may not be the best approach, so we should discuss this more.

  1. What are the concrete use cases where this would be beneficial?
  2. What is the right API to expose?
  3. What should the default behavior be?
  4. What use cases does this fail to help with?

cc @stephanie-wang

One possible usage could be the following. Note that for many use cases (e.g., TensorFlow graphs), the save and load methods would probably need to be much more involved.

import ray

@ray.remote
class Foo(object):
    def __init__(self):
        self.counter = 0

    def inc(self):
        self.counter += 1
        return self.counter

    def __ray_save__(self):
        return self.counter

    def __ray_restore__(self, serialized):
        """The argument to this method is the output of __ray_save__."""
        self.counter = serialized

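As a slightly more involved sketch (not part of the proposal itself, just an illustration of the proposed __ray_save__/__ray_restore__ hooks), an actor holding model weights might checkpoint a dictionary of arrays rather than a single counter. The ParameterServer class, its methods, and the dictionary layout below are hypothetical; for a real TensorFlow graph, the save method would likely need to extract variable values explicitly rather than rely on default serialization of the whole object.

import numpy as np
import ray

@ray.remote
class ParameterServer(object):
    """Hypothetical actor whose checkpoint covers both weights and progress."""

    def __init__(self, dim):
        self.weights = np.zeros(dim)
        self.updates_applied = 0

    def apply_update(self, gradient):
        self.weights += gradient
        self.updates_applied += 1
        return self.updates_applied

    def get_weights(self):
        return self.weights

    def __ray_save__(self):
        # Checkpoint everything needed to reconstruct the actor's state.
        return {"weights": self.weights,
                "updates_applied": self.updates_applied}

    def __ray_restore__(self, checkpoint):
        """The argument to this method is the output of __ray_save__."""
        self.weights = checkpoint["weights"]
        self.updates_applied = checkpoint["updates_applied"]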