Use checkpointing for actor fault tolerance. #605

Closed
@robertnishihara

Description

We may wish to allow users to define custom save and restore methods for checkpointing the state of an actor. That way, if a node dies, we can recreate the actors that were running on that node by restoring them from the most recent checkpoint and potentially replaying the tasks that happened after that checkpoint.

This is underspecified and may not be the best approach, so we should discuss this more.

  1. What are the concrete use cases where this would be beneficial?
  2. What is the right API to expose?
  3. What should the default behavior be?
  4. What use cases does this fail to help with?

cc @stephanie-wang

One possible usage could be the following. Note that for many use cases (e.g., TensorFlow graphs), the save and load methods would probably need to be much more involved.

import ray

@ray.remote
class Foo(object):
    def __init__(self):
        self.counter = 0

    def inc(self):
        self.counter += 1
        return self.counter

    def __ray_save__(self):
        return self.counter

    def __ray_restore__(self, serialized):
        """The argument to this method is the output of __ray_save__."""
        self.counter = serialized

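As a slightly more involved sketch (not part of the proposal itself, just an illustration of the proposed __ray_save__/__ray_restore__ hooks), an actor holding model weights might checkpoint a dictionary of arrays rather than a single counter. The ParameterServer class, its methods, and the dictionary layout below are hypothetical; for a real TensorFlow graph, the save method would likely need to extract variable values explicitly rather than rely on default serialization of the whole object.

import numpy as np
import ray

@ray.remote
class ParameterServer(object):
    """Hypothetical actor whose checkpoint covers both weights and progress."""

    def __init__(self, dim):
        self.weights = np.zeros(dim)
        self.updates_applied = 0

    def apply_update(self, gradient):
        self.weights += gradient
        self.updates_applied += 1
        return self.updates_applied

    def get_weights(self):
        return self.weights

    def __ray_save__(self):
        # Checkpoint everything needed to reconstruct the actor's state.
        return {"weights": self.weights,
                "updates_applied": self.updates_applied}

    def __ray_restore__(self, checkpoint):
        """The argument to this method is the output of __ray_save__."""
        self.weights = checkpoint["weights"]
        self.updates_applied = checkpoint["updates_applied"]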