Closed
Description
We may wish to allow users to define custom save
and restore
methods for checkpointing the state of an actor. That way, if a node dies, we can recreate the actors that were running on that node by restoring them from the most recent checkpoint and potentially replaying the tasks that happened after that checkpoint.
This is underspecified and may not be the best approach, so we should discuss this more.
- What are the concrete use cases where this would be beneficial?
- What is the right API to expose?
- What should the default behavior be?
- What use cases does this fail to help with?
One possible usage could be the following. Note that for many use cases (e.g., TensorFlow graphs), the save and load methods would probably need to be much more involved.
@ray.remote
class Foo(object):
def __init__(self):
self.counter = 0
def inc(self):
self.counter += 1
return self.counter
def __ray_save__(self):
return self.counter
def __ray_restore__(self, serialized):
"""The argument to this method is the output of __ray_save__."""
self.counter = serialized
Metadata
Metadata
Assignees
Labels
No labels