Actor checkpointing with object lineage reconstruction #1004
Force-pushed from 0a95d7b to 3e332a3.
src/common/task.cc (outdated)
@@ -217,7 +217,18 @@ ActorID TaskSpec_actor_id(TaskSpec *spec) {
 int64_t TaskSpec_actor_counter(TaskSpec *spec) {
   CHECK(spec);
   auto message = flatbuffers::GetRoot<TaskInfo>(spec);
-  return message->actor_counter();
+  int64_t actor_counter = message->actor_counter();
+  if (actor_counter < 0) {
Why not "return std::abs(message->actor_counter());"?
@@ -42,6 +42,7 @@ typedef struct {
  * restrict the submission of tasks on actors to the process that created the
  * actor. */
   int64_t task_counter;
+  int64_t assigned_task_counter;
can you document this?
LGTM, can you fix the small comments and rebase it?
Great, thanks! I pushed the fixes.
Force-pushed from 68f04e5 to dfe51d1.
…ted after task is successful
- Return new task counter in GetTaskRequest
- Update worker state for actor tasks inside of the actor method executor
Force-pushed from d740913 to 7ad3c3a.
src/common/task.cc
@@ -217,7 +217,14 @@ ActorID TaskSpec_actor_id(TaskSpec *spec) {
 int64_t TaskSpec_actor_counter(TaskSpec *spec) {
   CHECK(spec);
   auto message = flatbuffers::GetRoot<TaskInfo>(spec);
-  return message->actor_counter();
+  return std::abs(message->actor_counter());
Why std::abs? Is this negative sometimes? And if so, what does that mean?
I set the counter to be negative for checkpoint tasks. It was mostly to avoid adding another field to the task spec. Let me know if you prefer to just add a field.
Instead of negative numbers, I actually think adding a bool to the TaskSpec saying whether it is a checkpoint task or not is cleaner. What do you think?
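To make the trade-off concrete, here is a toy Python illustration of the two encodings under discussion; the names are hypothetical, and the real code is the C++ TaskSpec accessor above.

# Toy illustration of the two encodings; all names here are hypothetical.

# Encoding 1: overload the counter's sign to mark checkpoint tasks.
def encode_v1(counter, is_checkpoint):
    # Note: counter == 0 cannot be marked, since -0 == 0; one reason the
    # explicit flag below is cleaner.
    return -counter if is_checkpoint else counter

def decode_v1(encoded):
    return abs(encoded), encoded < 0  # (counter, is_checkpoint)

# Encoding 2: carry an explicit flag, as suggested above.
def encode_v2(counter, is_checkpoint):
    return {"actor_counter": counter, "is_checkpoint": is_checkpoint}

assert decode_v1(encode_v1(3, True)) == (3, True)
assert decode_v1(encode_v1(0, True)) == (0, False)  # the zero-counter pitfall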
src/common/task.cc (outdated)
   return std::abs(message->actor_counter());
 }

+int64_t TaskSpec_actor_is_checkpoint_method(TaskSpec *spec) {
Should this return a bool?
Yup, thanks!
python/ray/actor.py (outdated)
        args = args[:-1]
        if method_name == "__ray_checkpoint__":
            # Execute the checkpoint task. NOTE(swang): Checkpoint methods
            # should not throw an exception.
They definitely could throw an exception, right? E.g., if they try to pickle stuff, pickling can fail. Also, if we allow user-defined checkpointing methods (which may be necessary for neural nets and things like that), the user could have a bug in the checkpointing method.
Does that have any implications for this code? Maybe we should add a test case for the case where checkpointing fails?
Hmm what do you think the expected behavior should be if the checkpoint task does throw an exception? For both saving and resuming a checkpoint?
For me, I would probably expect saving a checkpoint to store the exception like a normal task, and then continue executing tasks on the actor. An exception while resuming a checkpoint could reconstruct the previous object, as if there were no successful checkpoints. How does that sound?
Yeah, once we figure out the expected behavior, a test case would be good. Right now a checkpoint task that throws an exception would just hang, since it would never put the dummy object.
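To make the proposed semantics concrete, here is a minimal Python sketch; the helper names are illustrative, not Ray's actual internals.

# Sketch of the semantics proposed above: a failed save is stored like a
# normal task result and execution continues. Hypothetical names.
import pickle

def save_checkpoint(actor, store):
    # E.g. serialize actor state to Redis; here just an in-memory dict.
    store["checkpoint"] = pickle.dumps(actor.__dict__)

def run_checkpoint_task(actor, store):
    checkpoint_failed, error = False, None
    try:
        save_checkpoint(actor, store)
    except Exception as e:
        # Record the failure instead of hanging: the dummy object must
        # still be put, and the actor's counter only advances on success.
        checkpoint_failed, error = True, e
    return checkpoint_failed, error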
@@ -36,6 +36,11 @@ enum MessageType:int {
   PutObject
 }

+// This message is sent from a worker to a local scheduler.
+table GetTaskRequest {
+  task_success: bool;
What does it mean if task_success is True versus False? And why is this necessary?
This is a little messy, but it's used to notify the local scheduler of the actor's new task counter. For checkpoint tasks, the actor's task counter should only be updated if the task was successful (the checkpoint was actually saved or resumed). If the checkpoint isn't there, for example, then task_success is set to False and the local scheduler won't modify the actor's task counter.
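As an illustration of that gating logic, here is a toy Python model; it is not the actual C++ local scheduler code, and the names are hypothetical.

# Toy model of how task_success in GetTaskRequest gates the counter update.
def handle_get_task_request(actor_state, new_task_counter, task_success):
    # Only adopt the worker-reported counter when the checkpoint task
    # actually saved or resumed a checkpoint; otherwise keep the old value
    # so the task can be retried or reconstructed.
    if task_success:
        actor_state["task_counter"] = new_task_counter
    return actor_state

state = {"task_counter": 5}
handle_get_task_request(state, 6, task_success=False)
assert state["task_counter"] == 5  # unchanged: the checkpoint wasn't there
handle_get_task_request(state, 6, task_success=True)
assert state["task_counter"] == 6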
python/ray/actor.py
@@ -214,6 +322,10 @@ def export_actor(actor_id, class_id, actor_method_names, num_cpus, num_gpus,


+def make_actor(cls, num_cpus, num_gpus, checkpoint_interval):
+    # Add one to the checkpoint interval since we will insert a mock task for
+    # every checkpoint.
+    checkpoint_interval += 1
Not a big deal, but

@ray.remote(checkpoint_interval=0)
class Foo(object):
    def __init__(self):
        pass

f = Foo.remote()

causes an infinite loop :)
Ah yeah, this is a little unfortunate...
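For illustration, one possible guard (a sketch only, not necessarily what the PR ends up doing) would be to validate the interval before the +1 adjustment:

# Possible guard (hypothetical): reject non-positive intervals before the
# +1 adjustment, so checkpoint_interval=0 cannot make every task slot a
# checkpoint task and loop forever.
def validate_checkpoint_interval(checkpoint_interval):
    if checkpoint_interval is not None and checkpoint_interval < 1:
        raise ValueError(
            "checkpoint_interval must be at least 1, got {}".format(
                checkpoint_interval))
    return checkpoint_interval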
python/ray/actor.py (outdated)
            plasma_id.binary())
        worker.local_scheduler_client.notify_unblocked()

    return actor_checkpoint_failed, error
If I run the following,

import ray

ray.init()

@ray.remote(checkpoint_interval=1)
class Foo(object):
    def __init__(self):
        pass

    def method(self):
        pass

    def __ray_save__(self):
        raise Exception("failure")

f = Foo.remote()

then the first call to ray.get(f.method.remote()) fails with

Traceback (most recent call last):
  File "/Users/rkn/Workspace/ray/python/ray/worker.py", line 728, in _process_task
    *arguments)
  File "/Users/rkn/Workspace/ray/python/ray/actor.py", line 151, in actor_method_executor
    actor_checkpoint_failed, error = method(actor, *args)
  File "/Users/rkn/Workspace/ray/python/ray/actor.py", line 472, in __ray_checkpoint__
    return actor_checkpoint_failed, error
UnboundLocalError: local variable 'error' referenced before assignment

and the second call to ray.get(f.method.remote()) hangs.

Not sure if we want to address that here.
        ray.worker.cleanup()

    def testCheckpointException(self):
This test hangs for me, and I see the error

Remote function __ray_checkpoint__ failed with:

Traceback (most recent call last):
  File "/Users/rkn/Workspace/ray/python/ray/worker.py", line 728, in _process_task
    *arguments)
  File "/Users/rkn/Workspace/ray/python/ray/actor.py", line 151, in actor_method_executor
    actor_checkpoint_failed, error = method(actor, *args)
  File "/Users/rkn/Workspace/ray/python/ray/actor.py", line 472, in __ray_checkpoint__
    return actor_checkpoint_failed, error
UnboundLocalError: local variable 'error' referenced before assignment
This is probably a python 3 thing, I think I can fix it.
Ok, fixed it.
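If the "python 3 thing" is the usual except-scope issue, a simplified, hypothetical sketch of the kind of fix looks like this (not the actual patch):

# In Python 3, the name bound by "except ... as e" is unbound when the
# except block exits, so returning it later raises UnboundLocalError.
# Binding defaults up front and copying the exception to an outer name
# avoids that.
def checkpoint_sketch(save):
    actor_checkpoint_failed, error = False, None  # defaults bound up front
    try:
        save()
    except Exception as e:
        actor_checkpoint_failed = True
        error = e  # keep a reference that survives the except block
    return actor_checkpoint_failed, error

def failing_save():
    raise Exception("failure")

failed, err = checkpoint_sketch(failing_save)
assert failed and str(err) == "failure"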
python/ray/actor.py (outdated)
    previous_object_id = previous_object_id[0]
    # Make sure that a previous object was given.
    if previous_object_id is None:
        return False
If this code path is ever taken, that's a bug, right (that is, if previous_object_id is None)? Or can it happen normally?
Oh yeah, that's right.
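A minimal sketch of the agreed direction (the helper here is hypothetical): fail loudly rather than returning False silently.

# Since a missing previous object indicates a bug rather than a normal
# condition, assert instead of silently returning. Hypothetical helper.
def require_previous_object_id(previous_object_id):
    assert previous_object_id is not None, (
        "Checkpoint task submitted without the previous dummy object.")
    return previous_object_id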
Looks good to me!
Actor checkpointing built on top of the existing reconstruction infrastructure.
Briefly, after every checkpoint_interval tasks on a given actor, a mock task responsible for saving and resuming checkpoints is submitted. During normal execution, the task saves a checkpoint of the actor's state to Redis. During reconstruction, the task tries to resume from the most recent checkpoint. If the most recent checkpoint is older than the point being restored, the task requests reconstruction of the tasks in between.
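As a usage sketch: the checkpoint_interval argument and the __ray_save__ hook both appear in this thread, while the __ray_restore__ name for the resume hook is an assumption.

import ray

ray.init()

# Usage sketch. checkpoint_interval and __ray_save__ appear in this
# thread; __ray_restore__ as the resume hook is an assumption.
@ray.remote(checkpoint_interval=5)
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

    def __ray_save__(self):
        # The state returned here is what gets checkpointed to Redis.
        return self.value

    def __ray_restore__(self, saved_value):
        # Called on reconstruction instead of replaying every task.
        self.value = saved_value

c = Counter.remote()
# A checkpoint task is inserted after every checkpoint_interval tasks.
print(ray.get([c.increment.remote() for _ in range(10)]))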