Refactor actor task queues #1118
Conversation
Merged build finished. Test FAILed.
Test FAILed.
```diff
@@ -1376,17 +1395,42 @@ void handle_object_removed(LocalSchedulerState *state,
   }
 }
 
+std::vector<ActorID> empty_actor_queues;
```
I had completely forgotten about this `handle_object_removed` function. This has always felt super expensive to me, but I don't see a good alternative.
Yeah, same for me. A better way is probably to remove the mapping to dependent tasks only when the task is actually assigned. We should probably add a stress test for this behavior where there is a lot of churn in the object store.
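For illustration, here is a minimal sketch of that reverse-index idea, with stand-in types throughout (`ObjectID`, `TaskEntry`, and `DependencyIndex` are hypothetical names, not code from this PR): an eviction touches only the tasks that actually depend on the evicted object, and the mapping is erased lazily once the task is assigned.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using ObjectID = std::string;  /* Stand-in for Ray's ObjectID type. */
struct TaskEntry {};           /* Stand-in for a queued-task record. */

struct DependencyIndex {
  /* Reverse index: object -> tasks that are waiting on it. */
  std::unordered_map<ObjectID, std::vector<TaskEntry *>> dependent_tasks;

  /* Record the dependency when a task with missing objects is queued. */
  void add(const ObjectID &object_id, TaskEntry *task) {
    dependent_tasks[object_id].push_back(task);
  }

  /* On eviction, look up the affected tasks directly instead of
   * scanning every queue the way handle_object_removed does today. */
  const std::vector<TaskEntry *> *on_object_removed(
      const ObjectID &object_id) const {
    auto it = dependent_tasks.find(object_id);
    return it == dependent_tasks.end() ? nullptr : &it->second;
  }

  /* Erase the mapping lazily, once the task is actually assigned. */
  void on_task_assigned(const ObjectID &object_id) {
    dependent_tasks.erase(object_id);
  }
};
```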
```cpp
  /* If a task was assigned to this actor and there was no checkpoint
   * failure, then it is now loaded. */
  if (entry.assigned_task_counter > -1) {
    entry.loaded = true;
```
If it happened to rerun an old checkpoint task, then this will prevent it from running newer checkpoint tasks, right?
Is this avoided in practice by only storing the most recent checkpoint so that all earlier checkpoint tasks fail?
Yeah, each checkpoint task deletes earlier checkpoints.
In the future, we can optimize this a bit by having any checkpoint task try to reload the most recent checkpoint, not just the one with a matching index. We'll have to change the response to the local scheduler from `actor_checkpoint_failed` to an actual integer value for the new counter, though. We should think about maybe making that a separate IPC.
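A sketch of what that response change might look like, with an assumed field name (`resumed_checkpoint_counter` is an illustration, not the PR's actual message format):

```cpp
#include <cstdint>

/* Hypothetical shape of the worker's reply to the local scheduler. */
struct GetTaskReply {
  /* ... existing fields ... */

  /* -1 if no checkpoint was loaded; otherwise the task counter of the
   * checkpoint the actor resumed from. This replaces the boolean
   * actor_checkpoint_failed and lets the scheduler fast-forward
   * assigned_task_counter past tasks covered by the checkpoint. */
  int64_t resumed_checkpoint_counter = -1;
};
```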
```cpp
  /** Whether or not to request a transfer of this object. This should be set
   *  to true for all objects except for actor dummy objects, where the object
   *  must be generated by executing the task locally. */
  bool request_transfer;
```
It doesn't look like we're ever using this field. We set it in `fetch_missing_dependency`, but we never actually use the value, right?
Also, what could go wrong if we do fetch an actor dummy object?
Ah oops, that was my bad. It's supposed to check the field in `fetch_object_timeout_handler`. I'll fix that and see if I can come up with a test case for it as well.
The actor dummy objects are only supposed to be generated by the actor itself. Basically, the scenario where it would break reconstruction is if a local scheduler dies, but the corresponding plasma manager is still reachable. Then, when reconstructing the dummy object, if we transfer the object from the surviving plasma manager, it will look like we executed that task, even though we didn't. I'll clarify the documentation for that scenario.
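A rough model of the intended check, under the assumption that the timeout handler iterates over the missing objects and skips any whose `request_transfer` flag is false (the types and the `objects_to_fetch` helper are stand-ins, not the actual `fetch_object_timeout_handler`):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using ObjectID = std::string;  /* Stand-in for Ray's ObjectID type. */

struct MissingObject {
  bool request_transfer;  /* False only for actor dummy objects. */
};

/* Return the objects we should request from remote plasma managers.
 * Dummy objects are excluded: they must be regenerated by executing the
 * corresponding actor task locally, since transferring one from a
 * surviving plasma manager would make it look as if the task had
 * already executed, breaking reconstruction. */
std::vector<ObjectID> objects_to_fetch(
    const std::unordered_map<ObjectID, MissingObject> &missing) {
  std::vector<ObjectID> result;
  for (const auto &kv : missing) {
    if (kv.second.request_transfer) {
      result.push_back(kv.first);
    }
  }
  return result;
}
```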
```cpp
   * checkpoint. Before the actor has loaded, we may dispatch the first task
   * or any checkpoint tasks. After it has loaded, we may only dispatch tasks
   * in order. */
  bool loaded;
```
Won't this prevent efficient recovery from checkpoints? E.g., during recovery, why won't the actor just rerun the constructor or some early checkpoint task and then be prevented from recovering from a later checkpoint?
You could sort of solve this by only keeping around the latest checkpoint, but it could still rerun the constructor.
Ok, I think I was a bit confused. I thought the `i`th checkpoint task loaded the `i`th checkpoint, but actually it loads the most recent checkpoint, is that right?
Hmm, I think it will be okay since only certain tasks will get resubmitted during recovery. Earlier checkpoint tasks will fail since their checkpoint isn't available. It is possible that the constructor could run first, but that will only happen if someone actually needs the results of a task that happened before the checkpoint. Otherwise, the task won't get resubmitted.
For the latter case, it could be good to check for a checkpoint in the constructor (similar to what you had before). I'll add a TODO.
Nope, it does load the `i`th checkpoint, but only if it exists. Else, it'll set the `actor_checkpoint_failed` field in the response to the local scheduler.
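Summarizing this thread as a compact model (stand-in types; `can_dispatch` illustrates the rule and is not the PR's actual dispatch code):

```cpp
#include <cstdint>

/* Stand-in for the actor-mapping entry shown in the diff above. */
struct ActorEntry {
  int64_t assigned_task_counter = -1;  /* -1 until a task is assigned. */
  bool loaded = false;
};

/* Before the actor has loaded, only the constructor (counter 0) or a
 * checkpoint task may run; checkpoint task i fails if checkpoint i no
 * longer exists. After loading, tasks must run strictly in order. */
bool can_dispatch(const ActorEntry &entry, int64_t task_counter,
                  bool is_checkpoint_task) {
  if (!entry.loaded) {
    return task_counter == 0 || is_checkpoint_task;
  }
  return task_counter == entry.assigned_task_counter + 1;
}
```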
```diff
-  /* Find the first task that either matches the task counter or that is a
-   * checkpoint method. Remove any tasks that we have already executed past
-   * (e.g., by executing a more recent checkpoint method). */
+  /* Check whether we can execute the first task in the queue. */
```
Because of the dummy object dependencies, does this queue in practice only have 0 or 1 tasks in it (the other tasks being in the waiting queue)?
I think that will often be the case, but it will also have checkpoint tasks.
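A small sketch of why the queue stays short, assuming (per the reply above) that checkpoint tasks do not carry the dummy-object dependency; the ID encoding here is purely illustrative:

```cpp
#include <cstdint>
#include <optional>
#include <string>

using ObjectID = std::string;  /* Stand-in for Ray's ObjectID type. */

/* The implicit dependency of a non-checkpoint actor task is the dummy
 * object produced by its predecessor. Only one such dummy object is
 * local at a time, so at most one non-checkpoint task can leave the
 * waiting queue, while checkpoint tasks can queue up alongside it. */
std::optional<ObjectID> implicit_dependency(const std::string &actor_id,
                                            int64_t counter,
                                            bool is_checkpoint_task) {
  if (counter == 0 || is_checkpoint_task) {
    return std::nullopt;  /* No predecessor dummy object to wait on. */
  }
  return actor_id + ":dummy:" + std::to_string(counter - 1);
}
```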
Scrolling to the end of the Jenkins logs, it looks like the local scheduler died (or was marked as dead) in one of the tests. Could that be related to this PR?
retest this please
Merged build finished. Test FAILed.
Test FAILed.
Force-pushed from `d06ed91` to `c56b34f`.
Merged build finished. Test PASSed.
Test PASSed.
```cpp
  /* If the task is for an actor, and the missing object is a dummy object,
   * then we must generate it locally by executing the corresponding task.
   * All other objects may be requested from another plasma manager. */
  if (TaskSpec_is_actor_task(task_entry_it->spec) &&
```
It seems to me like we're still never calling `fetch_missing_dependencies` on actor tasks. Or are we somewhere?
Oh my mistake, I see that it happens through `queue_task_locally`.
This does a partial refactor to better unify local scheduling for actor tasks and regular tasks. Actor tasks are now added to the waiting queue if their object dependencies are not fulfilled. They get added to their corresponding actor's queue once all dependencies are local.
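For readers skimming the diff, a hedged sketch of that flow; `queue_task_locally`, `fetch_missing_dependencies`, and `TaskSpec_is_actor_task` appear in the review above, while the other helpers and the exact control flow are assumptions:

```cpp
/* Stand-in declarations; the real scheduler state and helpers differ. */
struct LocalSchedulerState;
struct TaskSpec;
bool all_dependencies_local(LocalSchedulerState *state, TaskSpec *spec);
bool TaskSpec_is_actor_task(TaskSpec *spec);
void add_to_waiting_queue(LocalSchedulerState *state, TaskSpec *spec);
void fetch_missing_dependencies(LocalSchedulerState *state, TaskSpec *spec);
void add_to_actor_queue(LocalSchedulerState *state, TaskSpec *spec);
void add_to_dispatch_queue(LocalSchedulerState *state, TaskSpec *spec);

/* Actor tasks and regular tasks now share the waiting queue; an actor
 * task moves to its actor's queue only once its dependencies are local.
 * This is also the path through which actor tasks reach
 * fetch_missing_dependencies, as noted in the review above. */
void queue_task_locally(LocalSchedulerState *state, TaskSpec *spec) {
  if (!all_dependencies_local(state, spec)) {
    add_to_waiting_queue(state, spec);
    fetch_missing_dependencies(state, spec);
  } else if (TaskSpec_is_actor_task(spec)) {
    add_to_actor_queue(state, spec);
  } else {
    add_to_dispatch_queue(state, spec);
  }
}
```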