Nondeterministic reconstruction for actors #1344
Conversation
Merged build finished. Test PASSed.
Nice! This seems pretty simple.
What (if any) are the race conditions where a local scheduler dies and the execution edges for a task haven't been written in Redis yet (but the results of that task have already been shipped to another machine)? It seems like this could happen if the local scheduler's Redis client is very slow.
test/actor_test.py
# Wait for the forks to complete their tasks.
enqueue_tasks = ray.get(enqueue_tasks)
enqueue_tasks = [object_id for object_id_list in enqueue_tasks for
                 object_id in object_id_list]
I think this would be cleaner as
enqueue_tasks = [x[0] for x in enqueue_tasks]
Will do, thanks!
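For reference, a minimal standalone sketch (plain Python, no Ray; the ID strings are made up) of why the two forms agree here, assuming each fork returns a one-element list of object IDs:

# Each fork returns a list containing exactly one object ID.
enqueue_tasks = [["id_0"], ["id_1"], ["id_2"]]  # stand-in for ray.get(...)

# Original form: flatten all of the inner lists.
flattened = [object_id for object_id_list in enqueue_tasks
             for object_id in object_id_list]

# Suggested form: take the single element of each inner list.
simplified = [x[0] for x in enqueue_tasks]

assert flattened == simplified  # holds whenever every inner list has length 1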
give_task_to_local_scheduler(
    state, state->algorithm_state, *execution_spec,
    state->actor_mapping[actor_id].local_scheduler_id);
if (DBClientID_equal(state->actor_mapping[actor_id].local_scheduler_id,
This is an optimization to avoid going through Redis when the local scheduler should keep the task for itself, right? It does not affect correctness, right?
Yup, it's an optimization. I think it would be nice to keep, since otherwise we get a bunch of spurious warning messages ("Local scheduler is trying to assign a task to itself."). You can see this if you increase the number of forks or tasks in the unit test that I added in actor_test.py.
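For readers following along, a rough standalone Python sketch of the branch being discussed; this is only an illustration of the idea, not the actual C implementation, and all of the names below are hypothetical:

# Illustrative sketch of the dispatch decision for an actor task.
# The real logic lives in the C local scheduler; these classes and
# return values are placeholders, not Ray APIs.
from dataclasses import dataclass

@dataclass
class ActorEntry:
    local_scheduler_id: str

@dataclass
class SchedulerState:
    local_scheduler_id: str
    actor_mapping: dict  # actor_id -> ActorEntry

def dispatch_actor_task(state, actor_id, task):
    target = state.actor_mapping[actor_id].local_scheduler_id
    if target == state.local_scheduler_id:
        # Optimization: this local scheduler owns the actor, so queue the
        # task directly instead of routing it back to itself through Redis
        # (which would also log "Local scheduler is trying to assign a
        # task to itself.").
        return ("queued_locally", task)
    # Otherwise hand the task to the remote local scheduler through the
    # global control state as usual.
    return ("forwarded_to", target)

state = SchedulerState("ls_1", {"actor_a": ActorEntry("ls_1"),
                                "actor_b": ActorEntry("ls_2")})
assert dispatch_actor_task(state, "actor_a", "task_1")[0] == "queued_locally"
assert dispatch_actor_task(state, "actor_b", "task_2") == ("forwarded_to", "ls_2")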
Yes, there will be a race condition there since we use asynchronous messages to update the global control state. Any updates that are still in flight to the control state when the local scheduler dies will be lost. We only guarantee deterministic execution for tasks whose execution edge updates have gone through; the other tasks will re-execute but not necessarily in the same order. I'll add a comment about this in the code.
I also noticed that this PR has some test failures in Travis for the actor reconstruction tests. Any idea about those?
…tic reconstruction (13400b4 to 9b77d36)
Build finished. Test PASSed.
Merged build finished. Test PASSed.
Still waiting for Travis...
What do these changes do?
The local scheduler asynchronously updates each actor task's execution dependencies upon dispatch so that each actor task depends on the task executed immediately before it. Then, during actor reconstruction, the local scheduler automatically follows the initial order of execution (within some error bound due to message asynchrony).
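As a conceptual illustration only (the real logic lives in the C local scheduler, and every name below is hypothetical), a minimal Python sketch of chaining execution dependencies at dispatch time:

# Illustrative sketch of per-actor dependency chaining at dispatch time.
from collections import defaultdict

# Per-actor record of the most recently dispatched task's ID.
last_dispatched_task = {}
# Execution dependencies that would be recorded (asynchronously, in the
# real system) in the global control state.
recorded_execution_deps = defaultdict(list)

def dispatch_actor_task(actor_id, task_id):
    """Record a dependency on the task executed immediately before."""
    prev = last_dispatched_task.get(actor_id)
    if prev is not None:
        # Each actor task depends on the previous task on the same actor.
        recorded_execution_deps[task_id].append(prev)
    last_dispatched_task[actor_id] = task_id

# Dispatch three tasks on one actor; the recorded chain is t1 -> t2 -> t3.
for t in ["t1", "t2", "t3"]:
    dispatch_actor_task("actor_a", t)

assert recorded_execution_deps["t2"] == ["t1"]
assert recorded_execution_deps["t3"] == ["t2"]

During reconstruction, replaying tasks in the order given by these recorded edges reproduces the original execution order for every task whose update reached the control state; as noted above, tasks whose updates were still in flight may re-execute in a different order.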