Skip to content

Nondeterministic reconstruction for actors #1344

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

stephanie-wang
Copy link
Contributor

What do these changes do?

Local scheduler asynchronously updates each actor task's execution dependencies upon dispatch so that each actor task is dependent on the task executed immediately before. Then, during actor reconstruction, the local scheduler automatically follows the initial order of execution (within some error bound due to message asynchrony).

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2845/
Test PASSed.

Copy link
Collaborator

@robertnishihara robertnishihara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This seems pretty simple.

What (if any) are the race conditions where a local scheduler dies and the execution edges for a task haven't been written in Redis yet (but the results of that task have already been shipped to another machine)? It seems like this could happen if the local scheduler's Redis client is very slow.

# Wait for the forks to complete their tasks.
enqueue_tasks = ray.get(enqueue_tasks)
enqueue_tasks = [object_id for object_id_list in enqueue_tasks for
object_id in object_id_list]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be cleaner as

enqueue_tasks = [x[0] for x in enqueue_tasks]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thanks!

give_task_to_local_scheduler(
state, state->algorithm_state, *execution_spec,
state->actor_mapping[actor_id].local_scheduler_id);
if (DBClientID_equal(state->actor_mapping[actor_id].local_scheduler_id,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an optimization to avoid going through Redis when the local scheduler should keep the task for itself, right? It does not affect correctness, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, it's an optimization. I think it would be nice to keep, since otherwise we get a bunch of spurious warning messages ("Local scheduler is trying to assign a task to itself."). You can see this if you increase the number of forks or tasks in the unit test that I added in actor_test.py.

@stephanie-wang
Copy link
Contributor Author

Yes, there will be a race condition there since we use asynchronous messages to update global control state. Any updates that are still in flight to the control state when the local scheduler dies will be lost. We only guarantee deterministic execution for tasks whose execution edge updates have gone through; the other tasks will re-execute but not necessarily in the same order.

I'll add a comment about this in the code.

@robertnishihara
Copy link
Collaborator

I also noticed that this PR has some test failures in Travis for the actor reconstruction tests. Any idea about those?

@stephanie-wang stephanie-wang force-pushed the actor-nondeterministic-reconstruction branch from 13400b4 to 9b77d36 Compare January 16, 2018 22:58
@AmplabJenkins
Copy link

Build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3264/
Test PASSed.

@AmplabJenkins
Copy link

Build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3265/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3266/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3279/
Test PASSed.

Copy link
Collaborator

@robertnishihara robertnishihara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still waiting for Travis...

@stephanie-wang stephanie-wang merged commit 74718ef into ray-project:master Jan 21, 2018
@stephanie-wang stephanie-wang deleted the actor-nondeterministic-reconstruction branch January 21, 2018 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants