Nondeterministic reconstruction for actors #1344
Conversation
Merged build finished. Test PASSed.
Nice! This seems pretty simple.
What (if any) are the race conditions where a local scheduler dies and the execution edges for a task haven't been written in Redis yet (but the results of that task have already been shipped to another machine)? It seems like this could happen if the local scheduler's Redis client is very slow.
test/actor_test.py
# Wait for the forks to complete their tasks.
enqueue_tasks = ray.get(enqueue_tasks)
enqueue_tasks = [object_id for object_id_list in enqueue_tasks for
                 object_id in object_id_list]
I think this would be cleaner as
enqueue_tasks = [x[0] for x in enqueue_tasks]
Will do, thanks!
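For reference, a minimal standalone sketch (plain Python, no Ray; the ID strings are made up) of why the two forms agree here, assuming each fork returns a one-element list of object IDs:

# Each fork returns a list containing exactly one object ID.
enqueue_tasks = [["id_0"], ["id_1"], ["id_2"]]  # stand-in for ray.get(...)

# Original form: flatten all of the inner lists.
flattened = [object_id for object_id_list in enqueue_tasks
             for object_id in object_id_list]

# Suggested form: take the single element of each inner list.
simplified = [x[0] for x in enqueue_tasks]

assert flattened == simplified  # holds whenever every inner list has length 1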
give_task_to_local_scheduler(
    state, state->algorithm_state, *execution_spec,
    state->actor_mapping[actor_id].local_scheduler_id);
if (DBClientID_equal(state->actor_mapping[actor_id].local_scheduler_id,
This is an optimization to avoid going through Redis when the local scheduler should keep the task for itself, right? It does not affect correctness, right?
Yup, it's an optimization. I think it would be nice to keep, since otherwise we get a bunch of spurious warning messages ("Local scheduler is trying to assign a task to itself."). You can see this if you increase the number of forks or tasks in the unit test that I added in actor_test.py.
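For readers following along, a rough standalone Python sketch of the branch being discussed; this is only an illustration of the idea, not the actual C implementation, and all of the names below are hypothetical:

# Illustrative sketch of the dispatch decision for an actor task.
# The real logic lives in the C local scheduler; these classes and
# return values are placeholders, not Ray APIs.
from dataclasses import dataclass

@dataclass
class ActorEntry:
    local_scheduler_id: str

@dataclass
class SchedulerState:
    local_scheduler_id: str
    actor_mapping: dict  # actor_id -> ActorEntry

def dispatch_actor_task(state, actor_id, task):
    target = state.actor_mapping[actor_id].local_scheduler_id
    if target == state.local_scheduler_id:
        # Optimization: this local scheduler owns the actor, so queue the
        # task directly instead of routing it back to itself through Redis
        # (which would also log "Local scheduler is trying to assign a
        # task to itself.").
        return ("queued_locally", task)
    # Otherwise hand the task to the remote local scheduler through the
    # global control state as usual.
    return ("forwarded_to", target)

state = SchedulerState("ls_1", {"actor_a": ActorEntry("ls_1"),
                                "actor_b": ActorEntry("ls_2")})
assert dispatch_actor_task(state, "actor_a", "task_1")[0] == "queued_locally"
assert dispatch_actor_task(state, "actor_b", "task_2") == ("forwarded_to", "ls_2")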
Yes, there will be a race condition there since we use asynchronous messages to update the global control state. Any updates that are still in flight to the control state when the local scheduler dies will be lost. We only guarantee deterministic execution for tasks whose execution edge updates have gone through; the other tasks will re-execute but not necessarily in the same order. I'll add a comment about this in the code.
I also noticed that this PR has some test failures in Travis for the actor reconstruction tests. Any idea about those?
…tic reconstruction (13400b4 to 9b77d36)
Build finished. Test PASSed.
Merged build finished. Test PASSed.
Still waiting for Travis...
What do these changes do?
The local scheduler asynchronously updates each actor task's execution dependencies upon dispatch so that each actor task depends on the task executed immediately before it. Then, during actor reconstruction, the local scheduler automatically follows the initial order of execution (within some error bound due to message asynchrony).
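As a conceptual illustration only (the real logic lives in the C local scheduler, and every name below is hypothetical), a minimal Python sketch of chaining execution dependencies at dispatch time:

# Illustrative sketch of per-actor dependency chaining at dispatch time.
from collections import defaultdict

# Per-actor record of the most recently dispatched task's ID.
last_dispatched_task = {}
# Execution dependencies that would be recorded (asynchronously, in the
# real system) in the global control state.
recorded_execution_deps = defaultdict(list)

def dispatch_actor_task(actor_id, task_id):
    """Record a dependency on the task executed immediately before."""
    prev = last_dispatched_task.get(actor_id)
    if prev is not None:
        # Each actor task depends on the previous task on the same actor.
        recorded_execution_deps[task_id].append(prev)
    last_dispatched_task[actor_id] = task_id

# Dispatch three tasks on one actor; the recorded chain is t1 -> t2 -> t3.
for t in ["t1", "t2", "t3"]:
    dispatch_actor_task("actor_a", t)

assert recorded_execution_deps["t2"] == ["t1"]
assert recorded_execution_deps["t3"] == ["t2"]

During reconstruction, replaying tasks in the order given by these recorded edges reproduces the original execution order for every task whose update reached the control state; as noted above, tasks whose updates were still in flight may re-execute in a different order.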