-
Notifications
You must be signed in to change notification settings - Fork 6.2k
Replace actor dummy objects with mock calls to the local scheduler #1467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace actor dummy objects with mock calls to the local scheduler #1467
Conversation
Merged build finished. Test FAILed. |
Test FAILed. |
@@ -580,7 +580,8 @@ void assign_task_to_worker(LocalSchedulerState *state, | |||
|
|||
void finish_task(LocalSchedulerState *state, | |||
LocalSchedulerClient *worker, | |||
bool actor_checkpoint_failed) { | |||
bool actor_checkpoint_failed, | |||
std::vector<ObjectID> execution_returns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass as const ref
* @return Void. | ||
*/ | ||
void finish_task(LocalSchedulerState *state, | ||
LocalSchedulerClient *worker, | ||
bool actor_checkpoint_failed); | ||
bool actor_checkpoint_failed, | ||
std::vector<ObjectID> execution_returns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const ref
@@ -47,6 +47,9 @@ table GetTaskRequest { | |||
// If true, then the local scheduler will not update the actor's task | |||
// counter to match the assigned checkpoint index. | |||
actor_checkpoint_failed: bool; | |||
// The list of stateful dependencies that were fulfilled by the task that | |||
// this worker just finished executing. | |||
execution_returns: [string]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused about why this needs to go through the worker at all. If I understand correctly, the following happens:
- A local scheduler assigns a task to a worker.
- The actor takes the last object ID from the task spec and calls that the dummy object ID.
- The actor passes the dummy object ID back to the local scheduler in
execution_returns
. - The local scheduler designates that object as locally available.
Since all of the information is in the local scheduler to begin with, couldn't we just bypass the actor entirely and have the local scheduler do the bookkeeping for itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For normal tasks, that's okay, but I think we'll need something like this for checkpointing. The future implementation for checkpointing should have a way for the actor to pass to the local scheduler which execution edges were included in the resumed checkpoint.
I actually agree that what you suggested is a better way to do it, and then later add a separate IPC for the special case of checkpoint resumption, but I think that will break the current checkpointing implementation. What do you think about temporarily disallowing checkpointing as part of this PR? I was going to rewrite checkpointing in my next PR anyway but wanted to get this one merged first. Another option is to do that now and fix up the current checkpointing implementation accordingly, but I'm wary of spending the effort to do that when it will get scrapped soon anyway.
dfc2953
to
1ea7a0d
Compare
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build finished. Test FAILed. |
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you didn't need to disable any tests?
This looks good to me. Can you rebase this on the master and push to ray-private-travis to retrigger the tests? (I suggest rebasing because there are some test failures that were fixed this morning relating to a new version of gym
that was pushed yesterday.)
1ea7a0d
to
34e3ec7
Compare
Yep, I just added a check on the local scheduler to make sure that the checkpoint task succeeded before calling the handler for an available object. We can remove the check in the future implementation of checkpointing. |
Merged build finished. Test PASSed. |
Test PASSed. |
What do these changes do?
Each actor task returns a dummy object ID to represent stateful execution dependencies. Previously, actors would store empty objects in the local object store, which takes up space and also makes the objects appear globally available, even though they should only be used by the actor's corresponding local scheduler.
This PR solves these issues by replacing the dummy objects with a call to the local scheduler's handler for objects becoming available.
Related issue number
Resolves #1219.