
mark all remaining placeable tasks pending with task dependency manager #2528

Merged

Conversation

atumanov
Contributor

What do these changes do?

This PR registers all remaining placeable tasks with the task dependency manager, to ensure the reconstruction policy is not triggered on them.
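For context, a minimal sketch of what the added registration pass at the end of `NodeManager::ScheduleTasks()` might look like (the queue accessor name here is assumed, not taken from the codebase):

```c++
// Sketch only: GetPlaceableTasks() is a hypothetical accessor; the real
// iteration in node_manager.cc may differ.
// After the placement loop, register every task still in the placeable queue
// so the reconstruction policy does not trigger on it while it waits.
for (const auto &task : local_queues_.GetPlaceableTasks()) {
  // TaskPending is safe to call more than once on the same task.
  task_dependency_manager_.TaskPending(task);
}
```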

Related issue number

#2527

@stephanie-wang
Contributor

Thanks! Do we expect there to be many tasks in the placeable queue? Just wondering if there is some way we can avoid iterating through the queue every time we call ScheduleTasks().

@@ -750,6 +750,12 @@ void NodeManager::ScheduleTasks() {
       EnqueuePlaceableTask(t);
     }
   }
 
+  // All remaining placeable tasks should be registered with the task dependency
+  // manager.
Collaborator

Can you add a note here saying that it's ok to call TaskPending multiple times on the same task?

@robertnishihara
Collaborator

Maybe the right time to do this is when we put something into the Placeable queue?

@atumanov
Contributor Author

Hmm, good point. Other options I can think of are:

  1. Mark all new tasks pending with the task dependency manager in SubmitTask (so there's a tradeoff between the overhead of that and iterating over the remaining placeable queue).
  2. Piggyback on the policy, which already iterates over the queue. (This is undesirable because it breaks the abstraction: the policy would need access to / a handle on the task dependency manager.)
  3. Since it's a FIFO queue, keep a pointer to the last placeable task that was registered, stored as an extra private field in the SchedulingQueue class (see the sketch after this list).

Depending on the policy, this queue may be large.
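A minimal sketch of option 3, assuming a hypothetical cursor field on SchedulingQueue (names are illustrative, not Ray's actual API; adjusting the cursor when tasks leave the front of the queue is glossed over):

```c++
// Hypothetical private field on SchedulingQueue:
//   size_t last_registered_placeable_index_ = 0;

// ScheduleTasks() would then only register tasks that arrived since the
// previous call, instead of rescanning the whole placeable queue.
const auto &placeable = local_queues_.GetPlaceableTasks();  // accessor assumed
for (size_t i = last_registered_placeable_index_; i < placeable.size(); i++) {
  task_dependency_manager_.TaskPending(placeable[i]);
}
last_registered_placeable_index_ = placeable.size();
```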

@atumanov
Contributor Author

@robertnishihara, I think it might be worth trying both and comparing performance. I don't have a good feel (yet) for the overhead of task_dependency_manager_.TaskPending().

@stephanie-wang
Contributor

The overhead of calling task_dependency_manager_.TaskPending() is in allocating a timer and issuing a GCS operation to acquire the task lease. @atumanov, we originally used option 1, but it ended up being very inefficient because many of the tasks got forwarded to another node immediately afterward. But if that isn't happening too often anymore with #2420, then maybe option 1 is now okay.

@robertnishihara
Collaborator

@stephanie-wang there's no additional timer if we call the method twice on the same task, right?

@stephanie-wang
Contributor

@robertnishihara, that's right, the method will just exit if it gets called twice on the same task.

The main concern is calling task_dependency_manager_.TaskPending on a task that will soon get task_dependency_manager_.TaskCanceled (e.g., because it gets forwarded to another node). Calling task_dependency_manager_.TaskPending on the same task twice is probably okay.
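A rough illustration of the idempotency being described; this is just the pattern, not Ray's actual implementation, and the member and helper names are made up:

```c++
// Illustration only: an idempotent TaskPending keeps a set of pending task IDs
// and returns early on a repeat call, so no second timer or GCS lease request
// is issued for the same task.
void TaskDependencyManager::TaskPending(const TaskID &task_id) {
  if (pending_tasks_.count(task_id) > 0) {
    return;  // Already registered; calling this twice is a no-op.
  }
  pending_tasks_.insert(task_id);
  AcquireTaskLease(task_id);  // hypothetical helper: starts the lease timer and GCS write
}
```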

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7069/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7072/

@stephanie-wang left a comment
Contributor

Okay, I think we should just merge this to unblock #2420. The iteration through the placeable queue is probably okay, since we're doing it anyway in ScheduleTasks(). We can come back to it if it becomes an issue.

@stephanie-wang
Contributor

jenkins, retest this please

@robertnishihara
Collaborator

@atumanov we can merge it in its current state, but please add a comment describing the potential performance problem.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7104/

@atumanov
Contributor Author

atumanov commented Aug 2, 2018

Added the TODO; this should be ready to merge when tests pass.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7157/

@robertnishihara
Collaborator

To summarize some points from a conversation that @atumanov and I had in person, the main confusion I have with this PR is that the developer has to have a lot of knowledge to determine that all tasks are correctly registered with the task_dependency_manager_. It would be conceptually much simpler if we could maintain an invariant like "a task is registered with the task_dependency_manager_ if and only if it is in local_queues_". However, there is a performance issue with that approach when we queue something and then immediately forward it to another machine.

I'm merging this to unblock the scheduling PR, but we should redesign this to reduce the cognitive load.
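One way to picture the invariant-based design mentioned above (purely illustrative; the method names are not from the codebase): tie registration and cancellation to the local-queue operations themselves, so callers cannot get them out of sync.

```c++
// Illustration only: make "registered with task_dependency_manager_ <=> in
// local_queues_" hold by construction.
void NodeManager::EnqueueTaskLocally(const Task &task) {
  local_queues_.QueuePlaceableTask(task);          // queue it locally...
  task_dependency_manager_.TaskPending(task);      // ...and register it.
}

void NodeManager::RemoveTaskLocally(const TaskID &task_id) {
  local_queues_.RemoveTask(task_id);               // dequeue it...
  task_dependency_manager_.TaskCanceled(task_id);  // ...and unregister it.
}
// The downside noted above: a task that is queued and then immediately
// forwarded to another node pays for a TaskPending/TaskCanceled round trip.
```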

@robertnishihara merged commit 85b8b2a into ray-project:master on Aug 6, 2018
@robertnishihara deleted the placeable-noreconstruct branch on August 6, 2018 20:08