Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stability] Make task result for RemoveTask optional #5146

Merged
merged 7 commits into from
Jul 10, 2019

Conversation

pcmoritz
Copy link
Contributor

@pcmoritz pcmoritz commented Jul 9, 2019

What do these changes do?

This fixes the long-running stress test https://github.com/ray-project/ray/blob/master/ci/long_running_tests/workloads/many_drivers.py. It fixes a race condition where Task information is removed when a driver is disconnected but then still further used.

The current hypothesis is that this concerns only the SWAP queue, so there are still some checks left in. If other queues are also affected, we might need to make them non-fatal too.

Related issue number

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15217/
Test FAILed.

@pcmoritz pcmoritz changed the title Make task result for RemoveTask optional [stability] Make task result for RemoveTask optional Jul 9, 2019
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1525/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15219/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1527/
Test FAILed.

@stephanie-wang stephanie-wang self-requested a review July 9, 2019 17:42
Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I am a little confused, though, about when we should RAY_CHECK the RemoveTask return value and when we should just exit if no task was removed... can the bug affect any of the RemoveTask calls or is it only some of them?

@@ -247,7 +247,8 @@ std::vector<Task> SchedulingQueue::RemoveTasks(std::unordered_set<TaskID> &task_
return removed_tasks;
}

Task SchedulingQueue::RemoveTask(const TaskID &task_id, TaskState *removed_task_state) {
boost::optional<Task> SchedulingQueue::RemoveTask(const TaskID &task_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about a signature like this instead?

bool SchedulingQueue::RemoveTask(const TaskID &task_id, TaskState *removed_task_state, Task *removed_task);

I was thinking that it might be good to make sure that the caller doesn't try to use the TaskState if the function returns false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that first (and it's nicer), but it involved needing a default constructor for Task and TaskSpecification and TaskExecutionSpecification :)

@@ -1035,6 +1039,8 @@ void NodeManager::ProcessDisconnectClientMessage(
<< "job_id: " << worker->GetAssignedJobId();
}

client->Close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this part of the bug?

Copy link
Contributor Author

@pcmoritz pcmoritz Jul 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was another bug where we were running out of file descriptors.

@stephanie-wang stephanie-wang self-assigned this Jul 9, 2019
@pcmoritz
Copy link
Contributor Author

pcmoritz commented Jul 9, 2019

I updated the PR description with the current theory of which calls it affects.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1552/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1553/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1554/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1555/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15244/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15245/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15246/
Test PASSed.

@pcmoritz
Copy link
Contributor Author

@stephanie-wang This is ready to merge

Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good!

Actually, thinking about it more, I'm pretty sure that this bug can come up pretty much anywhere we call RemoveTask. But let's see how this works first...

@stephanie-wang stephanie-wang merged commit e6a81d4 into ray-project:master Jul 10, 2019
@stephanie-wang stephanie-wang deleted the fix-many-drivers-2 branch July 10, 2019 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants