Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[xray] Fix bug when counting a task's lineage size #2600

Merged

Conversation

stephanie-wang
Copy link
Contributor

What do these changes do?

In the lineage cache, we sometimes count the size of the uncommitted lineage of a task so that we can decide whether we should evict it. The method to count the size walks the task's lineage tree but does not consider whether a node has been seen before, so it ends up double-counting any nodes that appear twice in the tree.

In the test case python -m pytest test/stress_tests.py::test_dependencies[ray_start_combination1], which submits tasks with many interleaving dependencies, this resulted in crashes of the raylet because a handler on the event loop could end up spending minutes in the CountUnsubscribedLineage method. Then, heartbeats were not sent quickly enough and the raylet would be marked as dead.

This PR adds back the test/stress_tests.py::test_dependencies and test/stress_tests.py::test_submitting_many_tasks tests for raylet.

@@ -211,7 +211,12 @@ void LineageCache::AddReadyTask(const Task &task) {
}
}

uint64_t LineageCache::CountUnsubscribedLineage(const TaskID &task_id) const {
uint64_t LineageCache::CountUnsubscribedLineage(const TaskID &task_id,
std::unordered_set<TaskID> &seen) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think you can maybe default initialize here so you don't have to pass it in on the root call const std::unordered_set<TaskID> &seen = std::unordered_set<TaskID>() (but is potentially uglier).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer the current way.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7325/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7324/
Test PASSed.

Copy link
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@robertnishihara robertnishihara merged commit 6ab01a2 into ray-project:master Aug 8, 2018
@robertnishihara robertnishihara deleted the fix-lineage-cache-count branch August 8, 2018 07:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants