Skip to content

[xray] Fix bug when counting a task's lineage size#2600

Merged
robertnishihara merged 3 commits into
ray-project:masterfrom
stephanie-wang:fix-lineage-cache-count
Aug 8, 2018
Merged

[xray] Fix bug when counting a task's lineage size#2600
robertnishihara merged 3 commits into
ray-project:masterfrom
stephanie-wang:fix-lineage-cache-count

Conversation

@stephanie-wang
Copy link
Copy Markdown
Contributor

What do these changes do?

In the lineage cache, we sometimes count the size of the uncommitted lineage of a task so that we can decide whether we should evict it. The method to count the size walks the task's lineage tree but does not consider whether a node has been seen before, so it ends up double-counting any nodes that appear twice in the tree.

In the test case python -m pytest test/stress_tests.py::test_dependencies[ray_start_combination1], which submits tasks with many interleaving dependencies, this resulted in crashes of the raylet because a handler on the event loop could end up spending minutes in the CountUnsubscribedLineage method. Then, heartbeats were not sent quickly enough and the raylet would be marked as dead.

This PR adds back the test/stress_tests.py::test_dependencies and test/stress_tests.py::test_submitting_many_tasks tests for raylet.


uint64_t LineageCache::CountUnsubscribedLineage(const TaskID &task_id) const {
uint64_t LineageCache::CountUnsubscribedLineage(const TaskID &task_id,
std::unordered_set<TaskID> &seen) const {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think you can maybe default initialize here so you don't have to pass it in on the root call const std::unordered_set<TaskID> &seen = std::unordered_set<TaskID>() (but is potentially uglier).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer the current way.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7325/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7324/
Test PASSed.

Copy link
Copy Markdown
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@robertnishihara robertnishihara merged commit 6ab01a2 into ray-project:master Aug 8, 2018
@robertnishihara robertnishihara deleted the fix-lineage-cache-count branch August 8, 2018 07:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants