-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[xray] Fix bug when counting a task's lineage size #2600
[xray] Fix bug when counting a task's lineage size #2600
Conversation
@@ -211,7 +211,12 @@ void LineageCache::AddReadyTask(const Task &task) { | |||
} | |||
} | |||
|
|||
uint64_t LineageCache::CountUnsubscribedLineage(const TaskID &task_id) const { | |||
uint64_t LineageCache::CountUnsubscribedLineage(const TaskID &task_id, | |||
std::unordered_set<TaskID> &seen) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think you can maybe default initialize here so you don't have to pass it in on the root call const std::unordered_set<TaskID> &seen = std::unordered_set<TaskID>()
(but is potentially uglier).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I prefer the current way.
Test FAILed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What do these changes do?
In the lineage cache, we sometimes count the size of the uncommitted lineage of a task so that we can decide whether we should evict it. The method to count the size walks the task's lineage tree but does not consider whether a node has been seen before, so it ends up double-counting any nodes that appear twice in the tree.
In the test case
python -m pytest test/stress_tests.py::test_dependencies[ray_start_combination1]
, which submits tasks with many interleaving dependencies, this resulted in crashes of the raylet because a handler on the event loop could end up spending minutes in theCountUnsubscribedLineage
method. Then, heartbeats were not sent quickly enough and the raylet would be marked as dead.This PR adds back the
test/stress_tests.py::test_dependencies
andtest/stress_tests.py::test_submitting_many_tasks
tests for raylet.