[xray] Evict tasks from the lineage cache #2152
Conversation
…committed lineage
…testing for out-of-order operations
Test FAILed.
jenkins, retest this please
Test PASSed.
src/ray/raylet/lineage_cache.cc (outdated)

  } else {
    return false;
  }
}
Purely a style thing, but I think it'd be cleaner to have a single return statement and to do something like

bool task_present = it != subscribed_tasks_.end();
if (task_present) {
  ...
}
return task_present;

and similarly for the above block.
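For context, a minimal self-contained sketch of the single-return pattern being suggested. The map type and the body of the if-branch are assumptions, since the diff only shows the closing braces of the original function:

#include <string>
#include <unordered_map>

// Illustrative stand-in: the real subscribed_tasks_ is a member of
// LineageCache keyed by TaskID; this generic map is an assumption.
static std::unordered_map<std::string, int> subscribed_tasks_;

// Single-return version of the unsubscribe check, per the review comment.
bool UnsubscribeTask(const std::string &task_id) {
  auto it = subscribed_tasks_.find(task_id);
  bool task_present = (it != subscribed_tasks_.end());
  if (task_present) {
    // Drop the bookkeeping entry only if we were actually subscribed.
    subscribed_tasks_.erase(it);
  }
  return task_present;
}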
src/common/state/ray_config.h (outdated)

@@ -112,6 +114,7 @@ class RayConfig {
       get_timeout_milliseconds_(1000),
       worker_get_request_size_(10000),
       worker_fetch_request_size_(10000),
+      max_lineage_size_(1000),
1000 feels pretty big to me.
Hmm, what do you think is a good size?
// NOTE(swang): The number of entries in the uncommitted lineage also
// includes local tasks that haven't been committed yet, not just remote
// tasks, so this is an overestimate.
const auto uncommitted_lineage = GetUncommittedLineage(task_id);
This feels like an expensive operation, especially if we allow the uncommitted lineage to grow to size 1000. And this is called every time we forward a task, right? Will this be an issue?
Yes, it will be called every time we forward a task. It shouldn't be too expensive since it's just walking a tree, but I guess it's hard to say. Would it help if we added a timing statement to GetUncommittedLineage and logged a warning if it exceeds maybe 1ms?
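A rough sketch of the timing statement under discussion, using std::chrono; the stub function and the plain stderr warning are assumptions, not the actual Ray lineage-walk signature or logging macros:

#include <chrono>
#include <iostream>

// Stub standing in for LineageCache::GetUncommittedLineage, which walks
// the lineage tree and returns a Lineage object in the real code.
int GetUncommittedLineageSizeStub() { return 0; }

int main() {
  auto start = std::chrono::steady_clock::now();
  int num_entries = GetUncommittedLineageSizeStub();
  auto elapsed_us = std::chrono::duration_cast<std::chrono::microseconds>(
                        std::chrono::steady_clock::now() - start)
                        .count();
  // Log a warning if the walk exceeded the 1 ms budget mentioned above.
  if (elapsed_us > 1000) {
    std::cerr << "WARNING: uncommitted lineage walk (" << num_entries
              << " entries) took " << elapsed_us << " us" << std::endl;
  }
  return 0;
}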
Up to you about the timing statement, I assume it will be visible in the profiler if it's an issue.
Hmm okay, I think it's a good sanity check. I'll run this quickly locally and see how many tasks it takes to get to 1ms.
// includes local tasks that haven't been committed yet, not just remote
// tasks, so this is an overestimate.
const auto uncommitted_lineage = GetUncommittedLineage(task_id);
if (uncommitted_lineage.GetEntries().size() > max_lineage_size_) {
What goes wrong if max_lineage_size_ = 0?
Hmm, I think that will basically just try to evict every single entry in the lineage, but the code itself should not break. The tradeoff is that the lower max_lineage_size_ is, the more often a node will request notifications from the GCS.
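To make that tradeoff concrete, a toy model (all names hypothetical, not the real LineageCache members) of how the threshold drives notification traffic:

#include <cstddef>
#include <string>
#include <vector>

// Toy model: a smaller max_lineage_size_ keeps the uncommitted lineage
// short, but triggers GCS notification requests more often.
struct LineageCacheModel {
  size_t max_lineage_size_;
  size_t notification_requests_ = 0;  // proxy for GCS traffic

  void OnForwardTask(const std::vector<std::string> &uncommitted_lineage) {
    if (uncommitted_lineage.size() > max_lineage_size_) {
      // With max_lineage_size_ == 0 this fires on every forwarded task,
      // requesting a commit notification for each uncommitted entry.
      notification_requests_ += uncommitted_lineage.size();
    }
  }
};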
I think this can be merged when the tests pass. For the timing statement, either way is ok.
Test FAILed.
Okay, we can leave out the timing statement for now, but after a quick test, …
Test PASSed.
* master:
  [autoscaler] GCP node provider (ray-project#2061)
  [xray] Evict tasks from the lineage cache (ray-project#2152)
  [ASV] Add ray.init and simple Ray benchmarks (ray-project#2166)
  Re-encrypt key for uploading to S3 from travis to use travis-ci.com. (ray-project#2169)
  [rllib] Fix A3C PyTorch implementation (ray-project#2036)
  [JavaWorker] Do not kill local-scheduler-forked workers in RunManager.cleanup (ray-project#2151)
  Update Travis CI badge from travis-ci.org to travis-ci.com. (ray-project#2155)
  Implement Python global state API for xray. (ray-project#2125)
  [xray] Improve flush algorithm for the lineage cache (ray-project#2130)
  Fix support for actor classmethods (ray-project#2146)
  Add empty df test (ray-project#1879)
  [JavaWorker] Enable java worker support (ray-project#2094)
  [DataFrame] Fixing the code formatting of the tests (ray-project#2123)
  Update resource documentation (remove outdated limitations). (ray-project#2022)
  bugfix: use array redis_primary_addr out of its scope (ray-project#2139)
  Fix infinite retry in Push function. (ray-project#2133)
  [JavaWorker] Changes to the directory under src for support java worker (ray-project#2093)
  Integrate credis with Ray & route task table entries into credis. (ray-project#1841)
What do these changes do?
This improves the eviction algorithm for the lineage cache. Changes (a sketch of the first behavior follows this list):

- Evict tasks with the COMMITTED status from the lineage cache. These are tasks that have been committed in the GCS. Instead of keeping these entries around for an indefinite time, tasks are now evicted from the cache as soon as a notification for their commit is received.
- Tasks forwarded to another node are marked REMOTE and must be evicted. This change periodically evicts these remote tasks by requesting a notification for their commit after the uncommitted lineage exceeds the configured size.
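As a reading aid, a minimal sketch (hypothetical names; the real LineageCache keys entries by TaskID and tracks more states) of the eviction-on-commit behavior the first bullet describes:

#include <string>
#include <unordered_map>

// Minimal model of the eviction-on-commit path summarized above.
enum class GcsStatus { UNCOMMITTED, COMMITTED, REMOTE };

struct LineageCacheModel {
  std::unordered_map<std::string, GcsStatus> entries_;

  // GCS commit-notification handler: instead of keeping the entry around
  // indefinitely, it is evicted as soon as the commit is known.
  void HandleEntryCommitted(const std::string &task_id) {
    entries_.erase(task_id);
  }
};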