[xray] Evict tasks from the lineage cache #2152

Merged

Conversation

stephanie-wang (Contributor)

What do these changes do?

This improves the eviction algorithm for the lineage cache. Changes:

  1. Remove the COMMITTED status from the lineage cache. These are tasks whose entries have already been committed to the GCS. Instead of keeping these entries around indefinitely, a task is now evicted from the cache as soon as the notification for its commit is received.
  2. Attempt to bound each remote task's uncommitted lineage to a configurable maximum size. This is necessary for actor tasks, where a long chain of tasks may be forwarded to another node for execution. In that case, the tasks remain in the sending node's lineage cache with status REMOTE and must eventually be evicted. This change periodically evicts these remote tasks by requesting a notification for their commit once the uncommitted lineage exceeds the configured size (a sketch of this flow follows the list).
  3. Better checking when requesting/canceling notifications for a task.
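
To make the flow concrete, here is a minimal, self-contained C++ sketch of the two eviction paths described in items 1 and 2 above. It is not the actual lineage cache implementation; the class, method, and member names (LineageCacheSketch, HandleEntryCommitted, MarkTaskForwarded, RequestNotification) are illustrative placeholders, and TaskID is simplified to a string.

// Minimal sketch of the eviction flow described in the list above. Not the
// real implementation; all names here are illustrative placeholders.
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using TaskID = std::string;

enum class GcsStatus { UNCOMMITTED_LOCAL, UNCOMMITTED_REMOTE };

class LineageCacheSketch {
 public:
  explicit LineageCacheSketch(size_t max_lineage_size)
      : max_lineage_size_(max_lineage_size) {}

  void AddTask(const TaskID &task_id, GcsStatus status,
               const std::vector<TaskID> &parents) {
    entries_[task_id] = {status, parents};
  }

  // (1) A commit notification from the GCS evicts the entry immediately;
  // there is no lingering COMMITTED state in the cache anymore.
  void HandleEntryCommitted(const TaskID &task_id) {
    entries_.erase(task_id);
    subscribed_tasks_.erase(task_id);  // Cancel the notification, if any.
  }

  // (2) Called when a task is forwarded to another node. If the task's
  // uncommitted lineage has grown past the configured bound, request commit
  // notifications for its remote ancestors so they can be evicted later.
  void MarkTaskForwarded(const TaskID &task_id) {
    const auto lineage = GetUncommittedLineage(task_id);
    if (lineage.size() > max_lineage_size_) {
      for (const auto &ancestor : lineage) {
        if (entries_.at(ancestor).status == GcsStatus::UNCOMMITTED_REMOTE) {
          RequestNotification(ancestor);
        }
      }
    }
  }

 private:
  struct Entry {
    GcsStatus status;
    std::vector<TaskID> parents;
  };

  // Walk parent links to collect all uncommitted ancestors plus the task
  // itself. This is the tree walk discussed in the review below; it also
  // counts local uncommitted tasks, so the size is an overestimate.
  std::unordered_set<TaskID> GetUncommittedLineage(const TaskID &task_id) const {
    std::unordered_set<TaskID> lineage;
    std::vector<TaskID> stack = {task_id};
    while (!stack.empty()) {
      const TaskID current = stack.back();
      stack.pop_back();
      auto it = entries_.find(current);
      if (it == entries_.end() || !lineage.insert(current).second) {
        continue;  // Already evicted (committed) or already visited.
      }
      for (const auto &parent : it->second.parents) {
        stack.push_back(parent);
      }
    }
    return lineage;
  }

  // Stand-in for subscribing to the GCS task table; a real implementation
  // would issue a request and later receive HandleEntryCommitted.
  void RequestNotification(const TaskID &task_id) {
    if (subscribed_tasks_.insert(task_id).second) {
      std::cout << "requesting commit notification for " << task_id << "\n";
    }
  }

  size_t max_lineage_size_;
  std::unordered_map<TaskID, Entry> entries_;
  std::unordered_set<TaskID> subscribed_tasks_;
};

The check in MarkTaskForwarded mirrors the GetEntries().size() > max_lineage_size_ comparison visible in the diff excerpts below.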

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5691/
Test FAILed.

@robertnishihara (Collaborator)

jenkins, retest this please

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5710/
Test PASSed.

  } else {
    return false;
  }
}
@robertnishihara (Collaborator), May 30, 2018

Purely a style thing, but I think it'd be cleaner to have a single return statement and to do something like

bool task_present = it != subscribed_tasks_.end();
if (task_present) {
  ...
}
return task_present;

and similarly for the above block.
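
For illustration only, here is what that single-return shape could look like as a self-contained function. The surrounding function is not shown in this excerpt, so the name UnsubscribeTask and the erase() body are assumptions; only the control-flow shape follows the suggestion.

// Illustration of the single-return pattern suggested above. The function
// name and the erase() body are hypothetical.
#include <string>
#include <unordered_set>

using TaskID = std::string;

// Mirrors the subscribed_tasks_ member referenced in the snippet above.
static std::unordered_set<TaskID> subscribed_tasks_;

bool UnsubscribeTask(const TaskID &task_id) {
  auto it = subscribed_tasks_.find(task_id);
  const bool task_present = it != subscribed_tasks_.end();
  if (task_present) {
    // Cancel the subscription for this task (hypothetical body).
    subscribed_tasks_.erase(it);
  }
  return task_present;
}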

@@ -112,6 +114,7 @@ class RayConfig {
      get_timeout_milliseconds_(1000),
      worker_get_request_size_(10000),
      worker_fetch_request_size_(10000),
      max_lineage_size_(1000),
@robertnishihara (Collaborator)

1000 feels pretty big to me

@stephanie-wang (Contributor, Author)

Hmm, what do you think is a good size?

// NOTE(swang): The number of entries in the uncommitted lineage also
// includes local tasks that haven't been committed yet, not just remote
// tasks, so this is an overestimate.
const auto uncommitted_lineage = GetUncommittedLineage(task_id);
@robertnishihara (Collaborator)

This feels like an expensive operation, especially if we allow the uncommitted lineage to grow to size 1000. And this is called every time we forward a task, right? Will this be an issue?

@stephanie-wang (Contributor, Author)

Yes, it will be called every time we forward a task. It shouldn't be too expensive since it's just walking a tree, but I guess it's hard to say. Would it help if we added a timing statement to GetUncommittedLineage and logged a warning if it exceeds maybe 1ms?

@robertnishihara (Collaborator)

Up to you about the timing statement, I assume it will be visible in the profiler if it's an issue.

@stephanie-wang (Contributor, Author)

Hmm okay, I think it's a good sanity check. I'll run this quickly locally and see how many tasks it takes to get to 1ms.
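
As a concrete example of the kind of timing statement being discussed, a generic helper like the one below could wrap the GetUncommittedLineage call and log a warning above a threshold. This is a sketch, not code from the PR; TimedCall is a made-up name, and the usage shown in the trailing comment assumes the forwarding path's GetUncommittedLineage(task_id) call.

// Sketch of a timing guard: run `fn`, and warn if it took longer than
// `threshold_us` microseconds. Not part of the PR; names are illustrative.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>

template <typename Fn>
auto TimedCall(const std::string &label, int64_t threshold_us, Fn &&fn)
    -> decltype(fn()) {
  const auto start = std::chrono::steady_clock::now();
  auto result = std::forward<Fn>(fn)();
  const auto elapsed_us =
      std::chrono::duration_cast<std::chrono::microseconds>(
          std::chrono::steady_clock::now() - start)
          .count();
  if (elapsed_us > threshold_us) {
    std::cerr << "[timing] " << label << " took " << elapsed_us << " us\n";
  }
  return result;
}

// Hypothetical usage at the forwarding site discussed above, warning past 1 ms:
//   const auto uncommitted_lineage = TimedCall(
//       "GetUncommittedLineage", 1000,
//       [&]() { return GetUncommittedLineage(task_id); });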

// includes local tasks that haven't been committed yet, not just remote
// tasks, so this is an overestimate.
const auto uncommitted_lineage = GetUncommittedLineage(task_id);
if (uncommitted_lineage.GetEntries().size() > max_lineage_size_) {
@robertnishihara (Collaborator)

What goes wrong if max_lineage_size_ = 0?

@stephanie-wang (Contributor, Author)

Hmm, I think that will basically just try to evict every single entry in the lineage, but the code itself should not break. The tradeoff is that the lower max_lineage_size_ is, the more often a node will request notifications from the GCS.

@robertnishihara (Collaborator)

I think this can be merged when the tests pass. I think for the timing statement either way is ok.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5735/
Test FAILed.

@stephanie-wang (Contributor, Author)

Okay, we can leave out the timing statement for now, but after a quick test, GetUncommittedLineage with size 1000 takes over 1ms, so it probably makes sense to lower that. I'll lower it to 100.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5737/
Test PASSed.

@robertnishihara robertnishihara merged commit 117107c into ray-project:master May 31, 2018
@robertnishihara robertnishihara deleted the evict-lineage-cache branch May 31, 2018 07:24
alok added a commit to alok/ray that referenced this pull request Jun 3, 2018
* master:
  [autoscaler] GCP node provider (ray-project#2061)
  [xray] Evict tasks from the lineage cache (ray-project#2152)
  [ASV] Add ray.init and simple Ray benchmarks (ray-project#2166)
  Re-encrypt key for uploading to S3 from travis to use travis-ci.com. (ray-project#2169)
  [rllib] Fix A3C PyTorch implementation (ray-project#2036)
  [JavaWorker] Do not kill local-scheduler-forked workers in RunManager.cleanup (ray-project#2151)
  Update Travis CI badge from travis-ci.org to travis-ci.com. (ray-project#2155)
  Implement Python global state API for xray. (ray-project#2125)
  [xray] Improve flush algorithm for the lineage cache (ray-project#2130)
  Fix support for actor classmethods (ray-project#2146)
  Add empty df test (ray-project#1879)
  [JavaWorker] Enable java worker support (ray-project#2094)
  [DataFrame] Fixing the code formatting of the tests (ray-project#2123)
  Update resource documentation (remove outdated limitations). (ray-project#2022)
  bugfix: use array redis_primary_addr out of its scope (ray-project#2139)
  Fix infinite retry in Push function. (ray-project#2133)
  [JavaWorker] Changes to the directory under src for support java worker (ray-project#2093)
  Integrate credis with Ray & route task table entries into credis. (ray-project#1841)