
[xray] Improve flush algorithm for the lineage cache #2130


Merged

Conversation

@stephanie-wang
Contributor

What do these changes do?

This improves the algorithm for flushing tasks from the lineage cache to the GCS. A task is flushed only once all of its parents have been flushed to the GCS. Previously, we flushed tasks in batches: every time a new task was marked as ready, we iterated over all ready tasks. This could leave some tasks "stuck". For example, if a child task's parents are all committed but no new task is ever marked as ready, the child task would never be flushed.

This PR tracks the parent -> child dependencies for tasks that are ready to write. Then, when a task is committed, we use this map to check for any children that are now flushable.
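A minimal sketch of this bookkeeping, with simplified placeholder types (a real ray::TaskID, the GCS client, and the actual Lineage structures are elided; names like num_uncommitted_parents_ are hypothetical, not the PR's, though uncommitted_ready_children_ does appear in the diff):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using TaskID = std::string;  // stand-in for ray::TaskID

class LineageCacheSketch {
 public:
  // Called when a task becomes READY. Flush immediately if all parents
  // are committed; otherwise record the parent -> child edges so the
  // task is revisited as its parents commit.
  void MarkTaskReady(const TaskID &task_id, const std::vector<TaskID> &parents) {
    std::size_t num_uncommitted = 0;
    for (const auto &parent_id : parents) {
      if (committed_.count(parent_id) == 0 &&
          uncommitted_ready_children_[parent_id].insert(task_id).second) {
        num_uncommitted++;
      }
    }
    if (num_uncommitted == 0) {
      FlushTask(task_id);
    } else {
      num_uncommitted_parents_[task_id] = num_uncommitted;
    }
  }

  // Called when the GCS confirms a commit. Any READY child whose last
  // uncommitted parent was this task is now flushable.
  void HandleEntryCommitted(const TaskID &task_id) {
    committed_.insert(task_id);
    auto it = uncommitted_ready_children_.find(task_id);
    if (it == uncommitted_ready_children_.end()) {
      return;
    }
    for (const auto &child_id : it->second) {
      if (--num_uncommitted_parents_[child_id] == 0) {
        num_uncommitted_parents_.erase(child_id);
        FlushTask(child_id);
      }
    }
    uncommitted_ready_children_.erase(it);
  }

 private:
  void FlushTask(const TaskID &task_id) {
    // In the real cache this writes the task to the GCS; the commit is
    // later acknowledged via the task pubsub channel, which triggers
    // HandleEntryCommitted.
  }

  std::unordered_set<TaskID> committed_;
  std::unordered_map<TaskID, std::unordered_set<TaskID>> uncommitted_ready_children_;
  std::unordered_map<TaskID, std::size_t> num_uncommitted_parents_;
};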

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5600/

@robertnishihara
Collaborator

Not necessarily related to this PR, but for testing correctness, it would be nice to be able to modulate the frequency of flushing. E.g., if we flushed strictly on a timer, we could make it very frequent or very infrequent, which could help catch bugs. What do you think about that?

Related to this PR, one category of bugs that we hit over and over was related to scheduling. Basically, we never called DispatchTasks on a timer, so we had to make sure to call DispatchTasks in every single event handler that could cause a task to become schedulable (e.g., task submission, worker blocking, worker death, etc.). There were many such places, and it was very easy to miss one, which would result in a schedulable task not executing, so things would hang. Are you worried about the same issue arising for LineageCache flushing?
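For reference, a hedged sketch of the timer-driven flushing suggested above, using a boost::asio timer of the kind the node manager's event loop is built on; PeriodicFlusher and its wiring are hypothetical, not part of this PR:

#include <boost/asio.hpp>
#include <chrono>
#include <functional>
#include <utility>

// Re-arms itself after every flush; the period can be made very small or
// very large in tests to shake out ordering bugs.
class PeriodicFlusher {
 public:
  PeriodicFlusher(boost::asio::io_context &io, std::chrono::milliseconds period,
                  std::function<void()> flush)
      : timer_(io), period_(period), flush_(std::move(flush)) {
    Schedule();
  }

 private:
  void Schedule() {
    timer_.expires_after(period_);
    timer_.async_wait([this](const boost::system::error_code &ec) {
      if (ec) {
        return;  // timer cancelled or io_context stopped
      }
      flush_();    // e.g. lineage_cache.Flush()
      Schedule();  // re-arm for the next period
    });
  }

  boost::asio::steady_timer timer_;
  std::chrono::milliseconds period_;
  std::function<void()> flush_;
};

// Usage: PeriodicFlusher flusher(io, std::chrono::milliseconds(100),
//                                [&] { lineage_cache.Flush(); });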

@robertnishihara left a comment

Looks good to me! I left a couple small comments.

  return ray::Status::OK();
}

void LineageCache::Flush() {
  // Iterate through all tasks that are READY.
  std::vector<TaskID> ready_task_ids;

@robertnishihara commented:

This isn't used anywhere, is it? If not, we should remove it.

@@ -328,6 +334,24 @@ void LineageCache::HandleEntryCommitted(const UniqueID &task_id) {
    RAY_CHECK_OK(task_pubsub_.CancelNotifications(JobID::nil(), task_id, client_id_));
    subscribed_tasks_.erase(it);
  }

  auto children_entry = uncommitted_ready_children_.find(task_id);

@robertnishihara commented:

This code block probably deserves a comment explaining what it's doing.
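For illustration, one hedged guess at the kind of comment being requested, inferred from the PR description rather than from the full diff (the surrounding logic is not shown in this hunk):

  // This task is now committed in the GCS. Look up any READY children
  // that were waiting on it; a child whose parents are all committed
  // can itself be flushed to the GCS.
  auto children_entry = uncommitted_ready_children_.find(task_id);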

@stephanie-wang
Contributor Author

Yeah, I left Flush() as public so that it could be called on a timer conveniently, which I think we should do in the node manager eventually.

Yeah, that could definitely come up for the lineage cache as well. This PR should help with the issue, but it might make sense to add a warning for tasks that haven't been flushed after a certain time, for example.
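A rough sketch of the suggested warning, assuming a hypothetical FlushWatchdog driven by the same periodic timer (none of these names come from the PR):

#include <chrono>
#include <iostream>
#include <string>
#include <unordered_map>

using TaskID = std::string;  // stand-in for ray::TaskID
using Clock = std::chrono::steady_clock;

class FlushWatchdog {
 public:
  void OnTaskReady(const TaskID &task_id) { ready_since_[task_id] = Clock::now(); }
  void OnTaskFlushed(const TaskID &task_id) { ready_since_.erase(task_id); }

  // Call periodically (e.g. from the same timer that drives Flush()) to
  // log tasks that have sat READY but unflushed for too long.
  void WarnStaleTasks(std::chrono::milliseconds threshold) const {
    const auto now = Clock::now();
    for (const auto &entry : ready_since_) {
      if (now - entry.second > threshold) {
        std::cerr << "WARNING: task " << entry.first
                  << " has been ready but unflushed for over "
                  << threshold.count() << " ms" << std::endl;
      }
    }
  }

 private:
  std::unordered_map<TaskID, Clock::time_point> ready_since_;
};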

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5667/

@robertnishihara
Collaborator

Looks good, will merge once tests pass.

@stephanie-wang merged commit 166000b into ray-project:master May 29, 2018

@stephanie-wang
Contributor Author

Thanks, @robertnishihara! I merged, since the failing tests look unrelated.

@robertnishihara deleted the lineage-stash-fast-flush branch May 29, 2018 05:13
alok added a commit to alok/ray that referenced this pull request Jun 3, 2018
* master:
  [autoscaler] GCP node provider (ray-project#2061)
  [xray] Evict tasks from the lineage cache (ray-project#2152)
  [ASV] Add ray.init and simple Ray benchmarks (ray-project#2166)
  Re-encrypt key for uploading to S3 from travis to use travis-ci.com. (ray-project#2169)
  [rllib] Fix A3C PyTorch implementation (ray-project#2036)
  [JavaWorker] Do not kill local-scheduler-forked workers in RunManager.cleanup (ray-project#2151)
  Update Travis CI badge from travis-ci.org to travis-ci.com. (ray-project#2155)
  Implement Python global state API for xray. (ray-project#2125)
  [xray] Improve flush algorithm for the lineage cache (ray-project#2130)
  Fix support for actor classmethods (ray-project#2146)
  Add empty df test (ray-project#1879)
  [JavaWorker] Enable java worker support (ray-project#2094)
  [DataFrame] Fixing the code formatting of the tests (ray-project#2123)
  Update resource documentation (remove outdated limitations). (ray-project#2022)
  bugfix: use array redis_primary_addr out of its scope (ray-project#2139)
  Fix infinite retry in Push function. (ray-project#2133)
  [JavaWorker] Changes to the directory under src for support java worker (ray-project#2093)
  Integrate credis with Ray & route task table entries into credis. (ray-project#1841)