[xray] Lineage cache requests notifications from the GCS about remote tasks #1834

Merged: 13 commits merged into ray-project:master on Apr 16, 2018

Conversation

stephanie-wang (Contributor)

What do these changes do?

This extends the lineage cache to request notifications about tasks that are scheduled to execute at a remote node. This allows each node to evict tasks from its local lineage cache once a notification about the remote task's commit is received.
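
A minimal, self-contained sketch of the mechanism described above follows. All names here (LineageCacheSketch, AddRemoteTask, and so on) are illustrative placeholders rather than the actual Ray classes: the node keeps uncommitted tasks in a local cache, subscribes to commit notifications for tasks forwarded to remote nodes, and evicts an entry once the corresponding commit notification arrives.

#include <iostream>
#include <string>
#include <unordered_set>

using TaskID = std::string;

class LineageCacheSketch {
 public:
  // Called when a task is forwarded to a remote node: keep its lineage locally
  // and request a notification about its commit (stands in for the GCS call).
  void AddRemoteTask(const TaskID &task_id) {
    uncommitted_.insert(task_id);
    subscribed_.insert(task_id);  // stands in for RequestNotifications()
  }

  // Called when the GCS publishes that task_id was committed remotely:
  // cancel the subscription and evict the entry from the local cache.
  void HandleEntryCommitted(const TaskID &task_id) {
    subscribed_.erase(task_id);   // stands in for CancelNotifications()
    uncommitted_.erase(task_id);
  }

  size_t NumUncommitted() const { return uncommitted_.size(); }

 private:
  std::unordered_set<TaskID> uncommitted_;
  std::unordered_set<TaskID> subscribed_;
};

int main() {
  LineageCacheSketch cache;
  cache.AddRemoteTask("task-A");
  cache.AddRemoteTask("task-B");
  cache.HandleEntryCommitted("task-A");  // notification received from the GCS
  std::cout << "uncommitted tasks: " << cache.NumUncommitted() << std::endl;  // prints 1
  return 0;
}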

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4674/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4675/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4820/

// Stop listening for notifications about this task.
auto it = subscribed_tasks_.find(task_id);
if (it != subscribed_tasks_.end()) {
  task_pubsub_.CancelNotifications(JobID::nil(), task_id, client_id_);
Collaborator

It seems like for this use case, it would be fine for the Cancel to happen automatically once the GCS sends the notification, right?

The issue with that is that it involves special-casing the behavior in the redis module?

Contributor Author

Hmm yeah, for the Table interface, we could just call Cancel as soon as an entry is found since table entries are supposed to be immutable. Should I change that in this PR?

Collaborator

It can be a subsequent PR if that makes sense.
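
To make the idea in this exchange concrete, here is a hypothetical sketch (placeholder names, not the actual GCS API) of cancelling automatically on the first notification: because table entries are immutable, the subscription can be dropped as soon as a notification for the task is delivered, with no separate Cancel call.

#include <functional>
#include <string>
#include <unordered_map>

using TaskID = std::string;

class TaskPubsubSketch {
 public:
  using Callback = std::function<void(const TaskID &)>;

  // Register interest in a single task.
  void RequestNotifications(const TaskID &task_id, Callback callback) {
    callbacks_[task_id] = std::move(callback);
  }

  // Deliver the notification and immediately drop the subscription, emulating
  // an "auto-cancel on first notification" policy for immutable entries.
  void Publish(const TaskID &task_id) {
    auto it = callbacks_.find(task_id);
    if (it == callbacks_.end()) {
      return;
    }
    Callback callback = std::move(it->second);
    callbacks_.erase(it);  // the implicit CancelNotifications
    callback(task_id);
  }

 private:
  std::unordered_map<TaskID, Callback> callbacks_;
};

int main() {
  TaskPubsubSketch pubsub;
  pubsub.RequestNotifications("task-A", [](const TaskID &) {
    // Entry found; no explicit Cancel is needed afterwards.
  });
  pubsub.Publish("task-A");
  return 0;
}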

// Stop listening for notifications about this task.
auto it = subscribed_tasks_.find(task_id);
if (it != subscribed_tasks_.end()) {
  task_pubsub_.CancelNotifications(JobID::nil(), task_id, client_id_);
Collaborator

we should probably do something with the returned STATUS

/// will remain unchanged.
///
/// \param task_id The ID of the waiting task to remove.
void RemoveWaitingTask(const TaskID &task_id);
Collaborator

Should all of these methods return a status? Or should they handle failures internally?

@robertnishihara (Collaborator)

I added some status checks, but I wasn't really sure if they should be RAY_CHECK_OK or RAY_RETURN_NOT_OK.
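
For context, here is a simplified, self-contained illustration of the two options named above. The macro bodies below are placeholders written for this sketch, not the real Ray definitions: a RAY_CHECK_OK-style macro treats a non-ok status as fatal, while a RAY_RETURN_NOT_OK-style macro propagates the status to the caller.

#include <cstdlib>
#include <iostream>

struct Status {
  bool ok_;
  bool ok() const { return ok_; }
  static Status OK() { return Status{true}; }
  static Status Error() { return Status{false}; }
};

// Fatal on failure: suitable when a GCS error cannot be handled locally.
#define CHECK_OK_SKETCH(expr)                            \
  do {                                                   \
    Status _s = (expr);                                  \
    if (!_s.ok()) {                                      \
      std::cerr << "fatal: status not ok" << std::endl;  \
      std::abort();                                      \
    }                                                    \
  } while (0)

// Propagate on failure: suitable when the caller can decide how to recover.
#define RETURN_NOT_OK_SKETCH(expr)  \
  do {                              \
    Status _s = (expr);             \
    if (!_s.ok()) return _s;        \
  } while (0)

// Stand-in for a GCS operation such as CancelNotifications.
Status CancelNotifications() { return Status::OK(); }

Status RemoveWaitingTask() {
  RETURN_NOT_OK_SKETCH(CancelNotifications());  // caller sees the error
  return Status::OK();
}

int main() {
  CHECK_OK_SKETCH(CancelNotifications());  // process dies if this ever fails
  return RemoveWaitingTask().ok() ? 0 : 1;
}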

@@ -286,12 +295,22 @@ void PopAncestorTasks(const UniqueID &task_id, Lineage &lineage) {
}

void LineageCache::HandleEntryCommitted(const UniqueID &task_id) {
  RAY_LOG(DEBUG) << "task committed: " << task_id;
Collaborator

Under the current design, we can't actually guarantee that this method won't fire twice (or more) for the same task ID, right? Couldn't that cause issues here?

Contributor Author

Hmm, it should be okay if this method fires twice. I think this function is idempotent.

Contributor Author

Also, the only case I can think of right now where this would fire twice is if a task gets reconstructed and added to the table again. That should be pretty rare.

Contributor Author

You were right :) I changed this to check that the status is already COMMITTED if the call to set the status fails.
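
To spell out the idempotence check mentioned here, a hypothetical sketch (illustrative names, not the actual LineageCache code): if marking the entry COMMITTED fails because an earlier notification already did so, the handler verifies that the entry is indeed COMMITTED instead of treating the duplicate notification as an error.

#include <cassert>
#include <string>
#include <unordered_map>

using TaskID = std::string;
enum class GcsStatus { UNCOMMITTED, COMMITTED };

class LineageSketch {
 public:
  void Add(const TaskID &task_id) { status_[task_id] = GcsStatus::UNCOMMITTED; }

  // Returns false when the entry is already COMMITTED, e.g. because an earlier
  // notification for the same task already updated it.
  bool SetStatusCommitted(const TaskID &task_id) {
    GcsStatus &status = status_[task_id];
    if (status == GcsStatus::COMMITTED) {
      return false;
    }
    status = GcsStatus::COMMITTED;
    return true;
  }

  GcsStatus GetStatus(const TaskID &task_id) const { return status_.at(task_id); }

  void HandleEntryCommitted(const TaskID &task_id) {
    if (!SetStatusCommitted(task_id)) {
      // Duplicate notification: the only acceptable reason for the failure is
      // that the entry is already COMMITTED.
      assert(GetStatus(task_id) == GcsStatus::COMMITTED);
    }
    // ... evict the committed entry (and committed ancestors) here ...
  }

 private:
  std::unordered_map<TaskID, GcsStatus> status_;
};

int main() {
  LineageSketch lineage;
  lineage.Add("task-A");
  lineage.HandleEntryCommitted("task-A");
  lineage.HandleEntryCommitted("task-A");  // second notification is a no-op
  return 0;
}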

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4914/

@stephanie-wang (Contributor Author)

Thanks, @robertnishihara, I think the code you added looks good. We're basically calling CHECK_OK on all GCS operations besides the initial Subscribe calls, which I think is fine for now. We can revisit that once we start to explicitly handle GCS failure.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4922/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4920/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4926/

@robertnishihara (Collaborator)

Currently seeing

[ 63%] Linking CXX executable object_manager_test
In file included from /home/travis/build/ray-project/ray/src/ray/gcs/client_test.cc:5:0:
/home/travis/build/ray-project/ray/src/common/cmake/../thirdparty/hiredis/adapters/ae.h:102:12: warning: ‘int redisAeAttach(aeEventLoop*, redisAsyncContext*)’ defined but not used [-Wunused-function]
 static int redisAeAttach(aeEventLoop *loop, redisAsyncContext *ac) {
            ^
../libray.a(client.cc.o):(.data.rel.ro._ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE[_ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE]+0x10): undefined reference to `ray::gcs::PubsubInterface<ray::UniqueID>::RequestNotifications(ray::UniqueID const&, ray::UniqueID const&, ray::UniqueID const&)'
../libray.a(client.cc.o):(.data.rel.ro._ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE[_ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE]+0x18): undefined reference to `ray::gcs::PubsubInterface<ray::UniqueID>::CancelNotifications(ray::UniqueID const&, ray::UniqueID const&, ray::UniqueID const&)'
collect2: error: ld returned 1 exit status
make[2]: *** [src/ray/object_manager/object_manager_test] Error 1
make[1]: *** [src/ray/object_manager/CMakeFiles/object_manager_test.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 64%] Linking CXX executable client_test
../libray.a(client.cc.o):(.data.rel.ro._ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE[_ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE]+0x10): undefined reference to `ray::gcs::PubsubInterface<ray::UniqueID>::RequestNotifications(ray::UniqueID const&, ray::UniqueID const&, ray::UniqueID const&)'
../libray.a(client.cc.o):(.data.rel.ro._ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE[_ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE]+0x18): undefined reference to `ray::gcs::PubsubInterface<ray::UniqueID>::CancelNotifications(ray::UniqueID const&, ray::UniqueID const&, ray::UniqueID const&)'
collect2: error: ld returned 1 exit status
make[2]: *** [src/ray/gcs/client_test] Error 1
make[1]: *** [src/ray/gcs/CMakeFiles/client_test.dir/all] Error 2
make: *** [all] Error 2
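
For reference, the undefined references above point at vtable entries of PubsubInterface<UniqueID>, the usual symptom of virtual methods that are declared without a body and without "= 0". A plausible fix, sketched below with placeholder types rather than the actual Ray headers, is to declare the interface methods pure virtual so no out-of-line definition is required; providing definitions would work as well.

#include <string>

using UniqueID = std::string;  // placeholder for ray::UniqueID
struct Status {};              // placeholder for ray::Status

// A template interface whose virtual methods were only declared (no body, no
// "= 0") still gets vtable entries, producing exactly the kind of
// undefined-reference errors shown in the log. Marking them pure virtual means
// the vtable never needs out-of-line definitions for them.
template <typename ID>
class PubsubInterface {
 public:
  virtual ~PubsubInterface() = default;
  virtual Status RequestNotifications(const UniqueID &job_id, const ID &id,
                                      const UniqueID &client_id) = 0;
  virtual Status CancelNotifications(const UniqueID &job_id, const ID &id,
                                     const UniqueID &client_id) = 0;
};

// A concrete implementation just has to override both methods.
class TaskTablePubsubSketch : public PubsubInterface<UniqueID> {
 public:
  Status RequestNotifications(const UniqueID &, const UniqueID &,
                              const UniqueID &) override {
    return Status{};
  }
  Status CancelNotifications(const UniqueID &, const UniqueID &,
                             const UniqueID &) override {
    return Status{};
  }
};

int main() {
  TaskTablePubsubSketch pubsub;
  (void)pubsub;
  return 0;
}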

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4928/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4929/

@robertnishihara robertnishihara merged commit 6bd944a into ray-project:master Apr 16, 2018
@robertnishihara robertnishihara deleted the lineage-cache-gcs branch April 16, 2018 03:16
royf added a commit to royf/ray that referenced this pull request Apr 22, 2018
* master:
  Handle interrupts correctly for ASIO synchronous reads and writes. (ray-project#1929)
  [DataFrame] Adding read methods and tests (ray-project#1712)
  Allow task_table_update to fail when tasks are finished. (ray-project#1927)
  [rllib] Contribute DDPG to RLlib (ray-project#1877)
  [xray] Workers blocked in a `ray.get` release their resources (ray-project#1920)
  Raylet task dispatch and throttling worker startup (ray-project#1912)
  [DataFrame] Eval fix (ray-project#1903)
  [tune] Polishing docs (ray-project#1846)
  [tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling (ray-project#1848)
  Preemptively push local arguments for actor tasks (ray-project#1901)
  [tune] Allow fetching pinned objects from trainable functions (ray-project#1895)
  Multithreading refactor for ObjectManager. (ray-project#1911)
  Add slice functionality (ray-project#1832)
  [DataFrame] Pass read_csv kwargs to _infer_column (ray-project#1894)
  Addresses missed comments from multichunk object transfer PR. (ray-project#1908)
  Allow numpy arrays to be passed by value into tasks (and inlined in the task spec). (ray-project#1816)
  [xray] Lineage cache requests notifications from the GCS about remote tasks (ray-project#1834)
  Fix UI issue for non-json-serializable task arguments. (ray-project#1892)
  Remove unnecessary calls to .hex() for object IDs. (ray-project#1910)
  Allow multiple raylets to be started on a single machine. (ray-project#1904)

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/ray/rllib/dqn/dqn.py