
[xray] Object manager retries Pull requests #2630

Merged: 10 commits merged into ray-project:master from object-manager-retry on Aug 14, 2018

Conversation

stephanie-wang
Contributor

What do these changes do?

This modifies the object manager to keep track of Pull requests and retry ones that have not succeeded within a certain timeout.

For every Pull request, the object manager will subscribe to the object's locations with the GCS. On each notification from the GCS, the object manager will reset its list of known locations for the object.

Once the object manager has a list of known locations, it tries sending a Pull request to each client in turn. After a client is tried, a timer is set. If the object is received, or the request is Canceled, within the timeout, the timer is canceled and the Pull request is cleaned up. Otherwise, the next client in the list is tried, and so on.

- suppress duplicate Pulls
- retry the Pull at the next client after a timeout
- cancel a Pull if the object no longer appears on any clients
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7404/

@robertnishihara robertnishihara requested a review from ujvl August 10, 2018 07:03
@robertnishihara
Collaborator

cc @heyucongtom

Contributor

@elibol elibol left a comment


I think this looks good (contingent on answers to some of my questions). The existing functionality is preserved while providing robustness to component failure, and suppression of multiple pull requests to the same object. Note that this does not imply suppression of multiple transfers of the same object, as this is still possible under certain circumstances.

Consider reverting the reorganization of the header files to make it easier for others to review. I had done this quite a while ago (just needs to be approved): #2251

My understanding of the changes follows.

A Pull will obtain multiple locations from which to request an object. Once Pull has been invoked for an object id, subsequent calls with the same object id are suppressed. The Pull opens a subscription to object locations, which is maintained until one of two things occurs. (1) A notification from the object store is received indicating the object is local. The handler now invokes Cancel on all objects that appear locally, which includes those for which a Pull request is open. (2) The Pull is cancelled by another entity.

Other points worth noting:

  1. The SubscribeObjectLocations handler for Pull appears to be correct.
  2. TryPull will fail if it's invoked when no client ids are left in the client locations vector. This is currently difficult to reason about and could arguably be simplified by returning if client locations is empty. The correctness of this depends on whether Cancel implicitly cancels active asio timers associated with the cancelled object.
  3. Whenever Cancel is invoked, it does nothing if a Pull request is not active for the given object id, and properly terminates an active request when one exists. This appears to be the case, although the behavior in its current state is not easy to reason about since timers are not explicitly cancelled: It's unclear whether asio timers are cancelled when their unique pointers are no longer referenced.

}

pull_requests_.emplace(object_id, PullRequest());
return object_directory_->SubscribeObjectLocations(
object_directory_pull_callback_id_, object_id,
[this](const std::vector<ClientID> &client_ids, const ObjectID &object_id) {
Contributor


Can you add a comment here indicating that, so long as the subscription is not cancelled, this method will be invoked whenever the vector of client ids on which the object is available changes?

Contributor Author


Yup.

// The timer was set if there were more clients to try. We record this
// now so that we can decide whether we should reset the timer when
// trying the new client locations.
bool timer_was_set = !it->second.client_locations.empty();
Contributor


Consider using an explicit variable to indicate whether the timer has been set. This will make the code easier to read, and will also avoid conflating the semantics of the client locations size with whether the timer has been set.

Contributor Author


Sounds good.


RAY_CHECK_OK(object_directory_->UnsubscribeObjectLocations(
object_directory_pull_callback_id_, object_id));
pull_requests_.erase(it);
Contributor


If I understand correctly, we don't cancel running timers here because TryPull properly handles the case of a timer firing for a cancelled pull request. If this is true, it might be worth noting here to avoid confusion about what is expected when pull requests are cancelled.

Some questions:

  1. The timer may have been instantiated but not running; in this case does calling cancel throw an exception?
  2. When the timer is running and no longer referenced from its unique pointer, does it cancel itself?

Contributor


Consider this scenario:

  1. Pull is invoked on object A. Multiple object locations are discovered, so a timer is set after a request from the first object location is made.
  2. Cancel is invoked on A.
  3. Pull is again invoked on A.

If the timer does not cancel itself when its unique pointer is no longer referenced, then the timer will trigger TryPull. This is a timer other than the timer created by the second pull request for A. If it's invoked before a response from SubscribeObjectLocations is received, then TryPull will fail (client locations will be empty).

Contributor Author


The timer destructor will cancel any remaining handlers properly (see my comment on the PR thread).

// NOTE(swang): Since we are overwriting the previous list of clients,
// we may end up sending a duplicate request to the same client as
// before.
it->second.client_locations = client_ids;
Contributor


A quick fix to reduce the likelihood of consecutively requesting an object from the same client is to shuffle this vector when its length is greater than 1.

Contributor Author


I think this is unnecessary since the client IDs are built from an unordered_set, so they may already be slightly shuffled. Also, we pop from the back of the list instead of the front in TryPull.

ray::Status status = object_directory_->UnsubscribeObjectLocations(
object_directory_pull_callback_id_, object_id);
return status;
void ObjectManager::Cancel(const ObjectID &object_id) {
Contributor


Consider changing the name of this method to CancelPull.

Contributor Author


Sounds good.

it->second.retry_timer->cancel();
}
} else {
// New object locations were found.
Contributor


Consider using a bool to indicate whether this is the first invocation of this handler.

Contributor Author


Can you elaborate on where the bool would be set and what it would be used for?

if (timer_was_set) {
// Cancel the timer if we were waiting to try the next client.
RAY_CHECK(it->second.retry_timer != nullptr);
it->second.retry_timer->cancel();
Contributor


If invoking cancel when a timer is not running does nothing, then consider invoking cancel whenever the timer has been instantiated and the client locations list is empty.

Contributor Author


Sounds good.

@stephanie-wang
Contributor Author

Thanks @melih. Your understanding of the code seems right to me.

  1. I generally prefer a stronger spec for individual functions, and for things to fail loudly instead of silently. If TryPull is called with an empty client list, I would rather it crash so that we can fix the bug.
  2. The boost documentation for the timer destructor says: "This function destroys the timer, cancelling any outstanding asynchronous wait operations associated with the timer as if by calling cancel."

Contributor

@elibol elibol left a comment


LGTM.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7408/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7414/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7432/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7437/

@pcmoritz pcmoritz merged commit 806fdf2 into ray-project:master Aug 14, 2018
@pcmoritz pcmoritz deleted the object-manager-retry branch August 14, 2018 02:16