Conversation

@guoyuhong
Contributor

Hello!
This is Yuhong Guo from Ant Financial. I'm making a small code change to go through the full code review process.

What do these changes do?

  • Remove the infinite retry in the Push function.
  • Fix some comment mismatches.

Related issue number

N/A

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5613/

@robertnishihara left a comment
Collaborator

Thanks for the fixes, this looks good to me. It'd be great if @elibol or @stephanie-wang could comment on the retry design.

if (retry < 0) {
retry = config_.max_retries;
} else if (retry == 0) {
RAY_LOG(ERROR) << "Invalid Push request ObjectID: " << object_id.hex()
Collaborator

remove .hex(), just object_id is fine because it automatically converts to hex
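
For illustration, a minimal sketch of the suggested form (the remainder of the original log message is truncated in the diff above, so this is not the exact line from the PR):

// Rely on ObjectID's stream operator, which prints the hex representation.
RAY_LOG(ERROR) << "Invalid Push request ObjectID: " << object_id;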

}

ray::Status ObjectManager::Push(const ObjectID &object_id, const ClientID &client_id) {
ray::Status ObjectManager::Push(const ObjectID &object_id, const ClientID &client_id,
Collaborator

@elibol, it looks like all max_retries Push calls will typically happen in quick succession (with essentially no delay between them unless the manager is busy with other events). Is that right, and is that how we want to do this? An alternative approach is to keep a list of object/client pairs that we want to push, and then perform any pushes that can be done when a new object becomes available.

Not suggesting that for this PR.

Contributor

Yes, the pushes will happen in quick succession; the post call queues each retry attempt on the main thread. I think the longer-term solution is to maintain a mapping from one object to many clients, and push to those clients when the objects become available locally. This mapping would be periodically cleared.
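
To make that mechanism concrete, here is a minimal standalone sketch (not the actual Ray code; the boost::asio event loop usage, the local-object check, and names such as retries_left are illustrative assumptions) showing how each failed attempt can be posted back onto the main event loop with the retry budget decremented:

#include <boost/asio.hpp>
#include <iostream>
#include <string>

// Hypothetical stand-in for ObjectManager::Push with a capped retry budget.
void Push(boost::asio::io_service &main_service, const std::string &object_id,
          const std::string &client_id, int retries_left) {
  const bool object_is_local = false;  // placeholder for the local-object lookup
  if (object_is_local) {
    // ... start the actual transfer to the remote node manager ...
    return;
  }
  if (retries_left <= 0) {
    std::cerr << "Invalid Push request ObjectID: " << object_id << std::endl;
    return;
  }
  // Queue another attempt; it runs on the next iteration of the event loop,
  // so the retries drain back-to-back unless other handlers are interleaved.
  main_service.post([&main_service, object_id, client_id, retries_left]() {
    Push(main_service, object_id, client_id, retries_left - 1);
  });
}

int main() {
  boost::asio::io_service main_service;
  Push(main_service, "abc123", "client-1", /*retries_left=*/3);
  main_service.run();  // drains the queued retry attempts, then returns
  return 0;
}

Because each retry is simply re-posted, all attempts happen in quick succession, which matches the behavior described above.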

return object_manager_max_receives_;
}

int object_manager_max_retries() const { return object_manager_max_retries_; }
Collaborator

I think this should be more specific, e.g., object_manager_max_retries -> object_manager_max_push_retries

connection_pool_() {
RAY_CHECK(config_.max_sends > 0);
RAY_CHECK(config_.max_receives > 0);
RAY_CHECK(config_.max_retries > 0);
Collaborator

max_retries -> max_push_retries

@robertnishihara
Collaborator

jenkins retest this please

@raulchen
Contributor

Instead of busy-retrying without delay, I think it might make more sense to add the ID to a waiting list and re-trigger the push when the object is added (in the NotifyDirectoryObjectAdd function).

Also, I'm curious about what could make this situation happen in the first place. The only reason I can think of is that the object was evicted after Pull and before Push (please correct me if this is wrong). In that case, we'll also need to reconstruct the object before retrying.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5616/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5619/

@elibol
Contributor

elibol commented May 24, 2018

@raulchen yes, this is rare. Your proposed solution sounds right. We would also want to periodically clear the (object_id, client_id) wait list.

But this may make sense coupled with a change to the code at task forwarding time (in the node manager): If we push task dependencies (the objects) to the node to which a task is forwarded without checking whether the object is local, some of those objects may never be local on the node that is forwarding the task. Additionally, with such changes to task forwarding, some objects may appear locally after the call to push: If we invoke an actor method, the task may be forwarded before the object store notification is received by the node manager. We currently check that an object is local before pushing, which sometimes fails to push objects for which an add notification hasn't yet been received. See https://github.com/ray-project/ray/blob/master/src/ray/raylet/node_manager.cc#L821.

@elibol previously requested changes May 24, 2018
Contributor

These changes cap the number of retry attempts for a particular push. A longer term solution is to use notifications from the object store instead of repeatedly invoking push. Thanks for your contribution! I will approve this PR once the changes I've requested are made, and merge once tests are passing.

@guoyuhong if you're interested, I would also be happy to review a subsequent PR that improves on your solution by using object store notifications (e.g. by adding a check for ObjectID in a map from one object_id to many client_ids in the object manager handler for SubscribeObjAdded).

/// Maximum number of concurrent receives allowed by the object manager.
int object_manager_max_receives_;

/// Maximum push retry times allowed by the object manager.
Contributor

Change to "Maximum push retries allowed by the object manager."

/// Push an object to to the node manager on the node corresponding to client id.
///
/// \param object_id The object's object id.
/// \param client_id The remote node's client id.
Contributor

Document retry.
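
For illustration only, one hedged way the requested documentation might read, based on the retry handling shown earlier in this thread (the exact wording is an assumption, not the text that was merged):

/// \param object_id The object's object id.
/// \param client_id The remote node's client id.
/// \param retry The number of remaining retry attempts; a negative value
///              means the configured maximum number of retries is used.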

Contributor Author

That would be interesting. From my observation of the object_manager_stress_test, this retry mechanism does get triggered, so a long-term solution is necessary.

As I understand it, we can first check whether the object_id is in local_objects_ in the Push function. If it is, the Push function continues. Otherwise, we add the request to a map (as you described), and the Push request is re-triggered in NotifyDirectoryObjectAdd. Is that correct?
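
To make the idea concrete, here is an illustrative outline (a rough sketch under the assumptions above, not the code that was merged; aside from local_objects_ and NotifyDirectoryObjectAdd, the names are hypothetical):

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using ObjectID = std::string;  // stand-ins for the real Ray ID types
using ClientID = std::string;

class ObjectManagerSketch {
 public:
  // Push either starts the transfer immediately or records the request
  // so that it can be replayed once the object becomes local.
  void Push(const ObjectID &object_id, const ClientID &client_id) {
    if (local_objects_.count(object_id) == 0) {
      unfulfilled_push_requests_[object_id].push_back(client_id);
      return;
    }
    PushLocalObject(object_id, client_id);
  }

  // Invoked when the object store reports that an object was added locally.
  void NotifyDirectoryObjectAdd(const ObjectID &object_id) {
    local_objects_.insert(object_id);
    auto it = unfulfilled_push_requests_.find(object_id);
    if (it != unfulfilled_push_requests_.end()) {
      for (const ClientID &client_id : it->second) {
        PushLocalObject(object_id, client_id);
      }
      unfulfilled_push_requests_.erase(it);
    }
  }

 private:
  void PushLocalObject(const ObjectID & /*object_id*/, const ClientID & /*client_id*/) {
    // ... start the actual transfer to the remote node manager ...
  }

  std::unordered_set<ObjectID> local_objects_;
  std::unordered_map<ObjectID, std::vector<ClientID>> unfulfilled_push_requests_;
};

As noted above, entries in such a map would also need to be cleared periodically so that requests for objects that never become local do not accumulate.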

Contributor Author

After talking to @raulchen and @eric-jj, I've decided to change the code to the waiting-list version.

Contributor

This is correct.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5628/

@guoyuhong
Contributor Author

@robertnishihara Hi Robert, could you please re-trigger the Jenkins test? Only one of the previous test runs passed; the rest failed due to timeouts. Is that expected?

@robertnishihara
Collaborator

jenkins, retest this please

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5631/

@robertnishihara dismissed elibol’s stale review May 25, 2018 08:16

comments addressed

@robertnishihara merged commit a8517cc into ray-project:master May 25, 2018
@guoyuhong deleted the FixInfiniteRetry branch May 27, 2018 03:25
alok added a commit to alok/ray that referenced this pull request Jun 3, 2018
* master:
  [autoscaler] GCP node provider (ray-project#2061)
  [xray] Evict tasks from the lineage cache (ray-project#2152)
  [ASV] Add ray.init and simple Ray benchmarks (ray-project#2166)
  Re-encrypt key for uploading to S3 from travis to use travis-ci.com. (ray-project#2169)
  [rllib] Fix A3C PyTorch implementation (ray-project#2036)
  [JavaWorker] Do not kill local-scheduler-forked workers in RunManager.cleanup (ray-project#2151)
  Update Travis CI badge from travis-ci.org to travis-ci.com. (ray-project#2155)
  Implement Python global state API for xray. (ray-project#2125)
  [xray] Improve flush algorithm for the lineage cache (ray-project#2130)
  Fix support for actor classmethods (ray-project#2146)
  Add empty df test (ray-project#1879)
  [JavaWorker] Enable java worker support (ray-project#2094)
  [DataFrame] Fixing the code formatting of the tests (ray-project#2123)
  Update resource documentation (remove outdated limitations). (ray-project#2022)
  bugfix: use array redis_primary_addr out of its scope (ray-project#2139)
  Fix infinite retry in Push function. (ray-project#2133)
  [JavaWorker] Changes to the directory under src for support java worker (ray-project#2093)
  Integrate credis with Ray & route task table entries into credis. (ray-project#1841)