Fix infinite retry in Push function. #2133
Conversation
Test FAILed.
Thanks for the fixes, this looks good to me. It'd be great if @elibol or @stephanie-wang could comment on the retry design.
if (retry < 0) {
  retry = config_.max_retries;
} else if (retry == 0) {
  RAY_LOG(ERROR) << "Invalid Push request ObjectID: " << object_id.hex()
Remove `.hex()`; just `object_id` is fine because it automatically converts to hex.
}

ray::Status ObjectManager::Push(const ObjectID &object_id, const ClientID &client_id) {
ray::Status ObjectManager::Push(const ObjectID &object_id, const ClientID &client_id,
@elibol, it looks like what will typically happen is that all `max_retries` `Push` calls will happen in quick succession (with essentially no delay between them unless the manager is busy with other events), is that right? Is that how we want to do this? An alternative approach is to keep a list of object/client pairs that we want to push and then do any push events that can be done when a new object becomes available.
Not suggesting that for this PR.
Yes, the pushes will happen in quick succession. The `post` call will queue the retry attempt on the main thread. Yes, I think the more long-term solution is to maintain a mapping from one object to many clients, and push to those clients when those objects become available locally. This mapping would be periodically cleared.
src/common/state/ray_config.h
Outdated
  return object_manager_max_receives_;
}

int object_manager_max_retries() const { return object_manager_max_retries_; }
I think this should be more specific, e.g., `object_manager_max_retries` -> `object_manager_max_push_retries`.
      connection_pool_() {
  RAY_CHECK(config_.max_sends > 0);
  RAY_CHECK(config_.max_receives > 0);
  RAY_CHECK(config_.max_retries > 0);
`max_retries` -> `max_push_retries`
jenkins retest this please
Instead of busy retrying without delay, I think it might make more sense to add the ID to a waiting list and re-trigger the push when the object is added (in `NotifyDirectoryObjectAdd`).

Also, I'm more curious about what could make this situation happen. The only reason I can think of is that the object was evicted after Pull and before Push (please correct me if this is wrong). In this case, we'll also need to reconstruct the object before retrying.
Test PASSed.
Test FAILed.
@raulchen yes, this is rare. Your proposed solution sounds right. We would also want to periodically clear the (object_id, client_id) wait list. But this may make sense coupled with a change to the code at task forwarding time (in the node manager): if we push task dependencies (the objects) to the node to which a task is forwarded without checking whether the object is local, some of those objects may never be local on the node that is forwarding the task.

Additionally, with such changes to task forwarding, some objects may appear locally after the call to push: if we invoke an actor method, the task may be forwarded before the object store notification is received by the node manager. We currently check that an object is local before pushing, which sometimes fails to push objects for which an add notification hasn't yet been received. See https://github.com/ray-project/ray/blob/master/src/ray/raylet/node_manager.cc#L821.
These changes cap the number of retry attempts for a particular push. A longer-term solution is to use notifications from the object store instead of repeatedly invoking push. Thanks for your contribution! I will approve this PR once the changes I've requested are made, and merge once tests are passing.

@guoyuhong if you're interested, I would also be happy to review a subsequent PR that improves on your solution by using object store notifications (e.g. by adding a check for `ObjectID` in a map from one `object_id` to many `client_id`s in the object manager handler for `SubscribeObjAdded`).
src/common/state/ray_config.h
Outdated
/// Maximum number of concurrent receives allowed by the object manager.
int object_manager_max_receives_;

/// Maximum push retry times allowed by the object manager.
Change to "Maximum push retries allowed by the object manager."
/// Push an object to the node manager on the node corresponding to client id.
///
/// \param object_id The object's object id.
/// \param client_id The remote node's client id.
Document `retry`.
That would be interesting. In my observation of the object_manager_stress_test, this retry mechanism is triggered, so a long-term solution is necessary.

By my understanding, we can first check the object_id in local_objects_ in the Push function. If it exists, the Push function continues. Otherwise, we can add this request to a map (as you described), and in NotifyDirectoryObjectAdd, the Push request will be re-triggered. Is that correct?
This is correct.
Test FAILed.
@robertnishihara Hi Robert, could you please re-trigger the Jenkins test? Only one test passed in the previous run; the rest failed due to timeout. Is that expected?
jenkins, retest this please
Test PASSed.
* master:
  [autoscaler] GCP node provider (ray-project#2061)
  [xray] Evict tasks from the lineage cache (ray-project#2152)
  [ASV] Add ray.init and simple Ray benchmarks (ray-project#2166)
  Re-encrypt key for uploading to S3 from travis to use travis-ci.com. (ray-project#2169)
  [rllib] Fix A3C PyTorch implementation (ray-project#2036)
  [JavaWorker] Do not kill local-scheduler-forked workers in RunManager.cleanup (ray-project#2151)
  Update Travis CI badge from travis-ci.org to travis-ci.com. (ray-project#2155)
  Implement Python global state API for xray. (ray-project#2125)
  [xray] Improve flush algorithm for the lineage cache (ray-project#2130)
  Fix support for actor classmethods (ray-project#2146)
  Add empty df test (ray-project#1879)
  [JavaWorker] Enable java worker support (ray-project#2094)
  [DataFrame] Fixing the code formatting of the tests (ray-project#2123)
  Update resource documentation (remove outdated limitations). (ray-project#2022)
  bugfix: use array redis_primary_addr out of its scope (ray-project#2139)
  Fix infinite retry in Push function. (ray-project#2133)
  [JavaWorker] Changes to the directory under src for support java worker (ray-project#2093)
  Integrate credis with Ray & route task table entries into credis. (ray-project#1841)
Hello!
This is Yuhong Guo from Ant Financial. I'm trying to make a small code change to go through the full code review process.
What do these changes do?
Related issue number
N/A