-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core][usability] Disambiguate ObjectLostErrors for better understandability #18292
[core][usability] Disambiguate ObjectLostErrors for better understandability #18292
Conversation
The docstrings are pretty self-explanatory (I hope), but I should probably update the docs around error handling also, might do this in a follow-up PR. Also, I'm planning on keeping the ObjectLostError and having all other errors subclass that, for backwards compatibility. But for now, I've defined a generic ObjectUnreachableError as the superclass so we can catch any broken tests in the CI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM. Have one comment regarding how to handle already-evicted objects for OBOD
location_info, object_id, | ||
/*location_lookup_failed*/ !location_info.ref_removed()); | ||
if (location_info.ref_removed()) { | ||
mark_as_failed_(object_id, rpc::ErrorType::OBJECT_RELEASED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is probably not necessary? This APIs are used to pull new objects, and I feel like this error should be already propagated by the higher level layer (in the future resolution)? I also have seen this this happens A LOT, and kind of worried that it will create so many synchronous calls to plasma store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you happen to find when this is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do want to throw the error because it can be a real reference counting issue. But I agree we shouldn't log WARNINGs.
@@ -1151,9 +1151,13 @@ Status ReferenceCounter::FillObjectInformation( | |||
absl::MutexLock lock(&mutex_); | |||
auto it = object_id_refs_.find(object_id); | |||
if (it == object_id_refs_.end()) { | |||
return Status::ObjectNotFound("Object " + object_id.Hex() + " not found"); | |||
RAY_LOG(WARNING) << "Object locations requested for " << object_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as below, but I've seen this happens a lot I think because this is a common scenario;
- OBOD send a request
- Object evicted
- OBOD request arrived
- Then it is no-op
I think we are handling this in the low level layer (object manager)? so my impression is having WARNING logs can have misleading information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I don't understand why this is happening a lot? We shouldn't be evicting the object until refs are out of scope, so in the common case, OBOD requests should always succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this isn't a WARN, it's an expected race condition (DEBUG).
The reason it can always happen is if OBOD RPCs arrive late, or stale object fetches in pull manager. In general, we should avoid these kind of spammy warning messages when the system can handle stale/outdated requests, it's bitten us several times in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High level request to make the error messages follow the style guidelines: https://spark.apache.org/error-message-guidelines.html
python/ray/exceptions.py
Outdated
"""Indicates that an object has been lost due to node failure. | ||
# TODO(XXX): Replace with ObjectLostError for backwards compatibility once all | ||
# Python tests pass. | ||
class ObjectUnreachableError(RayError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class ObjectUnreachableError(RayError): | |
class ObjectLostError(RayError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll switch this back later, just putting this here for now so that I can see which Python tests would fail right now.
python/ray/exceptions.py
Outdated
"If you did not receive a message about a worker node " | ||
"dying, this is likely a system-level bug. " | ||
"Please file an issue on GitHub at " | ||
"https://github.com/ray-project/ray/issues/new/choose.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following https://spark.apache.org/error-message-guidelines.html, this can be rewritten as
"All copies of {obj} have been lost due to node failure. Check cluster logs for more information about the failure." (be direct).
python/ray/exceptions.py
Outdated
"ObjectRef's owner can be updated. For example, suppose we " | ||
"call x_ref = foo.remote(...), pass [x_ref] to an actor A, and " | ||
"then actor A passes x_ref to actor B by calling " | ||
"B.bar.remote(x_ref). In this case, the driver may release x " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a P0 bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a long-living edge case we didn't handle. I remember I saw this from Stephanie's design doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's an existing corner case.
python/ray/exceptions.py
Outdated
|
||
def __str__(self): | ||
return super().__str__() + "\n\n" + ( | ||
f"Object {self.object_ref_hex} has already been released.\n\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReferenceCountingAssertionFailure: Attempted to retrieve an already-deleted object. This should not happen.
python/ray/exceptions.py
Outdated
def __str__(self): | ||
return super().__str__() + "\n\n" + ( | ||
f"Attempted lineage reconstruction to recover object " | ||
"{self.object_ref_hex}, but recovery failed. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The object could not be reconstructed since the max number of retries was exceeded.
else: | ||
msg += ( | ||
" To see information about where this ObjectRef was created " | ||
"To see information about where this ObjectRef was created " | ||
"in Python, set the environment variable " | ||
"RAY_record_ref_creation_sites=1 during `ray start` and " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to set this for driver. Why don't we write a doc and make a link instead? (I think it is confusing for users anyway when they just read it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a doc, but left it out of the error message for now. I think it's best if we try to make the error messages standalone.
python/ray/exceptions.py
Outdated
return super().__str__() + "\n\n" + ( | ||
f"All copies of {self.object_ref_hex} are lost " | ||
"due to node failure.\n\n" | ||
"If you did not receive a message about a worker node " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought, but I think we should write a debugging guideline doc and make a link here
python/ray/exceptions.py
Outdated
"due to node failure.\n\n" | ||
"If you did not receive a message about a worker node " | ||
"dying, this is likely a system-level bug. " | ||
"Please file an issue on GitHub at " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer to remove "file an issue" logs. I think if I am a newbie user, I'd just think it is that Ray is unstable (since the message mentions the system level bug).
python/ray/exceptions.py
Outdated
"ObjectRef's owner can be updated. For example, suppose we " | ||
"call x_ref = foo.remote(...), pass [x_ref] to an actor A, and " | ||
"then actor A passes x_ref to actor B by calling " | ||
"B.bar.remote(x_ref). In this case, the driver may release x " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a long-living edge case we didn't handle. I remember I saw this from Stephanie's design doc?
python/ray/exceptions.py
Outdated
return super().__str__() + "\n\n" + ( | ||
f"Object {self.object_ref_hex} cannot be retrieved because " | ||
"the Python worker that first created the ObjectRef (via " | ||
"`.remote()` or `ray.put()`) has exited. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python worker that first created the ObjectRef (owner) via ".remote"...
so that users would understand the meaning of Owner (since it is mentioned in the Exception name)
python/ray/exceptions.py
Outdated
"the Python worker that first created the ObjectRef (via " | ||
"`.remote()` or `ray.put()`) has exited. " | ||
"This can happen because of node failure or " | ||
"a system-level bug.\n\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should rather add a link to simple & easily explained ownership model doc instead?
Updated error messages, added a doc section on error types, and added owner logs info for the OwnerDiedError. |
doc/source/troubleshooting.rst
Outdated
task can be retried up to 3 times and an actor task cannot be retried. | ||
This can be overridden with the ``max_retries`` parameter for remote | ||
functions and the ``max_task_retries`` parameter for actors. | ||
- ``ReferenceCountingAssertionFailure``: The object has already been deleted, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we link to a github issue here? I don't think we should document known bugs in the docs.
object_refs[i].call_site())) | ||
result.append(ObjectRef( | ||
object_refs[i].object_id(), | ||
object_refs[i].owner_address().SerializeAsString(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this add extra overheads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call takes about 1-2us, total .remote()
time is ~200us. I think this is okay, but I can run microbenchmarks if necessary.
python/ray/exceptions.py
Outdated
"more information about the failure.") | ||
|
||
|
||
class ObjectDeletedError(ObjectUnreachableError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to assertion failure class?
Could we guard this under the same flag as Mingwei's? It can add overhead
in the case he was benchmarking.
…On Wed, Sep 8, 2021, 4:56 PM Stephanie Wang ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In python/ray/_raylet.pyx
<#18292 (comment)>:
> @@ -186,8 +186,10 @@ cdef RayObjectsToDataMetadataPairs(
cdef VectorToObjectRefs(const c_vector[CObjectReference] &object_refs):
result = []
for i in range(object_refs.size()):
- result.append(ObjectRef(object_refs[i].object_id(),
- object_refs[i].call_site()))
+ result.append(ObjectRef(
+ object_refs[i].object_id(),
+ object_refs[i].owner_address().SerializeAsString(),
This call takes about 1-2us, total .remote() time is ~200us. I think this
is okay, but I can run microbenchmarks if necessary.
—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
<#18292 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAADUSWCSEQA4XIASVG3IIDUA7ZZTANCNFSM5DHWUYNA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
You mean just the owner address part, right? Yeah I can do that, although I think it's unlikely to change anything. We already have to pass around the owner address when serializing/deserializing ObjectRefs, so this isn't adding any new data. |
I see, so this is just exposing it to Python. I guess the overhead does
seem very minimal then.
On Wed, Sep 8, 2021, 5:11 PM Stephanie Wang ***@***.***>
wrote:
… Could we guard this under the same flag as Mingwei's? It can add overhead
in the case he was benchmarking.
… <#m_-4787273352227793661_>
On Wed, Sep 8, 2021, 4:56 PM Stephanie Wang *@*.*> wrote: @.** commented
on this pull request. ------------------------------ In
python/ray/_raylet.pyx <#18292 (comment)
<#18292 (comment)>>: >
@@ -186,8 +186,10 @@ cdef RayObjectsToDataMetadataPairs( cdef
VectorToObjectRefs(const c_vector[CObjectReference] &object_refs): result =
[] for i in range(object_refs.size()): -
result.append(ObjectRef(object_refs[i].object_id(), -
object_refs[i].call_site())) + result.append(ObjectRef( +
object_refs[i].object_id(), +
object_refs[i].owner_address().SerializeAsString(), This call takes about
1-2us, total .remote() time is ~200us. I think this is okay, but I can run
microbenchmarks if necessary. — You are receiving this because you were
assigned. Reply to this email directly, view it on GitHub <#18292
(comment)
<#18292 (comment)>>,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAADUSWCSEQA4XIASVG3IIDUA7ZZTANCNFSM5DHWUYNA
. Triage notifications on the go with GitHub Mobile for iOS
https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675
or Android
https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub
.
You mean just the owner address part, right? Yeah I can do that, although
I think it's unlikely to change anything. We already have to pass around
the owner address when serializing/deserializing ObjectRefs, so this isn't
adding any new data.
—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
<#18292 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAADUSTXLUFJ2BF5P67W3ULUA73RLANCNFSM5DHWUYNA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
LGTM once tests pass |
python/ray/exceptions.py
Outdated
"more information about the failure.") | ||
|
||
|
||
class ObjectDeletedError(ObjectLostError, AssertionError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class ObjectDeletedError(ObjectLostError, AssertionError): | |
class RefCountAssertionError(ObjectLostError, AssertionError): |
buildkite/ray-builders-pr/setting-up-gpu-bootstrap-env-docker-tv looks unrelated. |
Why are these changes needed?
This separates the ObjectLostError into several different errors, all of which result in the object being unreachable:
Related issue number
Closes #14580.
Checks
scripts/format.sh
to lint the changes in this PR.