[core][usability] Disambiguate ObjectLostErrors for better understandability #18292

stephanie-wang · 2021-09-02T00:25:28Z

Why are these changes needed?

This separates the ObjectLostError into several different errors, all of which result in the object being unreachable:

ObjectLostError: object has no locations in distributed memory due to node failure
OwnerDiedError: object's owner died
ObjectReleasedError: object's owner is alive, but the object was already released, probably due to a reference counting issue.
ObjectReconstructionFailedError: only thrown if lineage reconstruction is enabled, and an object (or its dependency) failed to be reconstructed.

Related issue number

Closes #14580.

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(


stephanie-wang · 2021-09-02T00:26:48Z

The docstrings are pretty self-explanatory (I hope), but I should probably update the docs around error handling also, might do this in a follow-up PR.

Also, I'm planning on keeping the ObjectLostError and having all other errors subclass that, for backwards compatibility. But for now, I've defined a generic ObjectUnreachableError as the superclass so we can catch any broken tests in the CI.
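The backwards-compatibility plan described above can be sketched as follows. This is a minimal, self-contained illustration, not the code in python/ray/exceptions.py: `RayError` is a stand-in base class, the class bodies are empty, `legacy_get` and `failing_fetch` are hypothetical helpers, and the class names are the ones listed in the PR description.

```python
class RayError(Exception):
    """Stand-in for Ray's base exception (an assumption for this sketch)."""


class ObjectLostError(RayError):
    """Object has no locations left in distributed memory."""


class OwnerDiedError(ObjectLostError):
    """The object's owner died."""


class ObjectReleasedError(ObjectLostError):
    """The owner is alive, but the object was already released."""


class ObjectReconstructionFailedError(ObjectLostError):
    """Lineage reconstruction was attempted and failed."""


def legacy_get(fetch):
    """Hypothetical old-style caller that only knows about ObjectLostError."""
    try:
        return fetch()
    except ObjectLostError as e:
        # Subclasses are caught here too, so existing handlers keep working.
        return f"unreachable: {type(e).__name__}"


def failing_fetch():
    # Hypothetical fetch that fails because the owner exited.
    raise OwnerDiedError("owner exited")


print(legacy_get(failing_fetch))  # prints "unreachable: OwnerDiedError"
```

Callers that want finer-grained handling can add `except OwnerDiedError` clauses ahead of the generic one.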

rkooo567

Mostly LGTM. I have one comment regarding how to handle already-evicted objects in the OBOD (ownership-based object directory).

rkooo567 · 2021-09-02T05:32:45Z

src/ray/object_manager/ownership_based_object_directory.cc

+          location_info, object_id,
+          /*location_lookup_failed*/ !location_info.ref_removed());
+      if (location_info.ref_removed()) {
+        mark_as_failed_(object_id, rpc::ErrorType::OBJECT_RELEASED);


I guess this is probably not necessary? These APIs are used to pull new objects, and I feel like this error should already be propagated by the higher-level layer (in the future resolution)? I have also seen this happen A LOT, and I am kind of worried that it will create many synchronous calls to the plasma store.

Did you happen to find when this is needed?

I think we do want to throw the error because it can be a real reference counting issue. But I agree we shouldn't log WARNINGs.

rkooo567 · 2021-09-02T05:36:50Z

src/ray/core_worker/reference_count.cc

@@ -1151,9 +1151,13 @@ Status ReferenceCounter::FillObjectInformation(
  absl::MutexLock lock(&mutex_);
  auto it = object_id_refs_.find(object_id);
  if (it == object_id_refs_.end()) {
-    return Status::ObjectNotFound("Object " + object_id.Hex() + " not found");
+    RAY_LOG(WARNING) << "Object locations requested for " << object_id


Same comment as below. I've seen this happen a lot, I think because it is a common scenario:

1. OBOD sends a request
2. The object is evicted
3. The OBOD request arrives
4. The request is then a no-op

I think we are already handling this in the lower-level layer (the object manager), so my impression is that WARNING logs here would be misleading.

Actually I don't understand why this is happening a lot? We shouldn't be evicting the object until refs are out of scope, so in the common case, OBOD requests should always succeed.

Yeah this isn't a WARN, it's an expected race condition (DEBUG).

It can always happen when OBOD RPCs arrive late or when the pull manager issues stale object fetches. In general, we should avoid this kind of spammy warning message when the system can handle stale or outdated requests; it has bitten us several times in the past.

Got it, thanks!
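The race discussed in this thread can be modeled roughly as follows. This is a hypothetical Python sketch, not the C++ in reference_count.cc: `RefTable`, its methods, and the logger name are invented for illustration. It shows a location lookup for an already-evicted object being treated as an expected no-op and logged at DEBUG rather than WARNING.

```python
import logging

logger = logging.getLogger("obod_sketch")  # hypothetical logger name


class RefTable:
    """Hypothetical stand-in for the owner's table of object locations."""

    def __init__(self):
        self._locations = {}

    def add(self, object_id, node_id):
        self._locations.setdefault(object_id, set()).add(node_id)

    def evict(self, object_id):
        # Called once all references to the object have gone out of scope.
        self._locations.pop(object_id, None)

    def get_locations(self, object_id):
        locations = self._locations.get(object_id)
        if locations is None:
            # Expected race: the request arrived after eviction. Log quietly
            # and let the caller treat the stale request as a no-op.
            logger.debug("Locations requested for %s, but ref already removed",
                         object_id)
            return None
        return set(locations)


table = RefTable()
table.add("obj1", "nodeA")
table.evict("obj1")                  # object evicted before the request arrives
print(table.get_locations("obj1"))   # prints None
```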

ericl

High level request to make the error messages follow the style guidelines: https://spark.apache.org/error-message-guidelines.html

ericl · 2021-09-02T18:49:46Z

python/ray/exceptions.py

-    """Indicates that an object has been lost due to node failure.
+# TODO(XXX): Replace with ObjectLostError for backwards compatibility once all
+# Python tests pass.
+class ObjectUnreachableError(RayError):


Suggested change

-class ObjectUnreachableError(RayError):
+class ObjectLostError(RayError):

I'll switch this back later, just putting this here for now so that I can see which Python tests would fail right now.

ericl · 2021-09-02T18:56:48Z

python/ray/exceptions.py

+            "If you did not receive a message about a worker node "
+            "dying, this is likely a system-level bug. "
+            "Please file an issue on GitHub at "
+            "https://github.com/ray-project/ray/issues/new/choose.")


Following https://spark.apache.org/error-message-guidelines.html, this can be rewritten as

"All copies of {obj} have been lost due to node failure. Check cluster logs for more information about the failure." (be direct).
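A direct message in this style could be produced by a `__str__` method along the following lines. This is only a sketch: `RayError` is a stand-in base class, and the constructor signature is an assumption based on the `self.object_ref_hex` attribute visible in the diff hunks.

```python
class RayError(Exception):
    """Stand-in for Ray's base exception (an assumption for this sketch)."""


class ObjectLostError(RayError):
    def __init__(self, object_ref_hex):
        super().__init__()
        # Hex ID of the lost object, interpolated into the message below.
        self.object_ref_hex = object_ref_hex

    def __str__(self):
        # State what happened first, then what the user can do next.
        return (
            f"All copies of {self.object_ref_hex} have been lost due to node "
            "failure. Check cluster logs for more information about the failure."
        )


print(ObjectLostError("abc123"))
```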

ericl · 2021-09-02T18:58:08Z

python/ray/exceptions.py

+            "ObjectRef's owner can be updated. For example, suppose we "
+            "call x_ref = foo.remote(...), pass [x_ref] to an actor A, and "
+            "then actor A passes x_ref to actor B by calling "
+            "B.bar.remote(x_ref). In this case, the driver may release x "


Is this a P0 bug?

I think this is a long-standing edge case that we haven't handled. I remember seeing it in Stephanie's design doc?

Yes it's an existing corner case.

ericl · 2021-09-02T18:59:39Z

python/ray/exceptions.py

+
+    def __str__(self):
+        return super().__str__() + "\n\n" + (
+            f"Object {self.object_ref_hex} has already been released.\n\n"


ReferenceCountingAssertionFailure: Attempted to retrieve an already-deleted object. This should not happen.

ericl · 2021-09-02T19:01:22Z

python/ray/exceptions.py

+    def __str__(self):
+        return super().__str__() + "\n\n" + (
+            "Attempted lineage reconstruction to recover object "
+            f"{self.object_ref_hex}, but recovery failed. "


The object could not be reconstructed since the max number of retries was exceeded.
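A note on multi-line messages like the ones in this file: Python concatenates adjacent string literals, but the `f` prefix applies to each literal separately, so any literal that contains a placeholder needs its own prefix. The sketch below uses a hypothetical `Demo` class to show the difference.

```python
class Demo:
    """Hypothetical class used only to illustrate f-string concatenation."""

    object_ref_hex = "abc123"

    def without_prefix(self):
        # The second literal has no f prefix, so its braces are kept verbatim.
        return (f"Attempted to recover object "
                "{self.object_ref_hex}, but recovery failed.")

    def with_prefix(self):
        # The literal containing the placeholder carries the f prefix.
        return ("Attempted to recover object "
                f"{self.object_ref_hex}, but recovery failed.")


print(Demo().without_prefix())
print(Demo().with_prefix())
```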

rkooo567 · 2021-09-03T06:20:57Z

python/ray/exceptions.py

        else:
            msg += (
-                " To see information about where this ObjectRef was created "
+                "To see information about where this ObjectRef was created "
                "in Python, set the environment variable "
                "RAY_record_ref_creation_sites=1 during `ray start` and "


We also need to set this for the driver. Why don't we write a doc and link to it instead? (I think the message is confusing for users anyway when they just read it.)

I added a doc, but left it out of the error message for now. I think it's best if we try to make the error messages standalone.

rkooo567 · 2021-09-03T06:24:15Z

python/ray/exceptions.py

+        return super().__str__() + "\n\n" + (
+            f"All copies of {self.object_ref_hex} are lost "
+            "due to node failure.\n\n"
+            "If you did not receive a message about a worker node "


Just a thought, but I think we should write a debugging guideline doc and make a link here

rkooo567 · 2021-09-03T06:24:55Z

python/ray/exceptions.py

+            "due to node failure.\n\n"
+            "If you did not receive a message about a worker node "
+            "dying, this is likely a system-level bug. "
+            "Please file an issue on GitHub at "


I'd prefer to remove the "file an issue" text. If I were a new user, I would just conclude that Ray is unstable (since the message mentions a system-level bug).

rkooo567 · 2021-09-03T06:26:11Z

python/ray/exceptions.py

+            "ObjectRef's owner can be updated. For example, suppose we "
+            "call x_ref = foo.remote(...), pass [x_ref] to an actor A, and "
+            "then actor A passes x_ref to actor B by calling "
+            "B.bar.remote(x_ref). In this case, the driver may release x "


I think this is a long-standing edge case that we haven't handled. I remember seeing it in Stephanie's design doc?

rkooo567 · 2021-09-03T06:27:13Z

python/ray/exceptions.py

+        return super().__str__() + "\n\n" + (
+            f"Object {self.object_ref_hex} cannot be retrieved because "
+            "the Python worker that first created the ObjectRef (via "
+            "`.remote()` or `ray.put()`) has exited. "


Suggest "the Python worker that first created the ObjectRef (the owner) via `.remote()`...", so that users understand what "owner" means (since it appears in the exception name).

rkooo567 · 2021-09-03T06:27:39Z

python/ray/exceptions.py

+            "the Python worker that first created the ObjectRef (via "
+            "`.remote()` or `ray.put()`) has exited. "
+            "This can happen because of node failure or "
+            "a system-level bug.\n\n"


I think we should instead add a link to a simple, easy-to-follow doc on the ownership model?

stephanie-wang · 2021-09-08T04:34:51Z

Updated error messages, added a doc section on error types, and added owner logs info for the OwnerDiedError.

ericl · 2021-09-08T22:32:17Z

doc/source/troubleshooting.rst

+  task can be retried up to 3 times and an actor task cannot be retried.
+  This can be overridden with the ``max_retries`` parameter for remote
+  functions and the ``max_task_retries`` parameter for actors.
+- ``ReferenceCountingAssertionFailure``: The object has already been deleted,


How about we link to a GitHub issue here? I don't think we should document known bugs in the docs.

ericl · 2021-09-08T22:32:58Z

python/ray/_raylet.pyx

-                                object_refs[i].call_site()))
+        result.append(ObjectRef(
+            object_refs[i].object_id(),
+            object_refs[i].owner_address().SerializeAsString(),


Does this add extra overhead?

This call takes about 1-2us, total .remote() time is ~200us. I think this is okay, but I can run microbenchmarks if necessary.
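As a quick back-of-the-envelope check of the figures quoted above (these are the numbers reported in the comment, not new measurements), the added call amounts to roughly 0.5% to 1% of the total `.remote()` time:

```python
extra_us = (1.0, 2.0)  # reported cost of the added call, in microseconds
remote_us = 200.0      # reported approximate total .remote() time

# Relative overhead of the added call for the low and high estimates.
fractions = [x / remote_us for x in extra_us]
print([f"{f:.1%}" for f in fractions])  # prints ['0.5%', '1.0%']
```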

ericl · 2021-09-08T22:33:38Z

python/ray/exceptions.py

+            "more information about the failure.")
+
+
+class ObjectDeletedError(ObjectUnreachableError):


Change to assertion failure class?

ericl · 2021-09-09T00:06:40Z

Could we guard this under the same flag as Mingwei's? It can add overhead in the case he was benchmarking.


stephanie-wang · 2021-09-09T00:10:50Z


You mean just the owner address part, right? Yeah I can do that, although I think it's unlikely to change anything. We already have to pass around the owner address when serializing/deserializing ObjectRefs, so this isn't adding any new data.

ericl · 2021-09-09T00:19:11Z

I see, so this is just exposing it to Python. I guess the overhead does seem very minimal then.


rkooo567 · 2021-09-09T07:45:04Z

LGTM once tests pass

ericl · 2021-09-09T19:11:25Z

python/ray/exceptions.py

+            "more information about the failure.")
+
+
+class ObjectDeletedError(ObjectLostError, AssertionError):


Suggested change

-class ObjectDeletedError(ObjectLostError, AssertionError):
+class RefCountAssertionError(ObjectLostError, AssertionError):
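The effect of inheriting from both `ObjectLostError` and `AssertionError` can be seen in a small standalone sketch. `RayError` and `ObjectLostError` are empty stand-ins here, and `caught_as` is a hypothetical helper; the point is only that a handler for either base class catches the new error.

```python
class RayError(Exception):
    """Stand-in for Ray's base exception (an assumption for this sketch)."""


class ObjectLostError(RayError):
    """Stand-in for the generic 'object unreachable' error."""


class RefCountAssertionError(ObjectLostError, AssertionError):
    """Raised when an already-deleted object is requested."""


def caught_as(exc_type):
    """Return True if RefCountAssertionError is caught by `except exc_type`."""
    try:
        raise RefCountAssertionError("object already deleted")
    except exc_type:
        return True


print(caught_as(ObjectLostError))  # prints True
print(caught_as(AssertionError))   # prints True
```

This keeps existing `except ObjectLostError` handlers working while also marking the error as an internal invariant violation.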


stephanie-wang · 2021-09-13T23:16:13Z

buildkite/ray-builders-pr/setting-up-gpu-bootstrap-env-docker-tv looks unrelated.

stephanie-wang added 9 commits August 30, 2021 17:50

Define error types, throw error for ObjectReleased

6cdcca8

x

50d761f

Disambiguate OBJECT_UNRECONSTRUCTABLE and OBJECT_LOST

e8d5f21

OwnerDiedError

8916315

fix test

d7d5b4c

Merge remote-tracking branch 'upstream/master' into disambiguate-obje…

8dd540d

…ct-lost

x

2c06021

ObjectReconstructionFailed

16b1af5

ObjectReconstructionFailed

506d1fb

stephanie-wang requested review from AmeerHajAli, ericl, pcmoritz, robertnishihara and wuisawesome as code owners September 2, 2021 00:25

stephanie-wang assigned ericl Sep 2, 2021

rkooo567 self-assigned this Sep 2, 2021

x

bd79ca6

rkooo567 reviewed Sep 2, 2021


ericl requested changes Sep 2, 2021


ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 2, 2021

rkooo567 reviewed Sep 3, 2021


stephanie-wang added 2 commits September 6, 2021 21:51

x

ed861be

print owner addr

5a5fbf1

stephanie-wang requested a review from raulchen as a code owner September 8, 2021 04:32

stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 8, 2021

ericl reviewed Sep 8, 2021


ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 8, 2021

str

f493524

doc

33e0187

stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 9, 2021

rkooo567 approved these changes Sep 9, 2021


ericl requested changes Sep 9, 2021


ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 9, 2021

stephanie-wang added 2 commits September 9, 2021 16:25

rename

1c0922c

x

af61dab

ericl approved these changes Sep 11, 2021


Merge remote-tracking branch 'upstream/master' into disambiguate-obje…

1b214da

…ct-lost

stephanie-wang merged commit 284dee4 into ray-project:master Sep 13, 2021

stephanie-wang deleted the disambiguate-object-lost branch September 13, 2021 23:16
