
[core][usability] Disambiguate ObjectLostErrors for better understandability #18292

Merged
104 changes: 97 additions & 7 deletions python/ray/exceptions.py
@@ -171,7 +171,7 @@ def __str__(self):
# due to the dependency failure.
# Print out a user-friendly
# message to explain that.
out.append(" Some of the input arguments for "
out.append(" At least one of the input arguments for "
"this task could not be computed:")
if i + 1 < len(lines) and lines[i + 1].startswith(" "):
# If the next line is indented with 2 spaces,
@@ -279,8 +279,11 @@ def __str__(self):
"to list active objects in the cluster.")


class ObjectLostError(RayError):
"""Indicates that an object has been lost due to node failure.
# TODO(XXX): Replace with ObjectLostError for backwards compatibility once all
# Python tests pass.
class ObjectUnreachableError(RayError):
Contributor suggested change:
-    class ObjectUnreachableError(RayError):
+    class ObjectLostError(RayError):

Contributor Author: I'll switch this back later, just putting this here for now so that I can see which Python tests would fail right now.

"""Generic error for objects that are unreachable due to node failure or
system error.

Attributes:
object_ref_hex: Hex ID of the object.
@@ -292,19 +295,103 @@ def __init__(self, object_ref_hex, call_site):
ray_constants.CALL_STACK_LINE_DELIMITER, "\n ")

def __str__(self):
msg = (f"Object {self.object_ref_hex} cannot be retrieved due to node "
"failure or system error.")
msg = f"Object {self.object_ref_hex} cannot be retrieved. "
if self.call_site:
msg += (f" The ObjectRef was created at: {self.call_site}")
msg += (f"The ObjectRef was created at: {self.call_site}")
else:
msg += (
" To see information about where this ObjectRef was created "
"To see information about where this ObjectRef was created "
"in Python, set the environment variable "
"RAY_record_ref_creation_sites=1 during `ray start` and "
Contributor: We also need to set this for the driver. Why don't we write a doc and make a link instead? (I think it is confusing for users anyway when they just read it.)

Contributor Author: I added a doc, but left it out of the error message for now. I think it's best if we try to make the error messages standalone.
"`ray.init()`.")
return msg


class ObjectLostError(ObjectUnreachableError):
"""Indicates that the object is lost from distributed memory, due to
node failure or system error.

Attributes:
object_ref_hex: Hex ID of the object.
"""

def __str__(self):
return super().__str__() + "\n\n" + (
f"All copies of {self.object_ref_hex} are lost "
"due to node failure.\n\n"
"If you did not receive a message about a worker node "
Contributor: Just a thought, but I think we should write a debugging guideline doc and make a link here.
"dying, this is likely a system-level bug. "
"Please file an issue on GitHub at "
Contributor: Prefer to remove "file an issue" logs. I think if I were a newbie user, I'd just think that Ray is unstable (since the message mentions a system-level bug).
"https://github.com/ray-project/ray/issues/new/choose.")
Contributor: Following https://spark.apache.org/error-message-guidelines.html, this can be rewritten as "All copies of {obj} have been lost due to node failure. Check cluster logs for more information about the failure." (Be direct.)


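The subclassing pattern introduced in this diff — each specific error appends its diagnosis to the generic `ObjectUnreachableError` text via `super().__str__()` — can be sketched standalone. This is a Ray-free toy with simplified bodies, not the actual implementation:

```python
# Minimal, Ray-free sketch of the message-composition pattern used above.
# Class and attribute names mirror the diff; the bodies are simplified.

class ObjectUnreachableError(Exception):
    """Generic error for objects unreachable due to node or system failure."""

    def __init__(self, object_ref_hex, call_site=""):
        self.object_ref_hex = object_ref_hex
        self.call_site = call_site

    def __str__(self):
        msg = f"Object {self.object_ref_hex} cannot be retrieved. "
        if self.call_site:
            msg += f"The ObjectRef was created at: {self.call_site}"
        else:
            msg += ("Set RAY_record_ref_creation_sites=1 to record "
                    "ObjectRef creation sites.")
        return msg


class ObjectLostError(ObjectUnreachableError):
    """All copies of the object were lost from distributed memory."""

    def __str__(self):
        # Prepend the generic text, then append the specific diagnosis.
        return super().__str__() + "\n\n" + (
            f"All copies of {self.object_ref_hex} are lost "
            "due to node failure.")


err = ObjectLostError("ffff0100", call_site="driver.py:42")
print(str(err))
```

Because each new class subclasses `ObjectUnreachableError`, existing `except` handlers written against the base class keep matching the more specific errors.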
class ObjectReleasedError(ObjectUnreachableError):
"""Indicates that an object has been released while there was still a
reference to it.

Attributes:
object_ref_hex: Hex ID of the object.
"""

def __str__(self):
return super().__str__() + "\n\n" + (
f"Object {self.object_ref_hex} has already been released.\n\n"
Contributor (@ericl, Sep 2, 2021): ReferenceCountingAssertionFailure: Attempted to retrieve an already-deleted object. This should not happen.

"This is likely due to a corner case in the distributed "
"reference counting protocol that can occur when a worker passes "
"an ObjectRef, then exits before the ref count at the "
"ObjectRef's owner can be updated. For example, suppose we "
"call x_ref = foo.remote(...), pass [x_ref] to an actor A, and "
"then actor A passes x_ref to actor B by calling "
"B.bar.remote(x_ref). In this case, the driver may release x "
Contributor: Is this a P0 bug?

Contributor: I think this is a long-living edge case we didn't handle. I remember I saw this in Stephanie's design doc?

Contributor Author: Yes, it's an existing corner case.

"before B.bar executes, and B will receive this error.\n\n"
"If your application does not match this scenario, then this is "
"likely a system-level bug in the distributed ref counting "
"protocol. Please file an issue on GitHub at "
"https://github.com/ray-project/ray/issues/new/choose.")
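The corner case this message describes can be illustrated with a toy owner-side reference table (hypothetical names and logic; this is not Ray's actual protocol code). If the owner processes the driver's release before actor A's "B now holds a reference" update arrives, the count transiently reaches zero and the object is freed early:

```python
# Toy model of the out-of-order reference-count race (hypothetical,
# heavily simplified; real distributed ref counting involves RPCs).

class ToyOwner:
    def __init__(self):
        self.ref_count = 1        # the driver's initial reference to x
        self.released = False

    def add_reference(self):      # e.g. "actor B now holds x_ref"
        if self.released:
            raise RuntimeError("object already released")
        self.ref_count += 1

    def remove_reference(self):   # e.g. "the driver dropped x_ref"
        self.ref_count -= 1
        if self.ref_count == 0:
            self.released = True  # object freed from distributed memory


owner = ToyOwner()
owner.remove_reference()          # racy ordering: driver's release lands first
try:
    owner.add_reference()         # A's update for B arrives too late
except RuntimeError as e:
    print(f"B would see ObjectReleasedError: {e}")
```

With the correct ordering (call `add_reference()` before `remove_reference()`) the count never reaches zero while a reference is outstanding, which is why the message calls this a protocol corner case rather than expected behavior.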


class OwnerDiedError(ObjectUnreachableError):
"""Indicates that the owner of the object has died while there is still a
reference to the object.

Attributes:
object_ref_hex: Hex ID of the object.
"""

def __str__(self):
return super().__str__() + "\n\n" + (
f"Object {self.object_ref_hex} cannot be retrieved because "
"the Python worker that first created the ObjectRef (via "
"`.remote()` or `ray.put()`) has exited. "
Contributor: Python worker that first created the ObjectRef (owner) via ".remote"... so that users would understand the meaning of "owner" (since it is mentioned in the exception name).

"This can happen because of node failure or "
"a system-level bug.\n\n"
Contributor: I think we should rather add a link to a simple & easily explained ownership model doc instead?

"If you did not receive a message about a worker node "
"dying, this is likely a system-level bug. "
"Please file an issue on GitHub at "
"https://github.com/ray-project/ray/issues/new/choose.")


class ObjectReconstructionFailedError(ObjectUnreachableError):
"""Indicates that the object could not be reconstructed via lineage
reconstruction because the maximum number of task retries was exceeded.

Attributes:
object_ref_hex: Hex ID of the object.
"""

def __str__(self):
return super().__str__() + "\n\n" + (
f"Attempted lineage reconstruction to recover object "
f"{self.object_ref_hex}, but recovery failed. "
Contributor: The object could not be reconstructed since the max number of retries was exceeded.

"This can happen if the task that creates this "
"object, or an object that this object depends on, "
"has already been executed up to its maximum number of "
"retries (3 for normal tasks, 0 for actor tasks).\n\n"
"Lineage reconstruction is under active development. "
"If you see this error, please file an issue at "
"https://github.com/ray-project/ray/issues/new/choose.")

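One subtlety when an error message spans several adjacent string literals, as in the `__str__` methods above: every fragment that contains a placeholder needs its own `f` prefix, otherwise the literal `{...}` text is emitted verbatim. A standalone illustration (hypothetical strings, same mechanism):

```python
# Adjacent string literals concatenate, but only f-prefixed fragments
# interpolate their placeholders.
ref = "ffff0100"
broken = (f"Attempted to recover object "
          "{ref}, but recovery failed.")   # second literal lacks the f prefix
fixed = (f"Attempted to recover object "
         f"{ref}, but recovery failed.")
print(broken)  # → Attempted to recover object {ref}, but recovery failed.
print(fixed)   # → Attempted to recover object ffff0100, but recovery failed.
```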

class GetTimeoutError(RayError):
"""Indicates that a call to the worker timed out."""
pass
@@ -338,6 +425,9 @@ def __str__(self):
RayActorError,
ObjectStoreFullError,
ObjectLostError,
ObjectReleasedError,
ObjectReconstructionFailedError,
OwnerDiedError,
GetTimeoutError,
AsyncioActorExit,
RuntimeEnvSetupError,
24 changes: 16 additions & 8 deletions python/ray/serialization.py
@@ -6,10 +6,11 @@
from ray import ray_constants
import ray._private.utils
from ray._private.gcs_utils import ErrorType
from ray.exceptions import (RayError, PlasmaObjectNotAvailable, RayTaskError,
RayActorError, TaskCancelledError,
WorkerCrashedError, ObjectLostError,
RaySystemError, RuntimeEnvSetupError)
from ray.exceptions import (
RayError, PlasmaObjectNotAvailable, RayTaskError, RayActorError,
TaskCancelledError, WorkerCrashedError, ObjectLostError,
ObjectReleasedError, OwnerDiedError, ObjectReconstructionFailedError,
RaySystemError, RuntimeEnvSetupError)
from ray._raylet import (
split_buffer,
unpack_pickle5_buffers,
@@ -222,15 +223,22 @@ def _deserialize_object(self, data, metadata, object_ref):
return RayActorError()
elif error_type == ErrorType.Value("TASK_CANCELLED"):
return TaskCancelledError()
elif error_type == ErrorType.Value("OBJECT_UNRECONSTRUCTABLE"):
elif error_type == ErrorType.Value("OBJECT_LOST"):
return ObjectLostError(object_ref.hex(),
object_ref.call_site())
elif error_type == ErrorType.Value("OBJECT_RELEASED"):
return ObjectReleasedError(object_ref.hex(),
object_ref.call_site())
elif error_type == ErrorType.Value("OWNER_DIED"):
return OwnerDiedError(object_ref.hex(), object_ref.call_site())
elif error_type == ErrorType.Value("OBJECT_UNRECONSTRUCTABLE"):
return ObjectReconstructionFailedError(object_ref.hex(),
object_ref.call_site())
elif error_type == ErrorType.Value("RUNTIME_ENV_SETUP_FAILED"):
return RuntimeEnvSetupError()
else:
assert error_type != ErrorType.Value("OBJECT_IN_PLASMA"), \
"Tried to get object that has been promoted to plasma."
assert False, "Unrecognized error type " + str(error_type)
return RaySystemError("Unrecognized error type " +
str(error_type))
elif data:
raise ValueError("non-null object should always have metadata")
else:
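The `_deserialize_object` branches above are essentially a tag-to-class dispatch. A Ray-free sketch of the same shape (toy names; the real code reads tags from protobuf `ErrorType` values and passes along the call site):

```python
# Hypothetical, simplified dispatch from a serialized error-type tag to an
# exception instance, with a generic fallback instead of a hard assert —
# mirroring the diff's change from `assert False` to RaySystemError.

class ObjectLostError(Exception): ...
class ObjectReleasedError(Exception): ...
class OwnerDiedError(Exception): ...
class ObjectReconstructionFailedError(Exception): ...
class RaySystemError(Exception): ...

_ERROR_CLASSES = {
    "OBJECT_LOST": ObjectLostError,
    "OBJECT_RELEASED": ObjectReleasedError,
    "OWNER_DIED": OwnerDiedError,
    "OBJECT_UNRECONSTRUCTABLE": ObjectReconstructionFailedError,
}

def deserialize_error(error_type: str, object_ref_hex: str) -> Exception:
    cls = _ERROR_CLASSES.get(error_type)
    if cls is None:
        # Unknown tag: surface a system error rather than crashing the worker.
        return RaySystemError(f"Unrecognized error type {error_type}")
    return cls(object_ref_hex)

print(type(deserialize_error("OWNER_DIED", "ffff0100")).__name__)
```

Returning an error object (instead of asserting) lets the caller decide whether an unrecognized tag is fatal, which matches the diff's replacement of `assert False` with `RaySystemError`.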
4 changes: 2 additions & 2 deletions python/ray/tests/test_cancel.py
@@ -7,14 +7,14 @@
import ray
from ray.exceptions import TaskCancelledError, RayTaskError, \
GetTimeoutError, WorkerCrashedError, \
ObjectLostError
ObjectUnreachableError
from ray._private.test_utils import SignalActor


def valid_exceptions(use_force):
if use_force:
return (RayTaskError, TaskCancelledError, WorkerCrashedError,
ObjectLostError)
ObjectUnreachableError)
else:
return (RayTaskError, TaskCancelledError)

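The test change works because `except` clauses (and the tuples passed to `pytest.raises`-style checks) match subclasses: listing `ObjectUnreachableError` in `valid_exceptions` covers `ObjectLostError`, `OwnerDiedError`, and the other new subclasses. A quick Ray-free sketch:

```python
# Catching the base class matches any of the new subclasses (toy classes).
class ObjectUnreachableError(Exception): ...
class ObjectLostError(ObjectUnreachableError): ...
class OwnerDiedError(ObjectUnreachableError): ...

caught = []
for exc_cls in (ObjectLostError, OwnerDiedError):
    try:
        raise exc_cls("ffff0100")
    except ObjectUnreachableError as e:
        caught.append(type(e).__name__)
print(caught)  # → ['ObjectLostError', 'OwnerDiedError']
```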
4 changes: 2 additions & 2 deletions python/ray/tests/test_client_terminate.py
@@ -8,14 +8,14 @@
from ray.exceptions import TaskCancelledError
from ray.exceptions import RayTaskError
from ray.exceptions import WorkerCrashedError
from ray.exceptions import ObjectLostError
from ray.exceptions import ObjectUnreachableError
from ray.exceptions import GetTimeoutError


def valid_exceptions(use_force):
if use_force:
return (RayTaskError, TaskCancelledError, WorkerCrashedError,
ObjectLostError)
ObjectUnreachableError)
else:
return (RayTaskError, TaskCancelledError)
