-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Add retry exception allowlist for user-defined filtering of retryable application-level errors. #25896
[Core] Add retry exception allowlist for user-defined filtering of retryable application-level errors. #25896
Conversation
6666e82
to
c561be3
Compare
With regard to the API, I feel if it's just a list of types (not that flexible as this one) which seems more friendly to the user? Will this be better:
From the implementation point of view, we can always convert the first two to the third one. |
@iycheng We discussed both in the RFC and internally, we settled on the exception predicate instead of an allowlist/denylist since it's more powerful, obviates the need for two separate list arguments to support both allowlists and denylists, and allows the user to multiplex on error metadata attached to the exception such as error/status codes. This also appears to be the more popular API choice for application-level retry libraries (not transport-level retry libraries, which standardize on the status codes defined in the underlying transport protocol). |
As for accepting both a predicate function and exceptions, implicitly turning the latter it into an allowlist predicate, I like that more than just having the allowlist! But won't that overly complicate the Another option would be providing predicate utilities similar to the Google Cloud Python SDK retry library. Concrete comparison below: # 1. Status quo: predicate function.
@ray.remote(retry_exceptions=lambda e: isinstance(e, (BrokenPipeError, ConnectionResetError)))
def foo:
pass
# 2. Predicate function + exception list
@ray.remote(retry_exceptions=[BrokenPipeError, ConnectionResetError])
def foo:
pass
# 3. Predicate function + helpers
@ray.remote(retry_exceptions=ray.util.retry.if_exception_type([BrokenPipeError, ConnectionResetError]))
def foo:
pass I feel like the predicate helpers are too verbose to be worth it... Overall I think that I'm leaning towards 2. |
69eb36d
to
9cc0112
Compare
@iycheng I added the |
9cc0112
to
e3ee2a4
Compare
""" | ||
if len(serialized_retry_exception_allowlist) == 0: | ||
# No exception allowlist specified, default to all retryable. | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should never see an empty string coming from the Python frontend, but this can still happen in tests and may be allowed in cross-language scenarios, so we add this to be defensive.
ed9aa1e
to
4539bae
Compare
4539bae
to
788ff9f
Compare
Failures appear to be unrelated, merging. |
@clarkzinzow |
…ng of retryable application-level errors. (ray-project#25896)" This reverts commit 2a4d22f.
…ring of retryable application-level errors." (ray-project#26449) This reverts commit cf7305a, and unreverts ray-project#25896. This was reverted due to a failing Windows test: ray-project#26287 We can merge once the failing Windows test (and all other relevant tests) pass.
…ring of retryable application-level errors." (ray-project#26449) This reverts commit cf7305a, and unreverts ray-project#25896. This was reverted due to a failing Windows test: ray-project#26287 We can merge once the failing Windows test (and all other relevant tests) pass. Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
@@ -312,14 +312,17 @@ message TaskSpec { | |||
string concurrency_group_name = 24; | |||
// Whether application-level errors (exceptions) should be retried. | |||
bool retry_exceptions = 25; | |||
// A serialized exception list that serves as an allowlist of frontend-language | |||
// exceptions/errors that should be retried. | |||
bytes serialized_retry_exception_allowlist = 26; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
want to call out the backwards-incompatibility of this change. Perhaps not that big of a deal today given we don't support rolling upgrades to the cluster, but it is still good to stay in the habit of making only backwards compatible change when possible
Both Ray libraries (Workflows, Datasets, etc.) and end-users need to be able to retry application-level errors; this is currently supported via the
retry_exceptions
argument. However, not all application-level errors are transient, and therefore not all application-level errors should be retried.This PR adds supported for specifying an exception allowlist (
List[Exception]
) as theretry_exceptions
argument, such that an application-level exception will only be retried if it is in the allowlist.Exception Allowlist
See the RFC for more motivation and API design discussion: #25743
Related issue number
Closes #25743
Checks
scripts/format.sh
to lint the changes in this PR.