Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Add retry exception allowlist for user-defined filtering of retryable application-level errors. #25896

Merged

Conversation

clarkzinzow
Copy link
Contributor

@clarkzinzow clarkzinzow commented Jun 17, 2022

Both Ray libraries (Workflows, Datasets, etc.) and end-users need to be able to retry application-level errors; this is currently supported via the retry_exceptions argument. However, not all application-level errors are transient, and therefore not all application-level errors should be retried.

This PR adds supported for specifying an exception allowlist (List[Exception]) as the retry_exceptions argument, such that an application-level exception will only be retried if it is in the allowlist.

Exception Allowlist

@ray.remote(
    retry_exceptions=ConnectionError,
    max_retries=3,
)
def f(a):
    if a < 0:
        raise ValueError("invalid input")
    else:
        # connect_and_add might transiently raise a ConnectionError
        return connect_and_add(a)

# Fails without retrying the non-transient error.
ray.get(f.remote(-1))

# Retries up to 3 times.
ray.get(f.remote(5))

See the RFC for more motivation and API design discussion: #25743

Related issue number

Closes #25743

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone
Copy link
Contributor

With regard to the API, I feel if it's just a list of types (not that flexible as this one) which seems more friendly to the user?

Will this be better:

    retry_exceptions: Union[Exception, List[Exception], Callable[[Exception], bool]]

From the implementation point of view, we can always convert the first two to the third one.

@clarkzinzow
Copy link
Contributor Author

clarkzinzow commented Jun 17, 2022

@iycheng We discussed both in the RFC and internally, we settled on the exception predicate instead of an allowlist/denylist since it's more powerful, obviates the need for two separate list arguments to support both allowlists and denylists, and allows the user to multiplex on error metadata attached to the exception such as error/status codes.

This also appears to be the more popular API choice for application-level retry libraries (not transport-level retry libraries, which standardize on the status codes defined in the underlying transport protocol).

@clarkzinzow
Copy link
Contributor Author

clarkzinzow commented Jun 17, 2022

As for accepting both a predicate function and exceptions, implicitly turning the latter it into an allowlist predicate, I like that more than just having the allowlist! But won't that overly complicate the retry_exceptions type? It's already Union[bool, Callabe[[Exception], bool]], this would turn it into Union[bool, Exception, List[Exception], Callable[[Exception], bool]].

Another option would be providing predicate utilities similar to the Google Cloud Python SDK retry library. Concrete comparison below:

# 1. Status quo: predicate function.
@ray.remote(retry_exceptions=lambda e: isinstance(e, (BrokenPipeError, ConnectionResetError)))
def foo:
    pass

# 2. Predicate function + exception list
@ray.remote(retry_exceptions=[BrokenPipeError, ConnectionResetError])
def foo:
    pass

# 3. Predicate function + helpers
@ray.remote(retry_exceptions=ray.util.retry.if_exception_type([BrokenPipeError, ConnectionResetError]))
def foo:
    pass

I feel like the predicate helpers are too verbose to be worth it... Overall I think that I'm leaning towards 2.

@clarkzinzow clarkzinzow force-pushed the core/feat/task-retry-predicate branch from 69eb36d to 9cc0112 Compare June 17, 2022 23:54
@clarkzinzow
Copy link
Contributor Author

@iycheng I added the Union[Exception, List[Exception]] convenience option.

@clarkzinzow clarkzinzow added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jun 21, 2022
@clarkzinzow
Copy link
Contributor Author

@ericl @rkooo567 @pcmoritz API and implementation ready for review.

@clarkzinzow clarkzinzow force-pushed the core/feat/task-retry-predicate branch from 9cc0112 to e3ee2a4 Compare June 22, 2022 08:00
@clarkzinzow clarkzinzow requested a review from ericl June 22, 2022 23:22
"""
if len(serialized_retry_exception_allowlist) == 0:
# No exception allowlist specified, default to all retryable.
return True
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should never see an empty string coming from the Python frontend, but this can still happen in tests and may be allowed in cross-language scenarios, so we add this to be defensive.

@clarkzinzow clarkzinzow requested a review from jjyao July 1, 2022 19:33
@clarkzinzow clarkzinzow changed the title [Core] Add retry exception predicate for user-defined filtering of retryable application-level errors. [Core] Add retry exception allowlist for user-defined filtering of retryable application-level errors. Jul 1, 2022
@clarkzinzow clarkzinzow force-pushed the core/feat/task-retry-predicate branch 2 times, most recently from ed9aa1e to 4539bae Compare July 1, 2022 19:50
@clarkzinzow clarkzinzow force-pushed the core/feat/task-retry-predicate branch from 4539bae to 788ff9f Compare July 1, 2022 20:37
@clarkzinzow
Copy link
Contributor Author

@jjyao @ericl Feedback has been implemented! I'm planning on merging once the build completes (assuming no new failures), so lmk if there are any tweaks that should be made before merging.

@clarkzinzow
Copy link
Contributor Author

Failures appear to be unrelated, merging.

@clarkzinzow clarkzinzow merged commit 2a4d22f into ray-project:master Jul 2, 2022
@SongGuyang
Copy link
Contributor

SongGuyang commented Jul 4, 2022

@clarkzinzow windows://python/ray/tests:test_output failed after this PR.

SongGuyang added a commit to antgroup/ant-ray that referenced this pull request Jul 5, 2022
…ng of retryable application-level errors. (ray-project#25896)"

This reverts commit 2a4d22f.
clarkzinzow added a commit that referenced this pull request Aug 5, 2022
…ring of retryable application-level errors." (#26449)

This reverts commit cf7305a, and unreverts #25896.

This was reverted due to a failing Windows test: #26287

We can merge once the failing Windows test (and all other relevant tests) pass.
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Aug 5, 2022
…ring of retryable application-level errors." (ray-project#26449)

This reverts commit cf7305a, and unreverts ray-project#25896.

This was reverted due to a failing Windows test: ray-project#26287

We can merge once the failing Windows test (and all other relevant tests) pass.
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…ring of retryable application-level errors." (ray-project#26449)

This reverts commit cf7305a, and unreverts ray-project#25896.

This was reverted due to a failing Windows test: ray-project#26287

We can merge once the failing Windows test (and all other relevant tests) pass.

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
@@ -312,14 +312,17 @@ message TaskSpec {
string concurrency_group_name = 24;
// Whether application-level errors (exceptions) should be retried.
bool retry_exceptions = 25;
// A serialized exception list that serves as an allowlist of frontend-language
// exceptions/errors that should be retried.
bytes serialized_retry_exception_allowlist = 26;
Copy link
Contributor

@clarng clarng Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

want to call out the backwards-incompatibility of this change. Perhaps not that big of a deal today given we don't support rolling upgrades to the cluster, but it is still good to stay in the habit of making only backwards compatible change when possible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. core-interface-change-approval-required This changes the Ray core behavior / API and requires broader approvals.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core] [RFC] Add ability to filter application-level exceptions that should be retried.
9 participants