Don't retry on internal exceptions #1110

Drvi · 2023-09-14T12:30:02Z

Previously, we'd retry on exceptions happening inside layers, e.g. when cloudsign layer in CloudStore.jl decodes base64 encoded SAS tokens and if those are malformed we could get an ArgumentException

nickrobinson251

can you explain why this is the right solution?

nickrobinson251 · 2023-09-14T12:33:34Z

src/clientlayers/RetryRequest.jl

@@ -77,6 +77,9 @@ function retrylayer(handler)
 end

 isrecoverable(ex) = true
+isrecoverable(ex::ArgumentError) = false


is isrecoverable(ex) = true just the wrong default?

e.g. what if CloudBase stops throwing a ArgumentError and decides to throw a MalformedTokenError?

can you explain why this is the right solution?

This seemed like the least disruptive fix for the problem we've seen. Maybe the right solution is to formalize the contract a layer needs to fulfill wrt throwing exceptions.

is isrecoverable(ex) = true just the wrong default?

I think Jacob planned a bigger refactor for retries, switching exception retries from opt-out to opt-in would make sense to me.

what if CloudBase stops throwing a ArgumentError and decides to throw a MalformedTokenError?

I understand, but I'm less worried about this possibility since this is the first instance of this issue we've seen and I don't expect the underlying packages to make such changes without us noticing.

i think this bug (i.e. we'd retry on exceptions happening inside layers) shows that isrecoverable(ex) = true is the wrong default. What would be the disruption of flipping that to isrecoverable(ex) = false?

I don't expect the underlying packages to make such changes without us noticing

packages like CloudBase.jl are users on HTTP.jl (not dependencies), and we don't have much visibility on users, so i am worried about other users running into this same issue (just instead of via an ArgumentError via a MethodError or a AssertionError or a BoundError or ...)

If @quinnj is not around, perhaps we can get a third view from @NHDaly (and/or whoever was running into this issue)?

Looks like Blame shows this as the PR that set that as the default: #990

Looks like maybe the PR was trying to change to always retry if the request is retryable (has a retryable body or is idempotent)? But yeah... it does seem like a strange default...

Do you have any memory of why that PR was made, @nickrobinson251?

i think that PR was a fix for some oversight in the preceding rewrite: https://github.com/JuliaWeb/HTTP.jl/pull/974/files

I don't believe that rewrite PR intended to change any behaviour, only increase flexibility. The description (from the previous draft of the same thing #877) was

this PR proposes more configurability for retry logic in the hopes that users won't opt to roll their own incorrect retry logic.

and i notice that before that refactor we had

isrecoverable(e) = false isrecoverable(e::Union{Base.EOFError, Base.IOError, MbedTLS.MbedException, OpenSSL.OpenSSLError}) = true isrecoverable(e::ArgumentError) = e.msg == "stream is closed or unusable" isrecoverable(e::Sockets.DNSError) = true isrecoverable(e::ConnectError) = true isrecoverable(e::RequestError) = isrecoverable(e.error) isrecoverable(e::StatusError) = retryable(e.status)

which makes me think the default of true was unintended

Amusingly, there is now a failing test:

@test HTTP.RetryRequest.isrecoverable(ErrorException(""))

so my attempt to make an improvement with as little behavior change as possible was doomed to fail from the beginning.

What would be the disruption of flipping that to isrecoverable(ex) = false?

The fact that we wouldn't retry because we relied on certain exceptions being considered recoverable implicitly. Even this current PR seem to have missed isrecoverable(e::ArgumentError) = e.msg == "stream is closed or unusable" which is unfortunate.

Drvi · 2023-09-14T15:58:34Z

Here are all mentions of isrecoverable, via git log -pG isrecoverable > isrecoverable.txt
isrecoverable.txt

Features an amazing ascii art for docs, where did it go? Would be great for extended help docstrings.

Drvi · 2023-09-14T17:29:57Z

Here is a short history of isrecoverable:

isrecoverable(e) = false
isrecoverable(e::Union{Base.EOFError, Base.IOError, MbedTLS.MbedException, OpenSSL.OpenSSLError}) = true
isrecoverable(e::ArgumentError) = e.msg == "stream is closed or unusable"
isrecoverable(e::Sockets.DNSError) = true
isrecoverable(e::ConnectError) = true
isrecoverable(e::RequestError) = isrecoverable(e.error)
isrecoverable(e::StatusError) = retryable(e.status)

->

# yes, nothing -- this effectively meant `isrecoverable(ex) = true`

->

isrecoverable(ex) = true
isrecoverable(ex::StatusError) = retryable(ex.status)

->

isrecoverable(ex) = true
isrecoverable(ex::ConnectError) = ex.error isa Sockets.DNSError && ex.error.code == EAI_AGAIN ? false : true
isrecoverable(ex::StatusError) = retryable(ex.status)

->

isrecoverable(ex) = true
isrecoverable(ex::CapturedException) = isrecoverable(ex.ex)
isrecoverable(ex::ConnectError) = ex.error isa Sockets.DNSError && ex.error.code == EAI_AGAIN ? false : true
isrecoverable(ex::StatusError) = retryable(ex.status)

->

isrecoverable(ex) = true
isrecoverable(ex::CapturedException) = isrecoverable(ex.ex)
isrecoverable(ex::ConnectError) = isrecoverable(ex.error)
isrecoverable(ex::Sockets.DNSError) = (ex.code == Base.UV_EAI_AGAIN)
isrecoverable(ex::StatusError) = retryable(ex.status)

From that it seems to me, that default to true was deliberate -- without it no common retriable exceptions would've been retried.

Drvi · 2023-09-14T17:46:11Z

So my understanding is that we want to go back full circle, carrying on some of the tweaks we did to the original isrecoverable methods:

isrecoverable(e) = false
isrecoverable(e::Union{Base.EOFError, Base.IOError, MbedTLS.MbedException, OpenSSL.OpenSSLError}) = true
isrecoverable(e::ArgumentError) = e.msg == "stream is closed or unusable"
+isrecoverable(ex::CapturedException) = isrecoverable(ex.ex) # forgot this one yesterday
-isrecoverable(e::Sockets.DNSError) = true
+isrecoverable(e::Sockets.DNSError) = (ex.code == Base.UV_EAI_AGAIN)
-isrecoverable(e::ConnectError) = true
+isrecoverable(e::ConnectError) = isrecoverable(e.error)
isrecoverable(e::RequestError) = isrecoverable(e.error)
isrecoverable(e::StatusError) = retryable(e.status)

quinnj · 2023-09-14T22:13:32Z

Just a quick chime-in that the discussion all sounds great and correct. We used to have default false w/ a bunch of opt-ins, then in a large refactor where we had a bunch of cases that weren't being correctly retried, we changed the default to true. I think it's fine to go back to default false, but I would just want to make sure we're handling cases like CompositeException, CapturedException, etc. i.e. we probably want to hook into ExceptionUnwrapping first and then pass the root error to a dispatchable check for whether it's recoverable. We may also want to consider making isrecoverable part of the HTTP public API so that if, for example, CloudBase does have a certain kind of error that should be retried, it can opt-in a custom exception type to being retryable.

Drvi · 2023-09-15T08:15:02Z

Thanks Jacob! Good points, I totally forgot about composite exceptions and ExceptionUnwrapping.

We may also want to consider making isrecoverable part of the HTTP public

I agree we want to do that eventually. Earlier I suggested that we might want to formalize what the layer can expect to be retried. Maybe via making it the layers job to wrap retryable exception in an RetryableException or something. This way the layer can signal that some ArgumentErrors are okay to retry and some are not. An alternative is to force the layers to always use custom types for exceptions and overload isrecoverable for them.

codecov-commenter · 2023-09-15T10:54:17Z

Codecov Report

Merging #1110 (2997de9) into master (cefae7b) will increase coverage by 0.14%.
The diff coverage is 80.00%.

@@            Coverage Diff             @@
##           master    #1110      +/-   ##
==========================================
+ Coverage   82.34%   82.49%   +0.14%     
==========================================
  Files          32       32              
  Lines        3042     3044       +2     
==========================================
+ Hits         2505     2511       +6     
+ Misses        537      533       -4

Files Changed	Coverage Δ
src/clientlayers/RetryRequest.jl	`85.41% <80.00%> (+2.80%)`	⬆️

... and 2 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

quinnj · 2023-09-15T16:29:34Z

src/clientlayers/RetryRequest.jl

@@ -92,9 +93,11 @@ end

 function no_retry_reason(ex, req)
    buf = IOBuffer()
+    unwrapped_ex = unwrap_exception(ex)


I don't think we need this unwrap_exception if we're doing it via dispatch in isrecoverable; so I think we remove it here and keep it above.

Oh, but this is just for the no_retry_reson, so maybe we just need this for printing, which is fine.

Exactly yeah, this is for printing. Thanks!

Don't retry on internal exceptions

5460108

Drvi requested review from quinnj and nickrobinson251 September 14, 2023 12:30

nickrobinson251 reviewed Sep 14, 2023

View reviewed changes

Make isrecoverable opt-in again

f976002

Drvi requested a review from NHDaly September 15, 2023 10:56

Fix typo

2997de9

Drvi requested a review from nickrobinson251 September 15, 2023 11:16

quinnj reviewed Sep 15, 2023

View reviewed changes

quinnj approved these changes Sep 15, 2023

View reviewed changes

Drvi merged commit bda4ef2 into master Sep 15, 2023

Drvi deleted the td-dont-retry-internal-errors branch September 15, 2023 16:38

Drvi mentioned this pull request Sep 18, 2023

Change logs to use current_exceptions_to_string() instead of exception= pattern #1092

Merged

Drvi mentioned this pull request Sep 27, 2023

Retry on TimeoutException, as we did in HTTP v1.9 #1114

Open

Don't retry on internal exceptions #1110

Don't retry on internal exceptions #1110

Uh oh!

Conversation

Drvi commented Sep 14, 2023

Uh oh!

nickrobinson251 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nickrobinson251 Sep 14, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Drvi commented Sep 14, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Drvi commented Sep 14, 2023

Uh oh!

Drvi commented Sep 14, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

quinnj commented Sep 14, 2023

Uh oh!

Drvi commented Sep 15, 2023

Uh oh!

codecov-commenter commented Sep 15, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

nickrobinson251 Sep 14, 2023 •

edited

Loading

Drvi commented Sep 14, 2023 •

edited

Loading

Drvi commented Sep 14, 2023 •

edited

Loading

codecov-commenter commented Sep 15, 2023 •

edited

Loading