Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't retry on internal exceptions #1110

Merged
merged 3 commits into from
Sep 15, 2023
Merged

Conversation

Drvi
Copy link
Collaborator

@Drvi Drvi commented Sep 14, 2023

Previously, we'd retry on exceptions happening inside layers, e.g. when cloudsign layer in CloudStore.jl decodes base64 encoded SAS tokens and if those are malformed we could get an ArgumentException

Copy link
Collaborator

@nickrobinson251 nickrobinson251 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why this is the right solution?

@@ -77,6 +77,9 @@ function retrylayer(handler)
end

isrecoverable(ex) = true
isrecoverable(ex::ArgumentError) = false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is isrecoverable(ex) = true just the wrong default?

e.g. what if CloudBase stops throwing a ArgumentError and decides to throw a MalformedTokenError?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why this is the right solution?

This seemed like the least disruptive fix for the problem we've seen. Maybe the right solution is to formalize the contract a layer needs to fulfill wrt throwing exceptions.

is isrecoverable(ex) = true just the wrong default?

I think Jacob planned a bigger refactor for retries, switching exception retries from opt-out to opt-in would make sense to me.

what if CloudBase stops throwing a ArgumentError and decides to throw a MalformedTokenError?

I understand, but I'm less worried about this possibility since this is the first instance of this issue we've seen and I don't expect the underlying packages to make such changes without us noticing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this bug (i.e. we'd retry on exceptions happening inside layers) shows that isrecoverable(ex) = true is the wrong default. What would be the disruption of flipping that to isrecoverable(ex) = false?

I don't expect the underlying packages to make such changes without us noticing

packages like CloudBase.jl are users on HTTP.jl (not dependencies), and we don't have much visibility on users, so i am worried about other users running into this same issue (just instead of via an ArgumentError via a MethodError or a AssertionError or a BoundError or ...)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If @quinnj is not around, perhaps we can get a third view from @NHDaly (and/or whoever was running into this issue)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like Blame shows this as the PR that set that as the default: #990

Looks like maybe the PR was trying to change to always retry if the request is retryable (has a retryable body or is idempotent)? But yeah... it does seem like a strange default...

Do you have any memory of why that PR was made, @nickrobinson251?

Copy link
Collaborator

@nickrobinson251 nickrobinson251 Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think that PR was a fix for some oversight in the preceding rewrite: https://github.com/JuliaWeb/HTTP.jl/pull/974/files

I don't believe that rewrite PR intended to change any behaviour, only increase flexibility. The description (from the previous draft of the same thing #877) was

this PR proposes more configurability for retry logic in the hopes that users won't opt to roll their own incorrect retry logic.

and i notice that before that refactor we had

isrecoverable(e) = false 
isrecoverable(e::Union{Base.EOFError, Base.IOError, MbedTLS.MbedException, OpenSSL.OpenSSLError}) = true
isrecoverable(e::ArgumentError) = e.msg == "stream is closed or unusable"
isrecoverable(e::Sockets.DNSError) = true
isrecoverable(e::ConnectError) = true
isrecoverable(e::RequestError) = isrecoverable(e.error)
isrecoverable(e::StatusError) = retryable(e.status)

which makes me think the default of true was unintended

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amusingly, there is now a failing test:

@test HTTP.RetryRequest.isrecoverable(ErrorException(""))

so my attempt to make an improvement with as little behavior change as possible was doomed to fail from the beginning.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the disruption of flipping that to isrecoverable(ex) = false?

The fact that we wouldn't retry because we relied on certain exceptions being considered recoverable implicitly. Even this current PR seem to have missed isrecoverable(e::ArgumentError) = e.msg == "stream is closed or unusable" which is unfortunate.

@Drvi
Copy link
Collaborator Author

Drvi commented Sep 14, 2023

Here are all mentions of isrecoverable, via git log -pG isrecoverable > isrecoverable.txt
isrecoverable.txt

Features an amazing ascii art for docs, where did it go? Would be great for extended help docstrings.

@Drvi
Copy link
Collaborator Author

Drvi commented Sep 14, 2023

Here is a short history of isrecoverable:

isrecoverable(e) = false
isrecoverable(e::Union{Base.EOFError, Base.IOError, MbedTLS.MbedException, OpenSSL.OpenSSLError}) = true
isrecoverable(e::ArgumentError) = e.msg == "stream is closed or unusable"
isrecoverable(e::Sockets.DNSError) = true
isrecoverable(e::ConnectError) = true
isrecoverable(e::RequestError) = isrecoverable(e.error)
isrecoverable(e::StatusError) = retryable(e.status)

->

# yes, nothing -- this effectively meant `isrecoverable(ex) = true`

->

isrecoverable(ex) = true
isrecoverable(ex::StatusError) = retryable(ex.status)

->

isrecoverable(ex) = true
isrecoverable(ex::ConnectError) = ex.error isa Sockets.DNSError && ex.error.code == EAI_AGAIN ? false : true
isrecoverable(ex::StatusError) = retryable(ex.status)

->

isrecoverable(ex) = true
isrecoverable(ex::CapturedException) = isrecoverable(ex.ex)
isrecoverable(ex::ConnectError) = ex.error isa Sockets.DNSError && ex.error.code == EAI_AGAIN ? false : true
isrecoverable(ex::StatusError) = retryable(ex.status)

->

isrecoverable(ex) = true
isrecoverable(ex::CapturedException) = isrecoverable(ex.ex)
isrecoverable(ex::ConnectError) = isrecoverable(ex.error)
isrecoverable(ex::Sockets.DNSError) = (ex.code == Base.UV_EAI_AGAIN)
isrecoverable(ex::StatusError) = retryable(ex.status)

From that it seems to me, that default to true was deliberate -- without it no common retriable exceptions would've been retried.

@Drvi
Copy link
Collaborator Author

Drvi commented Sep 14, 2023

So my understanding is that we want to go back full circle, carrying on some of the tweaks we did to the original isrecoverable methods:

isrecoverable(e) = false
isrecoverable(e::Union{Base.EOFError, Base.IOError, MbedTLS.MbedException, OpenSSL.OpenSSLError}) = true
isrecoverable(e::ArgumentError) = e.msg == "stream is closed or unusable"
+isrecoverable(ex::CapturedException) = isrecoverable(ex.ex) # forgot this one yesterday
-isrecoverable(e::Sockets.DNSError) = true
+isrecoverable(e::Sockets.DNSError) = (ex.code == Base.UV_EAI_AGAIN)
-isrecoverable(e::ConnectError) = true
+isrecoverable(e::ConnectError) = isrecoverable(e.error)
isrecoverable(e::RequestError) = isrecoverable(e.error)
isrecoverable(e::StatusError) = retryable(e.status)

@quinnj
Copy link
Member

quinnj commented Sep 14, 2023

Just a quick chime-in that the discussion all sounds great and correct. We used to have default false w/ a bunch of opt-ins, then in a large refactor where we had a bunch of cases that weren't being correctly retried, we changed the default to true. I think it's fine to go back to default false, but I would just want to make sure we're handling cases like CompositeException, CapturedException, etc. i.e. we probably want to hook into ExceptionUnwrapping first and then pass the root error to a dispatchable check for whether it's recoverable. We may also want to consider making isrecoverable part of the HTTP public API so that if, for example, CloudBase does have a certain kind of error that should be retried, it can opt-in a custom exception type to being retryable.

@Drvi
Copy link
Collaborator Author

Drvi commented Sep 15, 2023

Thanks Jacob! Good points, I totally forgot about composite exceptions and ExceptionUnwrapping.

We may also want to consider making isrecoverable part of the HTTP public

I agree we want to do that eventually. Earlier I suggested that we might want to formalize what the layer can expect to be retried. Maybe via making it the layers job to wrap retryable exception in an RetryableException or something. This way the layer can signal that some ArgumentErrors are okay to retry and some are not. An alternative is to force the layers to always use custom types for exceptions and overload isrecoverable for them.

@codecov-commenter
Copy link

codecov-commenter commented Sep 15, 2023

Codecov Report

Merging #1110 (2997de9) into master (cefae7b) will increase coverage by 0.14%.
The diff coverage is 80.00%.

@@            Coverage Diff             @@
##           master    #1110      +/-   ##
==========================================
+ Coverage   82.34%   82.49%   +0.14%     
==========================================
  Files          32       32              
  Lines        3042     3044       +2     
==========================================
+ Hits         2505     2511       +6     
+ Misses        537      533       -4     
Files Changed Coverage Δ
src/clientlayers/RetryRequest.jl 85.41% <80.00%> (+2.80%) ⬆️

... and 2 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Drvi Drvi requested a review from NHDaly September 15, 2023 10:56
@@ -92,9 +93,11 @@ end

function no_retry_reason(ex, req)
buf = IOBuffer()
unwrapped_ex = unwrap_exception(ex)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this unwrap_exception if we're doing it via dispatch in isrecoverable; so I think we remove it here and keep it above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, but this is just for the no_retry_reson, so maybe we just need this for printing, which is fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly yeah, this is for printing. Thanks!

@Drvi Drvi merged commit bda4ef2 into master Sep 15, 2023
10 of 11 checks passed
@Drvi Drvi deleted the td-dont-retry-internal-errors branch September 15, 2023 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants