Always retry a request even if the sender returns a non-nil error #464

Merged 2 commits into Azure:master on Oct 7, 2019

Member

@jhendrixMSFT jhendrixMSFT commented Aug 21, 2019

Minor CHANGELOG fix.

Thank you for your contribution to Go-AutoRest! We will triage and review it as soon as we can.

As part of submitting, please make sure you can make the following assertions:

  • I've tested my changes, adding unit tests if applicable.
  • I've added Apache 2.0 Headers to the top of any new source files.
  • I'm submitting this PR to the dev branch, except in the case of urgent bug fixes warranting their own release.
  • If I'm targeting master, I've updated CHANGELOG.md to address the changes I'm making.

@jhendrixMSFT jhendrixMSFT self-assigned this Aug 21, 2019
@jhendrixMSFT
Member Author

Fixes #450

Member

@devigned devigned left a comment

The code LGTM, but I'm not sure these error codes cover all of the error cases you might run into across OSes.

Also, you might want to take a look at what Go 1.13 is doing with error wrapping: https://golang.org/doc/go1.13#error_wrapping. It seems like you are close to the Unwrap / As pattern, but not quite there.
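For readers unfamiliar with the Unwrap / As pattern mentioned in this review comment, here is a minimal sketch. It is not the code from this PR: `wrappedError`, `tempError`, `isTemporary`, and `errPermanent` are hypothetical names used only for illustration, and the sketch assumes Go 1.13 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// wrappedError is a hypothetical stand-in for a library error type that
// carries an underlying cause.
type wrappedError struct {
	Method   string
	Original error
}

func (e wrappedError) Error() string {
	return fmt.Sprintf("%s: %v", e.Method, e.Original)
}

// Unwrap exposes the cause so errors.Is and errors.As can walk the chain.
func (e wrappedError) Unwrap() error { return e.Original }

// tempError is a hypothetical error that reports itself as temporary.
type tempError struct{}

func (tempError) Error() string   { return "temporary failure" }
func (tempError) Temporary() bool { return true }

// errPermanent is a plain error with no Temporary method.
var errPermanent = errors.New("permanent failure")

// temporary is the behavior we look for anywhere in the error chain.
type temporary interface{ Temporary() bool }

// isTemporary finds the first error in the chain that has a Temporary
// method and returns what that method reports.
func isTemporary(err error) bool {
	var t temporary
	return errors.As(err, &t) && t.Temporary()
}

func main() {
	fmt.Println(isTemporary(wrappedError{Method: "List", Original: tempError{}}))
	fmt.Println(isTemporary(wrappedError{Method: "List", Original: errPermanent}))
}
```

With an `Unwrap` method on the wrapper, callers no longer need nested type assertions to reach the underlying error.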


func TestIsTemporaryNetworkErrorTrue(t *testing.T) {
	if !IsTemporaryNetworkError(someTempError{}) {
		t.Fatal("expected someTempError to be a temporary network error")
	}
}
Member

You really want to fail fast :)

kahing added a commit to kahing/goofys that referenced this pull request Sep 5, 2019
@kahing

kahing commented Sep 12, 2019

I've been running this on a branch in travis and just saw another transient failure: url.Err: *net.OpError: dial tcp 52.239.160.162:443: connect: connection refused

Maybe instead of having a list of errno to retry, it should keep a list of errno not to retry?

@jhendrixMSFT
Member Author

It's an interesting idea, but then the question becomes: what's the list of error codes we shouldn't retry? EHOSTDOWN and EHOSTUNREACH seem like good candidates, although couldn't these also be transient when there are network connectivity problems? Getting all the conditions right might be tricky (same as the changes in this PR). I'm not against the idea, just wondering which approach is easier to maintain (keeping a list of errno values not to retry kinda feels like trying to prove a negative). Of course, we could just always retry on error, which is what we originally did; it's the simplest solution but runs the risk of retrying on non-transient failures.
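To make the trade-off in this comment concrete, here is a rough sketch of the two approaches side by side. The map contents are placeholders, not a vetted list, and every name here (`retryAllowlist`, `retryDenylist`, `retryWithAllowlist`, `retryWithDenylist`, `errRefused`, `errHostDown`) is hypothetical rather than taken from go-autorest.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// retryAllowlist holds errno values treated as transient: only these retry.
var retryAllowlist = map[syscall.Errno]bool{
	syscall.ECONNRESET: true,
	syscall.ETIMEDOUT:  true,
}

// retryDenylist holds errno values assumed not worth retrying.
// Which values belong here is the open question in the thread.
var retryDenylist = map[syscall.Errno]bool{
	syscall.EHOSTDOWN:    true,
	syscall.EHOSTUNREACH: true,
}

// retryWithAllowlist retries only errors whose errno is explicitly listed.
func retryWithAllowlist(err error) bool {
	var errno syscall.Errno
	return errors.As(err, &errno) && retryAllowlist[errno]
}

// retryWithDenylist retries every error except those explicitly excluded.
func retryWithDenylist(err error) bool {
	var errno syscall.Errno
	if errors.As(err, &errno) {
		return !retryDenylist[errno]
	}
	return err != nil // error kinds without an errno are retried
}

// Sample errors; errRefused mirrors the "connection refused" failure
// reported earlier in the thread.
var (
	errRefused  = fmt.Errorf("dial tcp: connect: %w", syscall.ECONNREFUSED)
	errHostDown = fmt.Errorf("dial tcp: connect: %w", syscall.EHOSTDOWN)
)

func main() {
	fmt.Println("allowlist retries refused:", retryWithAllowlist(errRefused))
	fmt.Println("denylist retries refused: ", retryWithDenylist(errRefused))
	fmt.Println("denylist retries hostdown:", retryWithDenylist(errHostDown))
}
```

The allowlist misses any errno nobody thought to add (here, ECONNREFUSED), while the denylist retries it by default.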

@kahing

kahing commented Sep 17, 2019

Retrying on non-transient failures is better than not retrying on transient failures.

@kahing

kahing commented Sep 17, 2019

also, seems like with this PR I still see connection reset failures: https://travis-ci.org/kahing/goofys/jobs/585257461

if detailedError, ok := err.(autorest.DetailedError); ok {
	if urlErr, ok := detailedError.Original.(*url.Error); ok {
		// Log the wrapped error's type and value, then whether it reports as temporary or as a timeout.
		adl2Log.Errorf("url.Err: %T: %v %v %v", urlErr.Err, urlErr.Err, urlErr.Temporary(), urlErr.Timeout())
	}
}

that code produced this log line:

2019/09/15 17:42:09.184467 adlv2.ERROR url.Err: *net.OpError: read tcp 10.20.1.217:42936->52.239.160.162:443: read: connection reset by peer false false
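Since the log line above reports `false` for both `Temporary()` and `Timeout()`, a reset like this has to be detected by looking at the underlying errno. The following is a small sketch of one way to do that with `errors.Is` on Go 1.13 or later. `sampleResetError`, `isConnReset`, and `errOther` are hypothetical helpers, and the error chain is built by hand to resemble the logged error instead of coming from a real connection.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
	"os"
	"syscall"
)

// sampleResetError builds an error shaped like the one in the log line:
// *url.Error -> *net.OpError -> *os.SyscallError -> syscall.ECONNRESET.
func sampleResetError() error {
	return &url.Error{
		Op:  "Get",
		URL: "https://example.invalid/",
		Err: &net.OpError{
			Op:  "read",
			Net: "tcp",
			Err: os.NewSyscallError("read", syscall.ECONNRESET),
		},
	}
}

// errOther is an unrelated error used for contrast.
var errOther = errors.New("unrelated failure")

// isConnReset walks the chain via each type's Unwrap method and reports
// whether ECONNRESET appears anywhere in it.
func isConnReset(err error) bool {
	return errors.Is(err, syscall.ECONNRESET)
}

func main() {
	fmt.Println(isConnReset(sampleResetError()))
	fmt.Println(isConnReset(errOther))
}
```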

@mbrancato
Contributor

I agree that retrying is better than not retrying.

> It's an interesting idea, but then the question becomes: what's the list of error codes we shouldn't retry? EHOSTDOWN and EHOSTUNREACH seem like good candidates, although couldn't these also be transient when there are network connectivity problems? Getting all the conditions right might be tricky (same as the changes in this PR). I'm not against the idea, just wondering which approach is easier to maintain (keeping a list of errno values not to retry kinda feels like trying to prove a negative). Of course, we could just always retry on error, which is what we originally did; it's the simplest solution but runs the risk of retrying on non-transient failures.

I think this is addressed in the retry guidance:
https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#error-handling

My read of 'timeout' there is that it includes all connection failures, with the request failing outright only after the 5 recommended attempts in the backoff process.
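As an illustration of what a 5-attempt backoff schedule could look like, here is a minimal sketch. It assumes exponential doubling with a cap. The base delay, the cap, and the `backoffDelays` helper are illustrative assumptions and are not taken from the linked guidance or from go-autorest.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelays returns the wait before each attempt's retry:
// base, 2*base, 4*base, ... with every value capped at limit.
func backoffDelays(attempts int, base, limit time.Duration) []time.Duration {
	if attempts <= 0 {
		return nil
	}
	delays := make([]time.Duration, 0, attempts)
	d := base
	for i := 0; i < attempts; i++ {
		if d > limit {
			d = limit
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

func main() {
	// Five attempts, starting at one second and capped at eight seconds.
	for i, d := range backoffDelays(5, time.Second, 8*time.Second) {
		fmt.Printf("attempt %d: wait %v\n", i+1, d)
	}
}
```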

@jhendrixMSFT jhendrixMSFT changed the title from "Treat connection time-outs and resets as temporary" to "Always retry a request even if the sender returns a non-nil error" Sep 27, 2019
@jhendrixMSFT
Member Author

@kahing @mbrancato I've reworked this to always retry failed requests.
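The idea of always retrying can be sketched roughly as follows. This is not the implementation merged in this PR: `doWithRetry` and `flakySender` are hypothetical, the sender is reduced to a plain closure, and a real implementation would also need to rewind the request body between attempts and honor context cancellation.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// doWithRetry calls send up to attempts times. Any non-nil error triggers
// another attempt; no attempt is made to classify it as transient.
func doWithRetry(send func() (*http.Response, error), attempts int, delay time.Duration) (*http.Response, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := send()
		if err == nil {
			return resp, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay) // fixed delay keeps the sketch short
		}
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

// flakySender returns a sender that fails the first `failures` calls,
// plus a pointer to its call counter.
func flakySender(failures int) (func() (*http.Response, error), *int) {
	calls := 0
	return func() (*http.Response, error) {
		calls++
		if calls <= failures {
			return nil, errors.New("read: connection reset by peer")
		}
		return &http.Response{StatusCode: http.StatusOK}, nil
	}, &calls
}

func main() {
	send, calls := flakySender(2)
	resp, err := doWithRetry(send, 5, 10*time.Millisecond)
	fmt.Println(resp.StatusCode, err, *calls)
}
```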

@kahing

kahing commented Sep 28, 2019

still got some errors:

... value autorest.DetailedError = autorest.DetailedError{Original:(*errors.errorString)(0xc0038d7930), PackageType:"storagedatalake.adl2PathClient", Method:"List", StatusCode:200, Message:"Failure responding to request", ServiceError:[]uint8(nil), Response:(*http.Response)(0xc000325b90)} ("storagedatalake.adl2PathClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'read tcp 10.20.0.122:53872->52.239.193.98:443: read: connection reset by peer'")

and

2019/09/28 03:14:41.831574 adlv2.ERROR azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to [secure]goofys-test-ac93v4y6q13janv8?resource=filesystem: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post https://login.microsoftonline.com/[secure]/oauth2/token?api-version=1.0: read tcp 10.20.0.99:55356->40.126.5.98:443: read: connection reset by peer'

@jhendrixMSFT
Member Author

Thanks for the info. For the first case we don't, at present, have any retry logic when reading a response body. The retry logic for this PR is only for calling the REST API. Can you please open a new issue to track adding retries for reading responses? Please note that this is likely a significant design change so I don't know how fast a fix will be forthcoming.
For the second case, looking at ServicePrincipalToken.refreshInternal() I don't believe we actually retry the request if it fails (only when authenticating via IMDS). I will update this PR to retry that case.
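To show why the first case is a larger design change, here is a hedged sketch of what retrying a failed body read could involve: the request itself has to be re-sent, because a partially read body cannot be resumed. This is an illustration only, not a proposal for go-autorest. `fetchBody`, `failingBody`, and `bodyFailsOnce` are hypothetical, the approach buffers the whole body in memory, and it is only safe for idempotent requests. It assumes Go 1.16 or later for `io.ReadAll` and `io.NopCloser`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// failingBody simulates a response body whose read is reset by the peer.
type failingBody struct{}

func (failingBody) Read([]byte) (int, error) {
	return 0, errors.New("read: connection reset by peer")
}
func (failingBody) Close() error { return nil }

// fetchBody treats "send the request and read the whole body" as a single
// retryable unit, so a failure while reading re-sends the request.
func fetchBody(send func() (*http.Response, error), attempts int) ([]byte, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := send()
		if err != nil {
			lastErr = err
			continue
		}
		data, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err == nil {
			return data, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

// bodyFailsOnce returns a sender whose first response has a failing body
// and whose later responses carry the payload "ok".
func bodyFailsOnce() func() (*http.Response, error) {
	calls := 0
	return func() (*http.Response, error) {
		calls++
		if calls == 1 {
			return &http.Response{StatusCode: http.StatusOK, Body: failingBody{}}, nil
		}
		body := io.NopCloser(strings.NewReader("ok"))
		return &http.Response{StatusCode: http.StatusOK, Body: body}, nil
	}
}

func main() {
	data, err := fetchBody(bodyFailsOnce(), 3)
	fmt.Println(string(data), err)
}
```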

@kahing

kahing commented Oct 3, 2019

The pass rate is better, but I still hit #470 quite often. This is definitely a step forward, though.

@jhendrixMSFT
Member Author

CC @tombuildsstuff @katbyte

@jhendrixMSFT jhendrixMSFT merged commit d9a171c into Azure:master Oct 7, 2019
@jhendrixMSFT jhendrixMSFT deleted the transient branch October 7, 2019 21:37
4 participants