Always retry a request even if the sender returns a non-nil error #464

jhendrixMSFT · 2019-08-21T18:00:13Z

Minor CHANGELOG fix.

Thank you for your contribution to Go-AutoRest! We will triage and review it as soon as we can.

As part of submitting, please make sure you can make the following assertions:

I've tested my changes, adding unit tests if applicable.
I've added Apache 2.0 Headers to the top of any new source files.
I'm submitting this PR to the dev branch, except in the case of urgent bug fixes warranting their own release.
If I'm targeting master, I've updated CHANGELOG.md to address the changes I'm making.

jhendrixMSFT · 2019-08-21T18:00:29Z

Fixes #450

devigned

The code LGTM, but I'm not sure if these error codes are truly inclusive of all of the error cases you might run into across OS's.

Also, you might want to take a look at what 1.13 is doing with https://golang.org/doc/go1.13#error_wrapping. It seems like you are close to the Unwrap / As pattern, but not quite there.

devigned · 2019-09-05T00:19:09Z

internal/internal_test.go

+
+func TestIsTemporaryNetworkErrorTrue(t *testing.T) {
+	if !IsTemporaryNetworkError(someTempError{}) {
+		t.Fatal("expected someTempError to be a temporary network error")


You really want to fail fast :)

kahing · 2019-09-12T07:12:04Z

I've been running this on a branch in Travis and just saw another transient failure: url.Err: *net.OpError: dial tcp 52.239.160.162:443: connect: connection refused

Maybe instead of having a list of errno to retry, it should keep a list of errno not to retry?

jhendrixMSFT · 2019-09-12T15:25:43Z

It's an interesting idea, but then the question becomes: what's the list of error codes we shouldn't retry? EHOSTDOWN and EHOSTUNREACH seem like good candidates, although couldn't these also be transient in the face of network connectivity problems? Getting all the conditions right seems tricky (same as the changes in this PR). I'm not against the idea, just wondering which approach is easier to maintain (keeping a list of errno values not to retry kind of feels like trying to prove a negative). Of course, we could just always retry on error, which is what we originally did; it's the simplest solution but runs the risk of retrying on non-transient failures.

kahing · 2019-09-17T04:52:35Z

retrying on non-transient failure is better than not retrying on transient failures

kahing · 2019-09-17T05:25:17Z

also, seems like with this PR I still see connection reset failures: https://travis-ci.org/kahing/goofys/jobs/585257461

if detailedError, ok := err.(autorest.DetailedError); ok {
	if urlErr, ok := detailedError.Original.(*url.Error); ok {
		// Log the wrapped error's type and value, then whether it reports as temporary or as a timeout.
		adl2Log.Errorf("url.Err: %T: %v %v %v", urlErr.Err, urlErr.Err, urlErr.Temporary(), urlErr.Timeout())
	}
}

that code produced this log line:

2019/09/15 17:42:09.184467 adlv2.ERROR url.Err: *net.OpError: read tcp 10.20.1.217:42936->52.239.160.162:443: read: connection reset by peer false false

mbrancato · 2019-09-21T02:58:49Z

I agree that retrying is better than not retrying.

> It's an interesting idea, but then the question becomes: what's the list of error codes we shouldn't retry? EHOSTDOWN and EHOSTUNREACH seem like good candidates, although couldn't these also be transient in the face of network connectivity problems? Getting all the conditions right seems tricky (same as the changes in this PR). I'm not against the idea, just wondering which approach is easier to maintain (keeping a list of errno values not to retry kind of feels like trying to prove a negative). Of course, we could just always retry on error, which is what we originally did; it's the simplest solution but runs the risk of retrying on non-transient failures.

I think this is addressed in the retry guidance:
https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#error-handling

My read is that 'timeout' there includes all connection failures, with a complete failure declared only after the 5 recommended attempts in the backoff process.

jhendrixMSFT · 2019-09-27T17:47:54Z

@kahing @mbrancato I've reworked this to always retry failed requests.

kahing · 2019-09-28T04:36:22Z

still got some errors:

... value autorest.DetailedError = autorest.DetailedError{Original:(*errors.errorString)(0xc0038d7930), PackageType:"storagedatalake.adl2PathClient", Method:"List", StatusCode:200, Message:"Failure responding to request", ServiceError:[]uint8(nil), Response:(*http.Response)(0xc000325b90)} ("storagedatalake.adl2PathClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'read tcp 10.20.0.122:53872->52.239.193.98:443: read: connection reset by peer'")

and

2019/09/28 03:14:41.831574 adlv2.ERROR azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to [secure]goofys-test-ac93v4y6q13janv8?resource=filesystem: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post https://login.microsoftonline.com/[secure]/oauth2/token?api-version=1.0: read tcp 10.20.0.99:55356->40.126.5.98:443: read: connection reset by peer'

jhendrixMSFT · 2019-09-30T15:15:30Z

Thanks for the info. For the first case we don't, at present, have any retry logic when reading a response body. The retry logic for this PR is only for calling the REST API. Can you please open a new issue to track adding retries for reading responses? Please note that this is likely a significant design change so I don't know how fast a fix will be forthcoming.
For the second case, looking at ServicePrincipalToken.refreshInternal() I don't believe we actually retry the request if it fails (only when authenticating via IMDS). I will update this PR to retry that case.

kahing · 2019-10-03T18:33:40Z

pass rate is better but still get #470 quite often. This is definitely a step forward though

jhendrixMSFT · 2019-10-03T18:40:31Z

CC @tombuildsstuff @katbyte

jhendrixMSFT requested a review from devigned August 21, 2019 18:00

jhendrixMSFT self-assigned this Aug 21, 2019

mbfrahry mentioned this pull request Sep 3, 2019

[WIP] azurerm_hdinsight_hadoop_cluster - Add edge node support hashicorp/terraform-provider-azurerm#4049

Closed

devigned approved these changes Sep 5, 2019

kahing added a commit to kahing/goofys that referenced this pull request Sep 5, 2019

try out the new connreset patch

434f1b0

refs Azure/go-autorest#464

mbrancato mentioned this pull request Sep 13, 2019

Add MSI for App Service / Functions support #463

Merged

Always retry a request even if the sender returns a non-nil error

f81aab2

jhendrixMSFT force-pushed the transient branch from 5a31b97 to f81aab2 Compare September 27, 2019 17:46

jhendrixMSFT changed the title ~~Treat connection time-outs and resets as temporary~~ Always retry a request even if the sender returns a non-nil error Sep 27, 2019

jhendrixMSFT mentioned this pull request Sep 30, 2019

ECONNRESET on read is not a transient network error #450

Closed

retry on sender error when refreshing auth token

cbacfcd

kahing mentioned this pull request Oct 1, 2019

retry when there's error reading response body #470

Closed

jhendrixMSFT merged commit d9a171c into Azure:master Oct 7, 2019

jhendrixMSFT deleted the transient branch October 7, 2019 21:37
