Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd-tester: add retry logic on retriving lease info #6788

Merged
merged 1 commit into from
Nov 3, 2016

Conversation

fanminshi
Copy link
Member

getting lease and keys info through raw rpcs rarely experience error such as EOF. This is considered as a failure and causes tester to clean up.
however, they are just transient problem with temporary connection issue which should not br considered as a testing failure. so we add retry logic in case of transient failure.

FIX #6754

@fanminshi
Copy link
Member Author

I suspect the EOF issue has nothing to do with our internal server code but it is a issue with the http library in which we are using. I think the simplest way to get around it is through retrying on client side. since connection errors are expected and client should take care of those.

cc @xiang90 @gyuho @heyitsanthony

@heyitsanthony
Copy link
Contributor

If it's errors from a disrupted connection, why not issue the RPC calls with FailFast=false?

@fanminshi
Copy link
Member Author

@heyitsanthony
getting lease info is using failfast=false. we are still experiencing EOF.

func (ls *leaseStresser) getLeaseByID(ctx context.Context, leaseID int64) (*pb.LeaseTimeToLiveResponse, error) {
    ltl := &pb.LeaseTimeToLiveRequest{ID: leaseID, Keys: true}
    return ls.lc.LeaseTimeToLive(ctx, ltl, grpc.FailFast(false))
}

@heyitsanthony
Copy link
Contributor

@fanminshi ok, that's a server-side issue. There needs to be an actual error code returned by the server that can be detected here instead of dropping the error on the floor.

@fanminshi
Copy link
Member Author

@heyitsanthony I dig into the lease code a bit. I think error is thrown from this line.
https://github.com/coreos/etcd/blob/master/lease/leasehttp/http.go#L187
I suspect that the http req failed due to EOF, and then that error is escalated to the client side.
I wonder how would the server side handles this kind of connection related error?

@fanminshi
Copy link
Member Author

The syndrome is similar to that of http://stackoverflow.com/questions/17714494/golang-http-request-results-in-eof-errors-when-making-multiple-requests-successi.

However, according to the post, that issue is patched in the go version 1.6. So we might experiencing something a bit different.

plog.Debugf("hasLeaseExpired %v resp %v error (%v)", leaseID, resp, err)
if rpctypes.Error(err) == rpctypes.ErrLeaseNotFound {
return true, nil
for ctx.Err() == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this for condition seems strange. how about checking the ctx.Err at the end of the loop instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also i think we only need to retry on the keepalive path, not here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiang90 I wonder which keepalive path are you talking about?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry. my mistake. i was confused.

@fanminshi fanminshi force-pushed the lease_http_eof_fix branch 2 times, most recently from accdb3b to c0a7c00 Compare November 2, 2016 22:47
@@ -54,14 +54,14 @@ func (hc *hashChecker) checkRevAndHashes() (err error) {
for i := 0; i < retries; i++ {
revs, hashes, err = hc.hrg.getRevisionHash()
if err != nil {
plog.Warningf("retry %i. failed to retrieve revison and hash (%v)", i, err)
plog.Warningf("retry %d. failed to retrieve revison and hash (%v)", i, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we fix this in another pr? thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep. will do

return true, nil
// keep retrying until lease's state is known or ctx is being canceled
for ctx.Err() == nil {
resp, err := ls.getLeaseByID(ctx, leaseID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error is not expected, right? probably we should print out warning here if there is an error other than lease not found.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, there shouldn't be error the first place. a warning makes sense.

@xiang90
Copy link
Contributor

xiang90 commented Nov 3, 2016

Let's clean this up. Then lgtm

// keep retrying until lease's state is known or ctx is being canceled
for ctx.Err() == nil {
resp, err := ls.getLeaseByID(ctx, leaseID)
plog.Warningf("hasLeaseExpired %v resp %v error (%v)", leaseID, resp, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this line after 270. we do not want to print it out if:

  1. there is no error
  2. the error is not found

getting lease and keys info through raw rpcs rarely experience error such as EOF. This is considered as a failure and causes tester to clean up.
however, they are just transient problem with temporary connection issue which should not be considered as a testing failure. so we add retry logic in case of transient failure.

FIX etcd-io#6754
@fanminshi
Copy link
Member Author

All fixed. PTAL @xiang90

@xiang90
Copy link
Contributor

xiang90 commented Nov 3, 2016

@fanminshi LGTM. Thanks a lot!

@fanminshi fanminshi merged commit b7ab5c6 into etcd-io:master Nov 3, 2016
@fanminshi fanminshi deleted the lease_http_eof_fix branch November 3, 2016 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants