fix consistent high timeout err rate caused by prepareStatement #892

Merged (2 commits) on Apr 18, 2017


zhixinwen
Contributor

Currently, gocql caches the timeout error caused by network instability when preparing a statement.

As a result, if gocql fails to prepare a statement on a connection the first time because of a network issue, every subsequent call to that statement on that connection returns "context deadline exceeded".

This can result in a high and consistent error rate that can only be fixed by restarting the host.

We have been suffering from this in production for a while.

@zhixinwen
Contributor Author

zhixinwen commented Apr 17, 2017

A more detailed description of the problem can be found in this post.

conn.go Outdated
@@ -713,6 +713,9 @@ func (c *Conn) prepareStatement(ctx context.Context, stmt string, tracer Tracer)
	if err != nil {
		flight.err = err
		flight.wg.Done()
		if err.Error() == "context deadline exceeded" {
Contributor

Instead of comparing the error string, we should compare against the sentinel value context.DeadlineExceeded, but I think in this case we can just always remove the cached value on errors.

@zhixinwen
Contributor Author

@Zariel Fixed

@Zariel Zariel merged commit 09f0498 into apache:master Apr 18, 2017
@Zariel
Contributor

Zariel commented Apr 18, 2017

Great, thanks!

@rengawm

rengawm commented Apr 18, 2017

lol, this has been plaguing us for months and I just spent the last day digging into this same bug. I was literally writing up an issue report to describe my findings and see what people thought a good fix would be... and I went to grab some links from the latest commit to the code which causes the problem, and noticed that it just got fixed. Thanks! :D

zhixinwen added a commit to zhixinwen/gocql that referenced this pull request Apr 19, 2017
…he#892)

* fix timeout bug

* remove cache no matter which err returns
mincai pushed a commit to uber/peloton that referenced this pull request Jan 7, 2019
Summary:
Upgrading to a version which includes the fix for
apache/cassandra-gocql-driver#892 which we have
been hitting quite often in production.

Resolves T1359093

Reviewers: min, #peloton

Reviewed By: min, #peloton

Subscribers: jenkins

Maniphest Tasks: T1359093

Differential Revision: https://code.uberinternal.com/D1337813