Add read_timeout and write_timeout #498
Closed
I believe this addresses issue #450.
While doing some failure testing against an AWS HA RDS cluster, we noticed that our connections would hang in net.(*conn).Read forever. The Postgres server was no longer responding, so statement_timeout had no effect; the stack traces for the affected goroutines showed them blocked in that read.
Since connect_timeout only applies to the initial connection, before it is placed in the pool, and statement_timeout is enforced by the Postgres server, which doesn't help when the server itself is unresponsive, we decided we wanted more timeout options. So I added read_timeout and write_timeout (in milliseconds) to the connection string parameters and use them to set deadlines on reads and writes to the database connection. If a read or write returns before the deadline, the deadline is cleared so the connection can be returned to the pool. If the deadline is reached, an error is returned and the connection eventually gets marked as bad.
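For reference, here's a rough sketch of the mechanism, with illustrative names rather than the literal diff (`timeoutConn` and its fields are stand-ins for what the patch does inside pq's `conn`):

```go
package timeoutconn

import (
	"net"
	"time"
)

// timeoutConn illustrates the approach: every Read/Write arms a deadline
// on the underlying net.Conn, and a call that completes in time clears
// the deadline again so the pooled connection isn't left with a stale one.
type timeoutConn struct {
	net.Conn
	readTimeout  time.Duration // parsed from read_timeout (milliseconds)
	writeTimeout time.Duration // parsed from write_timeout (milliseconds)
}

func (c *timeoutConn) Read(b []byte) (int, error) {
	if c.readTimeout > 0 {
		if err := c.SetReadDeadline(time.Now().Add(c.readTimeout)); err != nil {
			return 0, err
		}
	}
	n, err := c.Conn.Read(b)
	if err == nil && c.readTimeout > 0 {
		// Completed before the deadline: clear it (zero time = no deadline).
		err = c.SetReadDeadline(time.Time{})
	}
	return n, err
}

func (c *timeoutConn) Write(b []byte) (int, error) {
	if c.writeTimeout > 0 {
		if err := c.SetWriteDeadline(time.Now().Add(c.writeTimeout)); err != nil {
			return 0, err
		}
	}
	n, err := c.Conn.Write(b)
	if err == nil && c.writeTimeout > 0 {
		err = c.SetWriteDeadline(time.Time{})
	}
	return n, err
}
```

With the patch, the new parameters go in the connection string like any other, e.g. `postgres://user:pass@host/db?read_timeout=2000&write_timeout=2000` (values in milliseconds).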
This fixed the issue we were seeing, but we noticed that the old hung connections were still showing up as established in netstat. Looking at pq.(*conn).Close(), when the sql layer called it to close a bad connection, the code would just return driver.ErrBadConn without attempting to close the underlying connection. I changed that to close the connection and return the result of the close call, similar to what is done at the bottom of that function. This did the trick and the hung connections were cleaned up.
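Again as an illustrative sketch rather than the exact diff (`pooledConn` stands in for pq's `conn`), the Close() change amounts to:

```go
package timeoutconn

import "net"

// pooledConn stands in for pq's conn; bad is set once a deadline has
// fired or some other fatal error has occurred on the connection.
type pooledConn struct {
	bad bool
	c   net.Conn
}

func (cn *pooledConn) Close() error {
	if cn.bad {
		// Before: return driver.ErrBadConn without touching the socket,
		// which left the hung TCP connection established forever.
		// After: actually close it and return whatever Close reports,
		// so the OS can tear the connection down.
		return cn.c.Close()
	}
	// ...normal shutdown path (send Terminate, then close)...
	return cn.c.Close()
}
```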
Any feedback on these changes would be greatly appreciated. We have tested them in our fork as part of our application's failure testing and have had good results.