Description
We recently updated some of our postgres client libraries and experienced an almost total pool exhaustion roughly 2~3h after deploying the new version in production (this never happened before) on 2 out of 20 DBs.
We updated the following libraries.
pg 7.4.3 ~> 7.6.1
pg-pool 2.0.3 ~> 2.0.4
pg-promise 7.5.4 ~> 8.5.2
As the pool slowly exhausts, a majority of the queries hitting an affected process fail with "timeout exceeded when trying to connect" (https://github.com/brianc/node-pg-pool/blob/v2.0.4/index.js#L178). The percentage increases over time and the condition is persistent (~30min until revert).
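One way to watch the exhaustion as it develops is to log the pool counters periodically. The following is only a minimal sketch, assuming the pool object exposes pg-pool's totalCount, idleCount and waitingCount properties. The names poolSnapshot and watchPool and the 10s interval are hypothetical and not part of pg or pg-pool.

```javascript
// Hypothetical helper: summarizes the counters a pg-pool instance exposes.
// Works with any object that has totalCount, idleCount and waitingCount.
function poolSnapshot(pool) {
  const total = pool.totalCount;     // clients currently held by the pool
  const idle = pool.idleCount;       // clients sitting unused in the pool
  const waiting = pool.waitingCount; // callers queued for a client
  return { total, idle, checkedOut: total - idle, waiting };
}

// Hypothetical usage: log a snapshot at a fixed interval (10s is arbitrary).
function watchPool(pool, log = console.log, intervalMs = 10000) {
  const timer = setInterval(
    () => log(JSON.stringify(poolSnapshot(pool))),
    intervalMs
  );
  timer.unref(); // do not keep the process alive just for logging
  return timer;
}
```

If checkedOut stays at the pool maximum while waiting keeps growing, clients are probably being held or never released, rather than the database simply being slow.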
I've been debugging this for multiple days now, but I'm having a hard time identifying the exact root cause. So far I suspect:
brianc/node-pg-pool#86, which means we generally queue more work now, since pending queue items are no longer all dropped. However, we rarely saw "timeout exceeded when trying to connect" errors before, so this seems unlikely.
#1503, some kind of race condition here, as both affected DBs are occasionally hit by queries running into statement timeouts.
Any ideas, pointers, or potential areas for races would be really appreciated.
//cc @vitaly-t I know that pg-promise is not part of the pg distribution, but you seem really active here and I would prefer a single spot for discussion. Any insight would be appreciated.
We mainly (but not exclusively) use nested transactions via https://github.com/vitaly-t/pg-promise#transactions starting with SET LOCAL statement_timeout = 30000;
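For illustration, the transaction pattern described above could look roughly like the sketch below. It assumes `db` is a pg-promise database object providing `tx`, and that the transaction context provides `none`. The wrapper name withStatementTimeout is hypothetical, and this is not our actual production code.

```javascript
// Hypothetical wrapper; `db` is assumed to be a pg-promise database object.
function withStatementTimeout(db, timeoutMs, work) {
  // The value is interpolated into SQL, so only accept a plain integer.
  if (!Number.isInteger(timeoutMs) || timeoutMs < 0) {
    throw new TypeError('timeoutMs must be a non-negative integer');
  }
  return db.tx(t =>
    // SET LOCAL only lasts until the end of the current transaction.
    t.none(`SET LOCAL statement_timeout = ${timeoutMs}`)
      .then(() => work(t))
  );
}

// Hypothetical usage:
//   withStatementTimeout(db, 30000, t => t.any('SELECT 1'));
```

A query that exceeds the timeout is cancelled by the server and rejects the promise chain, which rolls the transaction back.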