Problem Description
The `pgrst_db_pool_available` metric can drift into negative values during network instability or connection pool disruptions. We observed values as low as -233 with `pgrst_db_pool_max: 80`, which is clearly incorrect.
Root Cause
The race condition is in `src/PostgREST/Metrics.hs:35-41`:

```haskell
(HasqlPoolObs (SQL.ConnectionObservation _ status)) -> case status of
  SQL.ReadyForUseConnectionStatus -> do
    incGauge poolAvailable
  SQL.InUseConnectionStatus -> do
    decGauge poolAvailable
  SQL.TerminatedConnectionStatus _ -> do
    decGauge poolAvailable
  SQL.ConnectingConnectionStatus -> pure ()
```

The `incGauge` and `decGauge` operations from prometheus-client are not atomic. During network instability:
- Multiple connections transition states simultaneously.
- `decGauge` operations can occur before their corresponding `incGauge` operations.
- Connections that terminate before ever becoming "ready" decrement without having incremented.
- The gauge drifts negative over time.
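The terminate-before-ready case above can be made harmless by clamping the gauge at zero inside a single atomic update. The following is a minimal sketch using an STM `TVar` as a simplified stand-in for the prometheus-client `Gauge`; the names `AtomicGauge`, `incGauge`, and `decGauge` are illustrative, not PostgREST's or prometheus-client's actual API.

```haskell
import Control.Concurrent.STM

-- Hypothetical gauge backed by a TVar; every update is a single
-- atomic read-modify-write, so concurrent transitions cannot interleave.
newtype AtomicGauge = AtomicGauge (TVar Int)

newAtomicGauge :: IO AtomicGauge
newAtomicGauge = AtomicGauge <$> newTVarIO 0

incGauge :: AtomicGauge -> IO ()
incGauge (AtomicGauge v) = atomically (modifyTVar' v (+ 1))

-- Clamped decrement: a connection that terminates before ever
-- becoming ready can no longer push the gauge below zero.
decGauge :: AtomicGauge -> IO ()
decGauge (AtomicGauge v) = atomically (modifyTVar' v (max 0 . subtract 1))

readGauge :: AtomicGauge -> IO Int
readGauge (AtomicGauge v) = readTVarIO v

main :: IO ()
main = do
  g <- newAtomicGauge
  decGauge g             -- terminate-before-ready: stays at 0, not -1
  incGauge g
  incGauge g
  readGauge g >>= print  -- prints 2
```

Clamping hides the out-of-order event rather than modeling it, but it guarantees the exported value never goes negative between restarts.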
Impact
- Metrics are unreliable: monitoring/alerting based on `pool_available` produces false negatives.
- No actual pool impact: the underlying pool works correctly; only the metric is corrupted.
- Persists until restart: the counter never self-corrects; it requires a PostgREST restart.
Environment
- PostgREST version: v12.2.12
- `prometheus-client` constraint: `>= 1.1.1 && < 1.2.0`
- `hasql-pool` constraint: `>= 1.0.1 && < 1.1`
- Trigger: network instability during a Google GCE incident
Suggested Fix
The gauge updates need to be atomic. Options include:
- Use STM: wrap the gauge in a `TVar` for atomic updates.
- Use atomic operations: if `prometheus-client` supports atomic inc/dec.
- Track an absolute count: calculate available as (total - in_use) instead of incrementing and decrementing.
- Add a mutex/lock: protect gauge updates with a lock (less ideal for performance).
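To illustrate the absolute-count option: instead of mutating a counter on every transition, track which connection ids are currently in use and derive availability as pool max minus the in-use count. This is a hedged sketch only; the types `ConnId`, `Status`, and `PoolState` and the `observe` function are hypothetical names, not PostgREST's or hasql-pool's API.

```haskell
import Data.IORef
import qualified Data.Set as Set

type ConnId = Int

data Status = Ready | InUse | Terminated
  deriving (Eq, Show)

data PoolState = PoolState
  { poolMax :: Int
  , inUse   :: Set.Set ConnId
  }

-- Record a connection's state transition. Deleting an id that was
-- never inserted is a no-op, so out-of-order or terminate-before-ready
-- events cannot corrupt the derived metric.
observe :: IORef PoolState -> ConnId -> Status -> IO ()
observe ref cid status = atomicModifyIORef' ref $ \st ->
  let st' = case status of
        InUse      -> st { inUse = Set.insert cid (inUse st) }
        Ready      -> st { inUse = Set.delete cid (inUse st) }
        Terminated -> st { inUse = Set.delete cid (inUse st) }
  in (st', ())

-- Availability is computed from absolute state, never decremented.
available :: IORef PoolState -> IO Int
available ref = do
  st <- readIORef ref
  pure (poolMax st - Set.size (inUse st))

main :: IO ()
main = do
  ref <- newIORef (PoolState 80 Set.empty)
  observe ref 1 InUse
  observe ref 2 Terminated  -- terminated before ever becoming ready
  observe ref 1 Ready
  available ref >>= print   -- prints 80, never negative
```

Because the metric is a pure function of the tracked state, it self-corrects as connections settle, rather than accumulating drift that only a restart clears.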
Workaround
Restart PostgREST to reset the counter to correct values.