Race condition in pool_available metric causes negative values during network instability #4622

@DavraYoung

Problem Description

The pgrst_db_pool_available metric can drift into negative values during network instability or connection pool disruptions. We observed values as low as -233 with pgrst_db_pool_max set to 80, although a valid value must lie between 0 and 80.

Root Cause

The race condition is in src/PostgREST/Metrics.hs:35-41:

(HasqlPoolObs (SQL.ConnectionObservation _ status)) -> case status of
  SQL.ReadyForUseConnectionStatus  -> do
    incGauge poolAvailable
  SQL.InUseConnectionStatus        -> do
    decGauge poolAvailable
  SQL.TerminatedConnectionStatus  _ -> do
    decGauge poolAvailable
  SQL.ConnectingConnectionStatus -> pure ()

The incGauge and decGauge operations from prometheus-client are not atomic. During network instability:

  1. Multiple connections transition states simultaneously
  2. decGauge operations can occur before corresponding incGauge operations
  3. Connections that terminate before becoming "ready" decrement without ever incrementing
  4. The gauge drifts negative over time
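
Items 3 and 4 can be reproduced on paper without any concurrency. The sketch below replays a sequence of observed statuses through the same +1/-1 arithmetic as the handler above. The Status type, gaugeDelta, and replay are hypothetical stand-ins written for illustration, not PostgREST or hasql-pool code, and the exact order in which hasql-pool reports statuses is an assumption.

```haskell
import Data.List (foldl')

-- Hypothetical stand-in for the hasql-pool connection statuses
-- matched in the handler above (not the library's own type).
data Status = Connecting | ReadyForUse | InUse | Terminated
  deriving (Eq, Show)

-- Mirrors the handler's gauge arithmetic:
-- +1 when a connection becomes ready, -1 when it is in use or terminated.
gaugeDelta :: Status -> Int
gaugeDelta ReadyForUse = 1
gaugeDelta InUse       = -1
gaugeDelta Terminated  = -1
gaugeDelta Connecting  = 0

-- Replays observed statuses in order, starting from a gauge value of 0,
-- and returns the final gauge value.
replay :: [Status] -> Int
replay = foldl' (\gauge status -> gauge + gaugeDelta status) 0
```

Under these assumptions, replay [Connecting, ReadyForUse, InUse, ReadyForUse] evaluates to 1, while replay [Connecting, Terminated] evaluates to -1: the connection is subtracted without ever having been added. If a connection is reported as in use and then terminated, replay [Connecting, ReadyForUse, InUse, Terminated] also evaluates to -1, because both statuses decrement.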

Impact

  • Metrics are unreliable - monitoring and alerting based on pool_available can fire, or stay silent, incorrectly
  • No actual pool impact - the underlying pool works correctly; only the metric is corrupted
  • Persists until restart - the gauge never self-corrects; a PostgREST restart is required

Environment

  • PostgREST version: v12.2.12
  • prometheus-client constraint: >= 1.1.1 && < 1.2.0
  • hasql-pool constraint: >= 1.0.1 && < 1.1
  • Trigger: Network instability during Google GCE incident

Suggested Fix

The gauge updates need to be atomic. Options include:

  1. Use STM - Wrap the gauge in a TVar for atomic updates
  2. Use atomic operations - If prometheus-client supports atomic inc/dec
  3. Track absolute count - Calculate available from (total - in_use) instead of incrementing
  4. Add mutex/lock - Protect gauge updates with a lock (less ideal for performance)
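
As a rough illustration of option 3, the sketch below keeps the last known status per connection and derives the available count from that state instead of incrementing and decrementing. It assumes each observation carries a connection identifier (the field ignored with _ in the handler's pattern); ConnId, PoolState, observe, available, and replayState are hypothetical names, and the code depends on the containers package.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Hypothetical connection identifier; the real observation may use another type.
type ConnId = Int

-- Hypothetical stand-in for the hasql-pool connection statuses.
data Status = Connecting | ReadyForUse | InUse | Terminated
  deriving (Eq, Show)

-- Last known status of every live connection.
type PoolState = Map.Map ConnId Status

-- Records one observation: terminated connections are forgotten,
-- any other status overwrites the previous one for that connection.
observe :: ConnId -> Status -> PoolState -> PoolState
observe cid Terminated = Map.delete cid
observe cid status     = Map.insert cid status

-- Derives the number of available connections from the state,
-- so the result can never be negative.
available :: PoolState -> Int
available = Map.size . Map.filter (== ReadyForUse)

-- Replays a list of observations starting from an empty pool.
replayState :: [(ConnId, Status)] -> PoolState
replayState = foldl' (\st (cid, status) -> observe cid status st) Map.empty
```

With this approach, available (replayState [(1, Connecting), (1, Terminated)]) evaluates to 0 rather than -1. In a real handler the state would need to live in a shared reference such as a TVar or an IORef updated atomically (which also covers option 1), and the gauge could then be written with prometheus-client's setGauge instead of incGauge and decGauge.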

Workaround

Restart PostgREST to reset the gauge to a correct value.
