Skip to content

Add support for pg_stat_replication.(write_lag|flush_lag|replay_lag) #1007

Open
@spootze

Description

@spootze

Add support for pg_stat_replication.(write_lag|flush_lag|replay_lag)

Use case. Why is this important?

For simple primary:replica setups it would be convenient to be able to monitor standby lag in seconds from the primary. pg_stat_replication already includes this information on postgres>=10, but the exporter does not parse it. AFAIK, workarounds would include

  • monitoring byte offset from pg_stat_replication.write_lsn|flush_lsn|replay_lsn), but this does not quite capture issues where the replica has not replayed a business critical change however small in bytes.
  • monitoring delay on the replica, but this would require spinning up a separate monitor only to monitor a single value on the replica.

Notes

I'm assuming the exporter does not support parsing the interval data type, which is why these metrics are marked as DISCARD here. I wonder if the following, explicit approach would be appropriate in this case:

diff --git a/cmd/postgres_exporter/queries.go b/cmd/postgres_exporter/queries.go
index fa0b5c2..e28d7b4 100644
--- a/cmd/postgres_exporter/queries.go
+++ b/cmd/postgres_exporter/queries.go
@@ -53,7 +53,10 @@ var queryOverrides = map[string][]OverrideQuery{
 			SELECT *,
 				(case pg_is_in_recovery() when 't' then null else pg_current_wal_lsn() end) AS pg_current_wal_lsn,
 				(case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), pg_lsn('0/0'))::float end) AS pg_current_wal_lsn_bytes,
-				(case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::float end) AS pg_wal_lsn_diff
+				(case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::float end) AS pg_wal_lsn_diff,
+				(case pg_is_in_recovery() when 't' then null else  extract(epoch from write_lag) end) as write_lag_seconds,
+				(case pg_is_in_recovery() when 't' then null else  extract(epoch from flush_lag) end) as flush_lag_seconds,
+				(case pg_is_in_recovery() when 't' then null else  extract(epoch from replay_lag) end) as replay_lag_seconds
 			FROM pg_stat_replication
 			`,
 		},

That way the deviation from the official fields is not in conflict with the source data.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions