WPB-19318: Ensure high-availability of the postgress cluster #807

sghosh23 · 2025-08-28T07:50:30Z

Please check out the system design doc for more info: https://wearezeta.atlassian.net/wiki/spaces/CUSSOPS/pages/2088108112/PostgreSQL+High+Availability+System+Design

Change type

Fix
Feature
Documentation
Security / Upgrade

Basic information

THIS CHANGE REQUIRES A DEPLOYMENT PACKAGE RELEASE
THIS CHANGE REQUIRES A WIRE-DOCS RELEASE

Testing

I ran/applied the changes myself, in a test environment.
The CI job attached to this repo will test it for me.

Tracking

I added a new entry in an appropriate subdirectory of changelog.d
I mentioned this PR in Jira, OR I mentioned the Jira ticket in this PR.
I mentioned this PR in one of the issues attached to one of our repositories.

Knowledge Transfer

An Asciinema session is attached to the Jira ticket.

Motivation

Objective

Reason

Use case

offline/postgresql-cluster.md

nix/pkgs/wire-binaries.nix

ansible/templates/postgresql/simple_fence.sh.j2

ansible/templates/postgresql/postgresql_primary.conf.j2

ansible/templates/postgresql/detect_rouge_primary.sh.j2

ansible/postgresql-deploy.yml

…sive docs
- Consolidate PostgreSQL configuration into single unified template
- Fix split-brain detection script (correct 'rouge' to 'rogue' typo)
- Add detailed HA features documentation with failover validation
- Include monitoring & event system documentation
- Add node_id and priority configuration parameters
- Add official repmgr and PostgreSQL documentation references
- Improve deployment commands and monitoring checks
- Enhance split-brain protection with advanced features

- Remove duplicate HA features list from Key Concepts section
- Remove duplicate monitoring system section from Configuration Options
- Fix incorrect numbering in monitoring commands (5 → 8)
- Consolidate monitoring information into single comprehensive section

- PostgreSQL cluster runs independently, not integrated with endpoint-manager
- Explain postgres-endpoint-manager as separate component that monitors cluster externally
- Emphasize independent operation of cluster vs endpoint management

mohitrajain · 2025-09-25T09:34:17Z

dumping status of services and logs

sudo systemctl status postgresql@17-main repmgrd@17-main detect-rogue-primary.timer -l --no-pager
● postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Wed 2025-09-24 12:45:01 UTC; 20h ago
   Main PID: 18744 (postgres)
      Tasks: 7 (limit: 4532)
     Memory: 107.8M
        CPU: 9min 54.399s
     CGroup: /system.slice/system-postgresql.slice/postgresql@17-main.service
             ├─18744 /usr/lib/postgresql/17/bin/postgres -D /var/lib/postgresql/17/main -c unix_socket_directories=/var/run/postgresql -c config_file=/etc/postgresql/17/main/postgresql.conf
             ├─18745 "postgres: logger "
             ├─18746 "postgres: checkpointer "
             ├─18747 "postgres: background writer "
             ├─18748 "postgres: startup recovering 00000001000000000000000B"
             ├─18749 "postgres: walreceiver streaming 0/B4E1D80"
             └─18835 "postgres: repmgr repmgr 10.1.1.6(48288) idle"

Sep 24 12:44:58 cassandra-warm-mackerel systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 24 12:45:01 cassandra-warm-mackerel systemd[1]: Started PostgreSQL Cluster 17-main.

● repmgrd@17-main.service - Repmgr failover daemon (instance 17-main)
     Loaded: loaded (/etc/systemd/system/repmgrd@.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2025-09-24 12:45:14 UTC; 20h ago
   Main PID: 18837 (repmgrd)
      Tasks: 1 (limit: 4532)
     Memory: 2.0M
        CPU: 19min 44.459s
     CGroup: /system.slice/system-repmgrd.slice/repmgrd@17-main.service
             └─18837 /usr/lib/postgresql/17/bin/repmgrd -f /etc/repmgr/17-main/repmgr.conf --daemonize

Sep 25 08:45:11 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 08:50:13 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 08:55:15 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:00:17 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:05:19 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:10:20 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:15:22 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:20:24 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:25:26 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:30:27 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state

● detect-rogue-primary.timer - PostgreSQL Split-Brain Detection Timer
     Loaded: loaded (/etc/systemd/system/detect-rogue-primary.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Wed 2025-09-24 12:45:29 UTC; 20h ago
    Trigger: Thu 2025-09-25 09:33:30 UTC; 12s left
   Triggers: ● detect-rogue-primary.service
       Docs: man:systemd.timer(5)

Sep 24 12:45:29 cassandra-warm-mackerel systemd[1]: Stopping PostgreSQL Split-Brain Detection Timer...
Sep 24 12:45:29 cassandra-warm-mackerel systemd[1]: Started PostgreSQL Split-Brain Detection Timer.

mohitrajain · 2025-09-25T09:41:33Z

sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show
 ID | Name        | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string                                                                
----+-------------+---------+-----------+-------------+----------+----------+----------+-----------------------------------------------------------------------------------
 1  | postgresql1 | primary | * running |             | default  | 150      | 1        | host=10.1.1.8 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 2  | postgresql2 | standby |   running | postgresql1 | default  | 100      | 1        | host=10.1.1.7 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 3  | postgresql3 | standby |   running | postgresql1 | default  | 50       | 1        | host=10.1.1.6 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
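
A quick sanity check on this table is to count how many nodes report themselves as a running primary; anything other than exactly one deserves a closer look. The following is only a minimal sketch: the function name `count_running_primaries` is made up for illustration, and it assumes the column layout shown above (Role in the third column, Status in the fourth).

```shell
#!/usr/bin/env bash
# Hypothetical helper: count nodes that `repmgr cluster show` lists as a
# running primary. Reads the table from stdin so it can be run against
# captured output as well as a live cluster.
count_running_primaries() {
  awk -F'|' '
    # Data rows have a numeric node ID in the first column; the header and
    # the separator line do not, so they are skipped.
    $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
      role = $3
      status = $4
      gsub(/[ \t]/, "", role)
      if (role == "primary" && status ~ /running/) n++
    }
    END { print n + 0 }
  '
}
```

Usage, with the same command as above: `sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show | count_running_primaries`.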

mohitrajain · 2025-09-25T10:10:42Z

repmgr brings the postgresql service back up if it finds it stopped

sudo systemctl stop postgresql@17-main.service 
root@cassandra-leading-eagle:~# sudo systemctl status postgresql@17-main.service 
○ postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Thu 2025-09-25 10:08:27 UTC; 1s ago
    Process: 177096 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 17-main stop (code>
   Main PID: 24886 (code=exited, status=0/SUCCESS)
        CPU: 35min 18.968s

Sep 24 12:44:06 cassandra-leading-eagle systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 24 12:44:09 cassandra-leading-eagle systemd[1]: Started PostgreSQL Cluster 17-main.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: Stopping PostgreSQL Cluster 17-main...
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: postgresql@17-main.service: Deactivated successfully.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: Stopped PostgreSQL Cluster 17-main.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: postgresql@17-main.service: Consumed 35min 18.968s C>
root@cassandra-leading-eagle:~# sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show
 ID | Name        | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string                                                                
----+-------------+---------+-----------+-------------+----------+----------+----------+-----------------------------------------------------------------------------------
 1  | postgresql1 | primary | * running |             | default  | 150      | 1        | host=10.1.1.8 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 2  | postgresql2 | standby |   running | postgresql1 | default  | 100      | 1        | host=10.1.1.7 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 3  | postgresql3 | standby |   running | postgresql1 | default  | 50       | 1        | host=10.1.1.6 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
root@cassandra-leading-eagle:~# sudo systemctl status postgresql@17-main.service 
● postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2025-09-25 10:08:34 UTC; 19s ago
    Process: 177113 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 17-main start (code=exite>
   Main PID: 177118 (postgres)
      Tasks: 14 (limit: 4532)
     Memory: 39.6M
        CPU: 1.399s
     CGroup: /system.slice/system-postgresql.slice/postgresql@17-main.service
             ├─177118 /usr/lib/postgresql/17/bin/postgres -D /var/lib/postgresql/17/main -c unix_socket_>
             ├─177119 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177120 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177121 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >
             ├─177123 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177124 "postgres: autovacuum launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ">
             ├─177125 "postgres: archiver " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >
             ├─177126 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" ">
             ├─177127 "postgres: walsender repmgr 10.1.1.6(41268) streaming 0/C005EB0" "" "" "" "" "" "">
             ├─177129 "postgres: walsender repmgr 10.1.1.7(47974) streaming 0/C005EB0" "" "" "" "" "" "">
             ├─177160 "postgres: repmgr repmgr 10.1.1.6(34430) idle" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177163 "postgres: repmgr repmgr 10.1.1.7(47666) idle" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177165 "postgres: repmgr repmgr 10.1.1.8(50260) idle" "" "" "" "" "" "" "" "" "" "" "" "">
             └─177224 "postgres: wire-server wire-server 10.1.1.15(7997) authentication" "" "" "" "" "" >

Sep 25 10:08:31 cassandra-leading-eagle systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 25 10:08:34 cassandra-leading-eagle systemd[1]: Started PostgreSQL Cluster 17-main.

2025-09-25T10:07:19+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:32+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:44+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:07:44+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:57+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:10+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:08:10+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:23+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:32+00:00 WARNING Failed to fetch next seq; reconnecting...
2025-09-25T10:08:33+00:00 ERROR CONNECT/INIT failed; retry in 1s
2025-09-25T10:08:34+00:00 ERROR CONNECT/INIT failed; retry in 2s
2025-09-25T10:08:37+00:00 INFO Connected OK (schema/table ensured) host=postgresql-external-rw port=5432 db=wire-server sslmode=prefer client_id=d2152319-da41-4da0-94d8-0634c2d56683
2025-09-25T10:08:42+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:08:42+00:00 INFO PROBE SUMMARY: 5 successful probes, 1 errors in last 5 seconds
2025-09-25T10:08:54+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:09:08+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
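
To put a number on the interruption seen by the probe, the gap between the first WARNING/ERROR line and the next "Connected OK" line can be computed from a log like the one above. This is a rough sketch, not part of the PR: `outage_seconds` is a hypothetical helper, it assumes each line starts with an ISO 8601 timestamp as shown, and it relies on GNU `date -d` for parsing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: seconds between the first WARNING/ERROR line and the
# next "Connected OK" line in a probe log read from stdin.
# Prints nothing and returns 1 if no complete outage window is found.
outage_seconds() {
  local line ts start="" end=""
  while IFS= read -r line; do
    ts="${line%% *}"            # first space-separated field is the timestamp
    case "$line" in
      *" WARNING "*|*" ERROR "*)
        if [ -z "$start" ]; then start="$ts"; fi
        ;;
      *"Connected OK"*)
        if [ -n "$start" ]; then end="$ts"; break; fi
        ;;
    esac
  done
  if [ -z "$start" ] || [ -z "$end" ]; then
    return 1
  fi
  # GNU date is assumed here; other date implementations may not accept -d.
  echo $(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))
}
```

This only measures what the probe client observed around the service restart in this comment; it says nothing about the duration of a full failover.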

mohitrajain · 2025-09-25T10:56:23Z

@sghosh23 we should leave a note in the PostgreSQL documentation about maintenance of the postgresql service: repmgr must be stopped first, otherwise the state of the postgresql service can change during the maintenance.
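
For illustration, the ordering suggested here could look like the sketch below. It only prints the commands rather than running them. The unit names are taken from the status output earlier in this thread; the function name `maintenance_plan` is hypothetical, and whether detect-rogue-primary.timer also needs to be paused during maintenance is not covered in this conversation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not execute) the systemctl calls for a
# maintenance window, stopping repmgrd before PostgreSQL and starting it
# again only after PostgreSQL is back.
maintenance_plan() {
  local instance="${1:-17-main}"
  cat <<EOF
sudo systemctl stop repmgrd@${instance}.service
sudo systemctl stop postgresql@${instance}.service
# ... perform maintenance here ...
sudo systemctl start postgresql@${instance}.service
sudo systemctl start repmgrd@${instance}.service
EOF
}
```

Calling `maintenance_plan` with no argument uses the `17-main` instance seen above; another instance name can be passed as the first argument.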

mohitrajain · 2025-09-25T11:16:18Z

Can we please add documentation on how to reactivate a postgresql service that was masked by the detect-rogue-primary.timer? Also, let's mention the expected application downtime of about 4.5 mins when a failover happens.

sghosh23 · 2025-09-25T13:27:44Z

Can we please add documentation on how to reactivate a postgresql service that was masked by the detect-rogue-primary.timer? Also, let's mention the expected application downtime of about 4.5 mins when a failover happens.

As we already tested this part, I will add it to the doc.
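
Until that lands in the doc, a minimal sketch of the unmask step might look like this. The helper names are hypothetical; `systemctl is-enabled` and `systemctl unmask` are standard systemd commands. How the node should rejoin the cluster afterwards (for example as a standby) is not spelled out in this thread, so that part is left to the documented procedure.

```shell
#!/usr/bin/env bash
# Return success when a unit state string, as printed by
# `systemctl is-enabled <unit>`, means the unit is masked.
is_masked_state() {
  case "$1" in
    masked|masked-runtime) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: unmask the unit only if it is currently masked.
# It deliberately does not start the service.
unmask_if_masked() {
  local unit="$1" state
  # is-enabled exits non-zero for masked units, hence the `|| true`.
  state="$(systemctl is-enabled "$unit" 2>/dev/null || true)"
  if is_masked_state "$state"; then
    sudo systemctl unmask "$unit"
  fi
}
```

Usage on an affected node, assuming the unit name from earlier in this thread: `unmask_if_masked postgresql@17-main.service`.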

mohitrajain

Based on my testing (logged on the ticket), it looks good to me.

sonarqubecloud · 2025-10-02T14:37:17Z

Quality Gate failed

Failed conditions
10 Security Hotspots

See analysis details on SonarQube Cloud

add pg failover automation with repmgr

68db610

sghosh23 mentioned this pull request Aug 28, 2025

WPB-19318: Ensure high-availability of the postgress cluster #801

Closed

12 tasks

sghosh23 added 2 commits August 28, 2025 12:19

Add a drop-in to guard the primary auto start

79eb0ec

add monitoring to detect split-brain and organize the playbooks

743a97d

sghosh23 marked this pull request as ready for review August 29, 2025 11:50

sghosh23 requested review from julialongtin and a team as code owners August 29, 2025 11:50

sghosh23 added 7 commits August 29, 2025 14:32

Update postgresql configuration and documentation

ab07fc3

Update the doc

e09ac6a

Merge branch 'master' into wpb-19318-pg-ha

57964fb

fix: typo on repmger.conf and update playbooks

ee0a531

debug: test deployment

9321edd

skip demo and mini build for now

5e57636

fix: set the right dns-resolver

759a7cf

mohitrajain requested changes Sep 16, 2025

sghosh23 added 7 commits September 18, 2025 18:06

Merge branch 'master' into wpb-19318-pg-ha

b519d48

docs: Clarify Kubernetes integration architecture

bc4b4c3

Optimize the doc

86a6e60

Optimize the doc to have a cleaner order of texts

10391bf

Update postgres document with full command paths

0d6347c

sghosh23 added 2 commits September 25, 2025 16:05

fix the repmgr reconnect time and adjust doc

e39dc15

update document

d69f358

mohitrajain approved these changes Sep 26, 2025

sghosh23 added 3 commits September 26, 2025 15:55

add postrgresql-external values file for the CI

06ad1e7

add demo values

7885fe5

Merge branch 'master' into wpb-19318-pg-ha

3bef7d0

sghosh23 force-pushed the wpb-19318-pg-ha branch from 4a7d57b to 3bef7d0 Compare October 2, 2025 14:36
