@sghosh23 sghosh23 commented Aug 28, 2025

Please check out the system design doc for more info: https://wearezeta.atlassian.net/wiki/spaces/CUSSOPS/pages/2088108112/PostgreSQL+High+Availability+System+Design

Change type

  • Fix
  • Feature
  • Documentation
  • Security / Upgrade

Basic information

  • THIS CHANGE REQUIRES A DEPLOYMENT PACKAGE RELEASE
  • THIS CHANGE REQUIRES A WIRE-DOCS RELEASE

Testing

  • I ran/applied the changes myself, in a test environment.
  • The CI job attached to this repo will test it for me.

Tracking

  • I added a new entry in an appropriate subdirectory of changelog.d
  • I mentioned this PR in Jira, OR I mentioned the Jira ticket in this PR.
  • I mentioned this PR in one of the issues attached to one of our repositories.

Knowledge Transfer

  • An Asciinema session is attached to the Jira ticket.

Motivation

Objective

Reason

Use case

@sghosh23 sghosh23 marked this pull request as ready for review August 29, 2025 11:50
@sghosh23 sghosh23 requested review from julialongtin and a team as code owners August 29, 2025 11:50
…sive docs

- Consolidate PostgreSQL configuration into single unified template
- Fix split-brain detection script (correct 'rouge' to 'rogue' typo)
- Add detailed HA features documentation with failover validation
- Include monitoring & event system documentation
- Add node_id and priority configuration parameters
- Add official repmgr and PostgreSQL documentation references
- Improve deployment commands and monitoring checks
- Enhance split-brain protection with advanced features
- Remove duplicate HA features list from Key Concepts section
- Remove duplicate monitoring system section from Configuration Options
- Fix incorrect numbering in monitoring commands (5 → 8)
- Consolidate monitoring information into single comprehensive section
- Document that the PostgreSQL cluster runs independently and is not integrated with endpoint-manager
- Explain postgres-endpoint-manager as separate component that monitors cluster externally
- Emphasize independent operation of cluster vs endpoint management
@mohitrajain

Dumping the status of the services and their logs:

sudo systemctl status postgresql@17-main repmgrd@17-main detect-rogue-primary.timer -l --no-pager
● postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Wed 2025-09-24 12:45:01 UTC; 20h ago
   Main PID: 18744 (postgres)
      Tasks: 7 (limit: 4532)
     Memory: 107.8M
        CPU: 9min 54.399s
     CGroup: /system.slice/system-postgresql.slice/postgresql@17-main.service
             ├─18744 /usr/lib/postgresql/17/bin/postgres -D /var/lib/postgresql/17/main -c unix_socket_directories=/var/run/postgresql -c config_file=/etc/postgresql/17/main/postgresql.conf
             ├─18745 "postgres: logger "
             ├─18746 "postgres: checkpointer "
             ├─18747 "postgres: background writer "
             ├─18748 "postgres: startup recovering 00000001000000000000000B"
             ├─18749 "postgres: walreceiver streaming 0/B4E1D80"
             └─18835 "postgres: repmgr repmgr 10.1.1.6(48288) idle"

Sep 24 12:44:58 cassandra-warm-mackerel systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 24 12:45:01 cassandra-warm-mackerel systemd[1]: Started PostgreSQL Cluster 17-main.

● repmgrd@17-main.service - Repmgr failover daemon (instance 17-main)
     Loaded: loaded (/etc/systemd/system/repmgrd@.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2025-09-24 12:45:14 UTC; 20h ago
   Main PID: 18837 (repmgrd)
      Tasks: 1 (limit: 4532)
     Memory: 2.0M
        CPU: 19min 44.459s
     CGroup: /system.slice/system-repmgrd.slice/repmgrd@17-main.service
             └─18837 /usr/lib/postgresql/17/bin/repmgrd -f /etc/repmgr/17-main/repmgr.conf --daemonize

Sep 25 08:45:11 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 08:50:13 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 08:55:15 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:00:17 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:05:19 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:10:20 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:15:22 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:20:24 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:25:26 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:30:27 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state

● detect-rogue-primary.timer - PostgreSQL Split-Brain Detection Timer
     Loaded: loaded (/etc/systemd/system/detect-rogue-primary.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Wed 2025-09-24 12:45:29 UTC; 20h ago
    Trigger: Thu 2025-09-25 09:33:30 UTC; 12s left
   Triggers: ● detect-rogue-primary.service
       Docs: man:systemd.timer(5)

Sep 24 12:45:29 cassandra-warm-mackerel systemd[1]: Stopping PostgreSQL Split-Brain Detection Timer...
Sep 24 12:45:29 cassandra-warm-mackerel systemd[1]: Started PostgreSQL Split-Brain Detection Timer.

@mohitrajain

sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show
 ID | Name        | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string                                                                
----+-------------+---------+-----------+-------------+----------+----------+----------+-----------------------------------------------------------------------------------
 1  | postgresql1 | primary | * running |             | default  | 150      | 1        | host=10.1.1.8 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 2  | postgresql2 | standby |   running | postgresql1 | default  | 100      | 1        | host=10.1.1.7 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 3  | postgresql3 | standby |   running | postgresql1 | default  | 50       | 1        | host=10.1.1.6 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
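A healthy cluster should report exactly one running primary in this table. A minimal sketch of an automated check, assuming the pipe-separated layout shown above (third column Role, fourth column Status); the function name `count_running_primaries` is hypothetical and not part of repmgr or of this PR:

```shell
#!/usr/bin/env bash
# Count the nodes that `repmgr cluster show` reports as a running primary.
# Reads the table from stdin. Assumption: column 3 is Role and column 4 is
# Status, as in the output pasted above.
count_running_primaries() {
  awk -F'|' '
    {
      role = $3; status = $4
      gsub(/[ *]/, "", role)      # strip padding
      gsub(/[ *]/, "", status)    # strip padding and the "*" marker
    }
    role == "primary" && status == "running" { n++ }
    END { print n + 0 }           # prints 0 when no row matched
  '
}

# Usage on a cluster node:
#   sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show \
#     | count_running_primaries
```

Any result other than 1 would warrant a closer look. This is only a coarse check: it does not interpret the other status markers repmgr can print, and it is not the detection logic used by detect-rogue-primary.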

@mohitrajain

mohitrajain commented Sep 25, 2025

repmgr brings the postgresql service back up if it finds it stopped:

sudo systemctl stop postgresql@17-main.service 
root@cassandra-leading-eagle:~# sudo systemctl status postgresql@17-main.service 
○ postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Thu 2025-09-25 10:08:27 UTC; 1s ago
    Process: 177096 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 17-main stop (code>
   Main PID: 24886 (code=exited, status=0/SUCCESS)
        CPU: 35min 18.968s

Sep 24 12:44:06 cassandra-leading-eagle systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 24 12:44:09 cassandra-leading-eagle systemd[1]: Started PostgreSQL Cluster 17-main.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: Stopping PostgreSQL Cluster 17-main...
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: postgresql@17-main.service: Deactivated successfully.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: Stopped PostgreSQL Cluster 17-main.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: postgresql@17-main.service: Consumed 35min 18.968s C>
root@cassandra-leading-eagle:~# sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show
 ID | Name        | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string                                                                
----+-------------+---------+-----------+-------------+----------+----------+----------+-----------------------------------------------------------------------------------
 1  | postgresql1 | primary | * running |             | default  | 150      | 1        | host=10.1.1.8 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 2  | postgresql2 | standby |   running | postgresql1 | default  | 100      | 1        | host=10.1.1.7 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 3  | postgresql3 | standby |   running | postgresql1 | default  | 50       | 1        | host=10.1.1.6 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
root@cassandra-leading-eagle:~# sudo systemctl status postgresql@17-main.service 
● postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2025-09-25 10:08:34 UTC; 19s ago
    Process: 177113 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 17-main start (code=exite>
   Main PID: 177118 (postgres)
      Tasks: 14 (limit: 4532)
     Memory: 39.6M
        CPU: 1.399s
     CGroup: /system.slice/system-postgresql.slice/postgresql@17-main.service
             ├─177118 /usr/lib/postgresql/17/bin/postgres -D /var/lib/postgresql/17/main -c unix_socket_>
             ├─177119 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177120 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177121 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >
             ├─177123 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177124 "postgres: autovacuum launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ">
             ├─177125 "postgres: archiver " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >
             ├─177126 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" ">
             ├─177127 "postgres: walsender repmgr 10.1.1.6(41268) streaming 0/C005EB0" "" "" "" "" "" "">
             ├─177129 "postgres: walsender repmgr 10.1.1.7(47974) streaming 0/C005EB0" "" "" "" "" "" "">
             ├─177160 "postgres: repmgr repmgr 10.1.1.6(34430) idle" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177163 "postgres: repmgr repmgr 10.1.1.7(47666) idle" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177165 "postgres: repmgr repmgr 10.1.1.8(50260) idle" "" "" "" "" "" "" "" "" "" "" "" "">
             └─177224 "postgres: wire-server wire-server 10.1.1.15(7997) authentication" "" "" "" "" "" >

Sep 25 10:08:31 cassandra-leading-eagle systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 25 10:08:34 cassandra-leading-eagle systemd[1]: Started PostgreSQL Cluster 17-main.
2025-09-25T10:07:19+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:32+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:44+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:07:44+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:57+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:10+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:08:10+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:23+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:32+00:00 WARNING Failed to fetch next seq; reconnecting...
2025-09-25T10:08:33+00:00 ERROR CONNECT/INIT failed; retry in 1s
2025-09-25T10:08:34+00:00 ERROR CONNECT/INIT failed; retry in 2s
2025-09-25T10:08:37+00:00 INFO Connected OK (schema/table ensured) host=postgresql-external-rw port=5432 db=wire-server sslmode=prefer client_id=d2152319-da41-4da0-94d8-0634c2d56683
2025-09-25T10:08:42+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:08:42+00:00 INFO PROBE SUMMARY: 5 successful probes, 1 errors in last 5 seconds
2025-09-25T10:08:54+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:09:08+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
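In the probe log above, the client loses its connection at 10:08:32 and reconnects at 10:08:37, with one probe error recorded. For longer test runs, the error counts can be totalled rather than read by eye. A minimal sketch, assuming the `PROBE SUMMARY` line format shown above; the function name `sum_probe_errors` and the file name `probe.log` are hypothetical:

```shell
#!/usr/bin/env bash
# Sum the error counts from "PROBE SUMMARY" lines read on stdin.
# Assumption: lines look like
#   <timestamp> INFO PROBE SUMMARY: N successful probes, M errors in last 5 seconds
sum_probe_errors() {
  awk '/PROBE SUMMARY:/ {
         for (i = 2; i <= NF; i++)
           if ($i == "errors") total += $(i - 1)   # the count precedes the word "errors"
       }
       END { print total + 0 }'
}

# Usage, with the probe output saved to a file:
#   sum_probe_errors < probe.log
```

Lines that are not probe summaries, such as the `CONNECT/INIT failed` retries, are ignored by this count.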

@mohitrajain

@sghosh23 we should add a note to the PostgreSQL documentation on maintaining the postgresql service: repmgr must be stopped first, otherwise the state of the postgresql service can change during the maintenance.
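
The ordering this note asks for could be sketched as follows. This is an illustration only, assuming the component that restarts PostgreSQL is the `repmgrd@17-main` unit shown earlier in this thread; the function names and the `SYSTEMCTL` override are hypothetical, and the final documented procedure may differ:

```shell
#!/usr/bin/env bash
# Maintenance-window sketch: stop repmgrd before PostgreSQL so nothing
# restarts the database mid-maintenance, and start them in reverse order.
# SYSTEMCTL can be overridden for a dry run, e.g. SYSTEMCTL=echo.
SYSTEMCTL="${SYSTEMCTL:-sudo systemctl}"

begin_maintenance() {
  # $SYSTEMCTL is left unquoted on purpose so "sudo systemctl" splits into two words.
  $SYSTEMCTL stop repmgrd@17-main.service &&
  $SYSTEMCTL stop postgresql@17-main.service
}

end_maintenance() {
  $SYSTEMCTL start postgresql@17-main.service &&
  $SYSTEMCTL start repmgrd@17-main.service
}

# Usage:
#   begin_maintenance    # then perform the maintenance work
#   end_maintenance
```

Running with `SYSTEMCTL=echo` prints the commands instead of executing them, which is a cheap way to review the order before touching a live node.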

@mohitrajain

Can we please add documentation on how to reactivate a postgresql service that was masked by the detect-rogue-primary.timer? Also, let's mention the expected application downtime of about 4.5 minutes when a failover happens.

@sghosh23

Can we please add documentation on how to reactivate a postgresql service that was masked by the detect-rogue-primary.timer? Also, let's mention the expected application downtime of about 4.5 minutes when a failover happens.

We have already tested this part, so I will add it to the doc.
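
At the systemd level, a masked unit is brought back with `systemctl unmask` followed by `systemctl start`. A minimal sketch, assuming the masked unit is `postgresql@17-main.service`; the function name and the `SYSTEMCTL` override are hypothetical. It deliberately does not cover the cluster side: which node is currently primary should be confirmed first, and whether the node has to be rejoined as a standby before starting is not settled in this thread.

```shell
#!/usr/bin/env bash
# Re-enable a PostgreSQL unit that was masked after a rogue primary was detected.
# Assumption: the unit name below matches the one used on the affected node.
# SYSTEMCTL can be overridden for a dry run, e.g. SYSTEMCTL=echo.
SYSTEMCTL="${SYSTEMCTL:-sudo systemctl}"
UNIT="postgresql@17-main.service"

reactivate_postgresql() {
  $SYSTEMCTL unmask "$UNIT" &&      # remove the mask so the unit can be started again
  $SYSTEMCTL start "$UNIT" &&
  $SYSTEMCTL is-active "$UNIT"      # prints "active" when the start succeeded
}

# Usage, only after confirming the current primary with `repmgr cluster show`:
#   reactivate_postgresql
```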


@mohitrajain mohitrajain left a comment


Based on my testing (logged on the ticket), it looks good to me.

sonarqubecloud bot commented Oct 2, 2025

Quality Gate failed

Failed conditions
10 Security Hotspots

See analysis details on SonarQube Cloud
