Skip to content

A complete automation framework for deploying a highly available PostgreSQL database on Azure Kubernetes Service with Premium v2 storage, CloudNativePG operator, and PgBouncer connection pooling.

License

Notifications You must be signed in to change notification settings

jonathan-vella/azure-postgresql-ha-aks-workshop

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸš€ Azure PostgreSQL HA on AKS Workshop

Version: v1.0.0 | License: MIT | Status: Lab & PoC Ready

A complete automation framework for deploying a highly available PostgreSQL database on Azure Kubernetes Service with Premium v2 storage, CloudNativePG operator, and PgBouncer connection pooling.

⚠️ IMPORTANT: Lab and Proof-of-Concept Use Only
This code is provided strictly for lab environments and proof-of-concept purposes only. It is not intended for production use. Additional hardening, security reviews, compliance validation, and operational procedures are required before considering any production deployment.

Version Status License PostgreSQL AKS CNPG Azure HA Performance


πŸ—οΈ Architecture Overview

PostgreSQL HA on AKS with PgBouncer

Architecture Diagram

graph TB
    subgraph "Azure Subscription"
        subgraph "Virtual Network (10.0.0.0/8)"
            subgraph "AKS Cluster (1.32)"
                subgraph "System Node Pool (2x D2s_v5)"
                    CNPG["CNPG Operator<br/>(cnpg-system)"]
                    INF["Prometheus<br/>Monitoring"]
                end
                
                subgraph "Connection Pooling Layer"
                    PGB1["PgBouncer Pod 1<br/>Transaction Mode<br/>10K Connections"]
                    PGB2["PgBouncer Pod 2<br/>Transaction Mode<br/>10K Connections"]
                    PGB3["PgBouncer Pod 3<br/>Transaction Mode<br/>10K Connections"]
                end
                
                subgraph "PostgreSQL Node Pool (3x E8as_v6)"
                    PG1["PostgreSQL Primary<br/>Instance 1<br/>200GB Data + WAL<br/>40K IOPS"]
                    PG2["PostgreSQL Sync Replica<br/>Instance 2 (Quorum)<br/>200GB Data + WAL<br/>40K IOPS"]
                    PG3["PostgreSQL Async Replica<br/>Instance 3<br/>200GB Data + WAL<br/>40K IOPS"]
                end
                
                subgraph "Kubernetes Services"
                    SVC_POOL_RW["Service: pg-primary-pooler-rw<br/>(PgBouncer Read-Write)<br/>Port 5432"]
                    SVC_POOL_RO["Service: pg-primary-pooler-ro<br/>(PgBouncer Read-Only)<br/>Port 5432"]
                    SVC_RW["Service: pg-primary-rw<br/>(Direct Read-Write)<br/>Port 5432"]
                    SVC_RO["Service: pg-primary-ro<br/>(Direct Read-Only)<br/>Port 5432"]
                end
            end
            
            SVC_POOL_RW --> PGB1 & PGB2 & PGB3
            SVC_POOL_RO --> PGB1 & PGB2 & PGB3
            PGB1 & PGB2 & PGB3 -.->|Connection Pool| PG1
            PGB1 & PGB2 & PGB3 -.->|Connection Pool| PG2 & PG3
            PG1 ===|Sync Replication<br/>RPO=0| PG2
            PG1 ---|Async Replication| PG3
            SVC_RW --> PG1
            SVC_RO --> PG2 & PG3
        end
        
        subgraph "Storage & Backup"
            SA["Azure Storage Account<br/>(ZRS)<br/>Blob Backups"]
            LA["Log Analytics<br/>Workspace"]
        end
        
        subgraph "Monitoring"
            GRAF["Azure Managed Grafana<br/>Instance"]
            AMW["Azure Monitor<br/>Workspace"]
        end
        
        subgraph "Network Security"
            NSG["Network Security Group<br/>- Kubernetes API: 443<br/>- PostgreSQL: 5432"]
            MI["Managed Identity<br/>(Workload Identity)"]
        end
        
        PG1 & PG2 & PG3 -->|WAL Archive + Backups| SA
        CNPG & PG1 & PG2 & PG3 -->|Metrics| AMW
        AMW --> GRAF
        MI -->|Auth to Storage| SA
        NSG -.->|Security Rules| PG1 & PG2 & PG3
    end
    
    style PG1 fill:#336791,stroke:#2d5a7b,color:#fff
    style PG2 fill:#336791,stroke:#2d5a7b,color:#fff
    style PG3 fill:#336791,stroke:#2d5a7b,color:#fff
    style PGB1 fill:#47a8bd,stroke:#358a9c,color:#fff
    style PGB2 fill:#47a8bd,stroke:#358a9c,color:#fff
    style PGB3 fill:#47a8bd,stroke:#358a9c,color:#fff
    style SA fill:#0078d4,stroke:#0062a3,color:#fff
    style GRAF fill:#ff9830,stroke:#d67f1a,color:#fff
    style AMW fill:#0078d4,stroke:#0062a3,color:#fff
    style MI fill:#7fba00,stroke:#6d9b00,color:#fff
    style NSG fill:#ff6b6b,stroke:#e63946,color:#fff
Loading

✨ Key Features

πŸ”§ Infrastructure & Deployment

  • Full Automation: Pure Azure CLI scripts following Microsoft reference implementation
  • Separate Node Pools: 2 system nodes (D4s_v5) + 3 user nodes (E8as_v6) for workload isolation
  • Zone Redundancy: Deployment across 3 Azure Availability Zones
  • Premium Storage: Premium SSD v2 with 40K IOPS & 1,250 MB/s per disk (200 GiB)
  • DevContainer Ready: Pre-configured environment with all tools installed

πŸ›‘οΈ High Availability & Reliability

  • 3-Node Cluster: 1 primary + 1 quorum sync replica + 1 async replica
  • Automatic Failover: <10 second RTO with zero data loss (RPO = 0)
  • Data Durability: Synchronous replication with remote_apply guarantee
  • Connection Pooling: 3 PgBouncer instances handling 10,000+ concurrent connections
  • Health Monitoring: Automated health checks with self-healing capabilities

πŸ“Š Performance & Scalability

  • Target Throughput: Optimized for 8,000-10,000 TPS
  • Dynamic Resources: PostgreSQL parameters auto-calculate from memory allocation
  • Efficient Pooling: Transaction-mode pooling for optimal connection management
  • Load Balancing: Automatic read distribution across replicas

πŸ” Security & Compliance

  • Workload Identity: Federated credentials (no secrets in pods)
  • Authentication: SCRAM-SHA-256 password encryption
  • Network Security: NSGs, private networking, NAT Gateway
  • Encryption: At-rest and in-transit encryption
  • RBAC: Kubernetes role-based access control

πŸ“ˆ Observability & Operations

  • Grafana Dashboards: Pre-built dashboard with 9 monitoring panels
  • Prometheus Metrics: Real-time cluster health and performance metrics
  • Azure Monitor: Centralized log aggregation and alerting
  • CloudNativePG: 1.27.1 operator for automated lifecycle management

πŸ’Ύ Backup & Recovery

  • Automated Backups: WAL archiving + base backups to Azure Blob Storage
  • 7-Day Retention: Configurable backup retention policies
  • Point-in-Time Recovery: PITR capability via WAL archives
  • Geo-Redundancy: Optional GRS for disaster recovery

πŸš€ Quick Start

Option A: Use DevContainer (Recommended, Tested, Validated) 🐳

All tools pre-installed in isolated container with auto-generated environment:

# Requirements: Docker Desktop + VS Code Remote - Containers extension
# 1. Open project in VS Code
# 2. Ctrl+Shift+P -> "Dev Containers: Reopen in Container"
# 3. Wait for build (2-5 min first time)
# 4. .env file auto-generated with unique resource names
# 5. Tools ready: az, kubectl, helm, jq, bc, psql, netcat, kubectl-cnpg (v1.27.1)

Key Features:

  • Auto-generates .env with unique suffix and resource names
  • CNPG kubectl plugin v1.27.1 pre-installed
  • PostgreSQL client (psql) for testing
  • Network tools (netcat) for connectivity testing
  • Calculator (bc) for pgbench metrics

See .devcontainer/README.md for detailed setup.

Option B: Local Installation (Not tested)

Prerequisites:

  • Azure CLI (v2.56+), kubectl (v1.21+), Helm (v3.0+), jq, OpenSSL
  • Azure subscription with Owner or User Access Administrator role
  • Region with Premium v2 disk support

1️⃣ Configure

Option A: DevContainer (Recommended)

# .env is auto-generated when container starts
# Contains unique suffix and all resource names
source .env

# Verify configuration
echo "Resource Group: $RESOURCE_GROUP_NAME"
echo "AKS Cluster: $AKS_PRIMARY_CLUSTER_NAME"

# Optional: Regenerate with new suffix
./scripts/regenerate-env.sh

Option B: Manual Setup

# Clone repository
git clone <repo-url>
cd azure-postgresql-ha-aks-workshop

# Review and customize environment variables
code config/environment-variables.sh

2️⃣ Deploy

Using DevContainer:

# Load auto-generated environment variables
source .env

# Deploy all components (8 automated steps)
./scripts/deploy-all.sh

Using Manual Setup:

# Load environment variables into current shell session
# This makes all configuration values available to deployment scripts
source config/environment-variables.sh

# Deploy all components (8 automated steps)
./scripts/deploy-all.sh

What does this do? The source command loads all configuration variables (like resource names, regions, VM sizes) into your current terminal session. This allows the deployment scripts to access these values without hardcoding them. In DevContainer, .env is auto-generated with unique resource names; otherwise, use config/environment-variables.sh.

3️⃣ Verify

# Get cluster credentials
az aks get-credentials --resource-group <rg-name> --name <cluster-name>

# Check status
kubectl cnpg status pg-primary -n cnpg-database

# View pods
kubectl get pods -n cnpg-database -l cnpg.io/cluster=pg-primary

4️⃣ Validate Deployment

# Run comprehensive cluster validation (in-cluster Kubernetes Job)
./scripts/07a-run-cluster-validation.sh

What gets validated:

  • βœ… Primary and replica connectivity (100% pass rate)
  • βœ… PgBouncer connection pooling (3 instances)
  • βœ… Data write operations and replication consistency
  • βœ… Read-only service routing to replicas
  • βœ… Replication health and accessibility
  • βœ… Multi-connection concurrency testing
  • ⚑ Completes in ~7 seconds (in-cluster execution)

5️⃣ Connect

# Option 1: Connect via PgBouncer (Recommended for Applications)
kubectl port-forward svc/pg-primary-pooler-rw 5432:5432 -n cnpg-database &
psql -h localhost -U app -d appdb

# Option 2: Direct connection to PostgreSQL
kubectl port-forward svc/pg-primary-rw 5432:5432 -n cnpg-database &
psql -h localhost -U app -d appdb

Why use PgBouncer?

  • Handles 10,000+ concurrent connections efficiently
  • Reduces PostgreSQL connection overhead
  • Transaction-level pooling for optimal performance
  • Automatic load distribution across replicas

πŸ“‹ Documentation

Document Description
πŸ“– SETUP_COMPLETE.md πŸ‘ˆ START HERE - Complete setup guide
⚑ QUICK_REFERENCE.md Command cheat sheet
πŸ’° COST_ESTIMATION.md Hourly/monthly cost breakdown (~$2,873/month)
πŸ“Š GRAFANA_DASHBOARD_GUIDE.md Dashboard usage and metrics
πŸ”„ FAILOVER_TESTING.md High availability testing
🎯 CNPG_BEST_PRACTICES.md CloudNativePG 1.27 production best practices

βš™οΈ Configuration

.env                           - Auto-generated (DevContainer only, gitignored)
    - Unique suffix for resource names
    - All Azure resource names pre-configured
    - Generated at devcontainer startup

config/
└── environment-variables.sh   - Bash environment configuration template
    - Resource names with random suffix
    - AKS settings (version, VM sizes)
    - Storage configuration (IOPS, throughput)
    - PostgreSQL parameters
    - Auto-detect public IP for firewall

πŸš€ Deployment Scripts

scripts/
β”œβ”€β”€ deploy-all.sh                       - Master orchestration script (8 steps with logging)
β”œβ”€β”€ regenerate-env.sh                   - ⭐ Regenerate .env with new suffix (DevContainer)
β”œβ”€β”€ setup-prerequisites.sh              - ⭐ Install required tools (non-DevContainer)
β”œβ”€β”€ 02-create-infrastructure.sh         - Creates Azure resources (RG, AKS, Storage, Identity, Bastion, NAT Gateway, Container Insights)
β”œβ”€β”€ 03-configure-workload-identity.sh   - Sets up federated credentials
β”œβ”€β”€ 04-deploy-cnpg-operator.sh          - Installs CloudNativePG operator v1.27.1 via Helm
β”œβ”€β”€ 04a-install-barman-cloud-plugin.sh  - Installs Barman Cloud Plugin v0.8.0 for backup/restore
β”œβ”€β”€ 05-deploy-postgresql-cluster.sh     - Deploys PostgreSQL cluster + PgBouncer pooler + PodMonitor
β”œβ”€β”€ 06-configure-monitoring.sh          - Configures Azure Managed Grafana
β”œβ”€β”€ 06a-configure-azure-monitor-prometheus.sh - Configures Azure Monitor Managed Prometheus
β”œβ”€β”€ 06b-import-grafana-dashboard.sh     - ⭐ Import Grafana dashboard automatically
β”œβ”€β”€ 07-display-connection-info.sh       - Displays connection endpoints and credentials
└── 07a-run-cluster-validation.sh       - ⭐ In-cluster validation (100% pass rate, 7s execution)

βš™οΈ Kubernetes Reference

kubernetes/
└── postgresql-cluster.yaml - Reference manifest (NOT used in deployment)
    - See scripts/05-deploy-postgresql-cluster.sh for actual deployment
    - Configuration values loaded from environment variables

οΏ½ Repository Structure

πŸ“¦ azure-postgresql-ha-aks-workshop/
β”œβ”€β”€ πŸ“„ README.md                        # Main project documentation
β”œβ”€β”€ πŸ“„ 00_START_HERE.md                 # Quick start guide
β”œβ”€β”€ πŸ“„ CONTRIBUTING.md                  # Contribution guidelines
β”œβ”€β”€ πŸ“„ LICENSE                          # MIT License
β”‚
β”œβ”€β”€ πŸ“‚ config/                          # Configuration files
β”‚   └── environment-variables.sh        # Bash environment config
β”‚
β”œβ”€β”€ πŸ“‚ scripts/                         # Deployment automation
β”‚   β”œβ”€β”€ deploy-all.sh                   # Master orchestration (8 steps)
β”‚   β”œβ”€β”€ 02-create-infrastructure.sh     # Azure resources + Container Insights
β”‚   β”œβ”€β”€ 03-configure-workload-identity.sh
β”‚   β”œβ”€β”€ 04-deploy-cnpg-operator.sh
β”‚   β”œβ”€β”€ 04a-install-barman-cloud-plugin.sh
β”‚   β”œβ”€β”€ 05-deploy-postgresql-cluster.sh
β”‚   β”œβ”€β”€ 06-configure-monitoring.sh      # Grafana
β”‚   β”œβ”€β”€ 06a-configure-azure-monitor-prometheus.sh # Azure Monitor
β”‚   β”œβ”€β”€ 07-display-connection-info.sh
β”‚   └── 07a-run-cluster-validation.sh   # In-cluster validation
β”‚
β”œβ”€β”€ πŸ“‚ kubernetes/                      # K8s manifests
β”‚   β”œβ”€β”€ postgresql-cluster.yaml         # Reference manifest
β”‚   └── cluster-validation-job.yaml     # In-cluster validation Job
β”‚
β”œβ”€β”€ πŸ“‚ grafana/                         # Grafana dashboards
β”‚   └── grafana-cnpg-ha-dashboard.json  # PostgreSQL HA dashboard
β”‚
β”œβ”€β”€ πŸ“‚ docs/                            # Comprehensive documentation
β”‚   β”œβ”€β”€ README.md                       # Full technical guide
β”‚   β”œβ”€β”€ SETUP_COMPLETE.md               # πŸ‘ˆ Start here
β”‚   β”œβ”€β”€ QUICK_REFERENCE.md              # Command cheat sheet
β”‚   β”œβ”€β”€ COST_ESTIMATION.md              # Budget planning
β”‚   β”œβ”€β”€ PRE_DEPLOYMENT_CHECKLIST.md     # Pre-flight checks
β”‚   β”œβ”€β”€ AZURE_MONITORING_SETUP.md       # Monitoring setup
β”‚   β”œβ”€β”€ GRAFANA_DASHBOARD_GUIDE.md      # Dashboard usage
β”‚   β”œβ”€β”€ IMPORT_DASHBOARD_NOW.md         # Dashboard import
β”‚   β”œβ”€β”€ FAILOVER_TESTING.md             # HA testing
β”‚   └── VM_SETUP_GUIDE.md               # Load test VM
β”‚
└── πŸ“‚ .github/
    └── copilot-instructions.md         # AI assistant context

πŸŽ“ How to Use This Project

Phase 1: Understanding (10 mins)

  1. Read docs/SETUP_COMPLETE.md - Overview and prerequisites
  2. Review docs/QUICK_REFERENCE.md - Command reference
  3. Check docs/COST_ESTIMATION.md - Budget planning
  4. Skim docs/README.md - Full capabilities

Phase 2: Preparation (15 mins)

  1. Verify prerequisites installed (az, kubectl, helm, jq)
  2. Update config/environment-variables.sh
  3. Change PostgreSQL password in environment variables
  4. Verify region support for Premium v2

Phase 3: Deployment (20 mins)

  1. Load environment: source config/environment-variables.sh
  2. Run ./scripts/deploy-all.sh
  3. Monitor deployment progress (7 automated steps)
  4. Verify cluster health

Phase 4: Validation (10 mins)

  1. Check pods are running
  2. Test PostgreSQL connection
  3. Verify backups to storage
  4. Access Grafana dashboard
  5. Run pgbench test: ./scripts/08-test-pgbench.sh

Phase 5: Operation (Ongoing)

  1. Monitor cluster metrics
  2. Test backup/restore
  3. Scale as needed
  4. Apply updates

οΏ½ Connection Pooling with PgBouncer

Architecture

The deployment includes 3 PgBouncer instances for high-availability connection pooling:

Component Configuration
Instances 3 pods with pod anti-affinity (different nodes)
Mode Transaction pooling (optimal for OLTP workloads)
Max Connections 10,000 client connections per instance
Pool Size 25 PostgreSQL connections per user/database
Total Capacity 30,000 concurrent client connections across all instances

Services

# PgBouncer services (Recommended)
pg-primary-pooler-rw    # Read-write via connection pool
pg-primary-pooler-ro    # Read-only via connection pool

# Direct PostgreSQL services
pg-primary-rw           # Direct read-write (no pooling)
pg-primary-ro           # Direct read-only (no pooling)

When to Use PgBouncer

βœ… Use PgBouncer for:

  • Applications with many short-lived connections
  • Microservices architectures
  • Serverless workloads (Azure Functions, AWS Lambda)
  • Connection-heavy applications (10K+ connections)
  • High-availability workloads requiring connection efficiency

⚠️ Direct connections for:

  • Long-running analytical queries
  • Database administration tasks
  • Schema migrations
  • Backup/restore operations

Connection Examples

# Via PgBouncer (Applications)
psql "host=pg-primary-pooler-rw.cnpg-database.svc.cluster.local port=5432 dbname=appdb user=app"

# Direct (Admin tasks)
psql "host=pg-primary-rw.cnpg-database.svc.cluster.local port=5432 dbname=appdb user=app"

οΏ½πŸ“Š What Gets Deployed

Azure Resources

  • βœ… Resource Group
  • βœ… Virtual Network (10.0.0.0/8)
  • βœ… Network Security Group
  • βœ… AKS Cluster (1.32)
    • System node pool: 2 x Standard_D2s_v5
    • Postgres node pool: 3 x Standard_E8as_v6
  • βœ… Managed Identity (Workload Identity)
  • βœ… Storage Account (ZRS, Standard_V2)
  • βœ… Log Analytics Workspace
  • βœ… Managed Grafana Instance

Kubernetes Resources

  • βœ… CNPG Operator (cnpg-system namespace)
  • βœ… PostgreSQL Cluster (cnpg-database namespace)
    • 3 PostgreSQL instances (48 GiB RAM, 6 vCPU each)
    • 3 PgBouncer pooler instances (transaction mode, 10K max connections)
    • 200GB data storage per instance
    • Premium SSD v2 disks (40,000 IOPS, 1,250 MB/s per disk)
    • Expected performance: 8,000-10,000 TPS sustained
  • βœ… StorageClass (managed-csi-premium-v2)
  • βœ… Services (pooler read-write, pooler read-only, direct read-write, direct read-only)
  • βœ… ConfigMaps & Secrets
  • βœ… PersistentVolumeClaims

Features Enabled

  • βœ… High Availability (automatic failover)
  • βœ… Zone Redundancy (across 3 AZs)
  • βœ… Workload Identity (secure auth)
  • βœ… Backup to Azure Storage
  • βœ… Point-in-Time Recovery (7 days)
  • βœ… WAL compression (lz4)
  • βœ… Monitoring (Prometheus + Grafana)
  • βœ… Health checks (automatic)

πŸ” Security Features

Feature Implementation
Authentication Workload Identity + SCRAM-SHA-256
Network NSGs + Network Policies (Cilium)
Secrets No hardcoded secrets in pods
RBAC Kubernetes + Azure RBAC enabled
Encryption Storage encrypted at rest
Backups No public access, encrypted
Isolation Dedicated namespaces

πŸ’Ύ Storage Options

Premium SSD v2 (Default - Optimized for High Performance)

  • IOPS: 40,000 per disk (configurable 3,100-80,000)
  • Throughput: 1,250 MB/s per disk (configurable 125-1,200 MB/s)
  • Capacity: 200 GiB per instance
  • Benefits: Excellent price-performance for high-TPS workloads (8-10K TPS)
  • Regions: swedencentral, westeurope, eastus, canadacentral, etc.

Premium SSD (Alternative)

  • IOPS: Fixed per disk size (lower than Premium v2)
  • Throughput: Fixed per disk size (lower than Premium v2)
  • Benefits: Widely available, proven performance
  • Tradeoff: Less cost-efficient and lower IOPS than Premium v2

Local NVMe (Ultra-High Performance - Future Migration)

  • IOPS: 400K+ per disk (Standard_L8s_v3)
  • Throughput: 2,000+ MB/s
  • Benefits: Sub-millisecond latency, 50K+ TPS capability
  • Tradeoff: Requires Azure Container Storage, higher cost
  • Use Case: Extreme transactional workloads (see Step 5 documentation)

πŸ”§ Configuration Overview

Key Parameters to Adjust

In config/environment-variables.sh:

# Azure settings
PRIMARY_CLUSTER_REGION="swedencentral"
AKS_CLUSTER_VERSION="1.32"

# VM sizes (Standard_E8as_v6: 8 vCPU, 64 GiB RAM, AMD EPYC 9004 @ 3.7 GHz)
SYSTEM_NODE_POOL_VMSKU="Standard_D2s_v5"
USER_NODE_POOL_VMSKU="Standard_E8as_v6"

# Storage (Premium SSD v2 - Optimized for 10K TPS)
DISK_IOPS="40000"              # Max Premium SSD v2 IOPS
DISK_THROUGHPUT="1200"         # Max Premium SSD v2 throughput (MB/s)
PG_STORAGE_SIZE="200Gi"        # Increased for better performance

# PostgreSQL (Optimized for Standard_E8as_v6)
PG_DATABASE_NAME="appdb"
PG_DATABASE_USER="app"
PG_DATABASE_PASSWORD="SecurePassword123!"  # Change this!
PG_MEMORY="48Gi"               # 75% of 64 GiB available on E8as_v6
PG_CPU="6"                     # 75% of 8 vCPUs available on E8as_v6

# CNPG version (Operator 1.27.1)
CNPG_VERSION="0.26.1"

All configuration is centralized in environment variables - no need to edit multiple files.


πŸ“ˆ Monitoring & Observability

Azure Monitor

  • Application Insights integration
  • Container Insights (AKS logs)
  • Performance metrics

Prometheus + Grafana

  • PostgreSQL metrics via PodMonitor
  • Cluster health dashboards
  • Performance visualization
  • Alert capabilities

Key Metrics

# PostgreSQL Metrics
pg_up                                   # Database health
pg_stat_replication_lag_bytes            # Replication lag
pg_database_size_bytes                   # Database size
pg_wal_archive_status                    # Backup status

# PgBouncer Metrics
pgbouncer_pools_cl_active               # Active client connections
pgbouncer_pools_sv_active               # Active server connections
pgbouncer_pools_maxwait                 # Connection pool wait time
pgbouncer_pools_cl_waiting              # Queued client connections

# Infrastructure Metrics
node_memory_MemAvailable_bytes           # Node memory

🚨 Critical Prerequisites

Tools

  • Azure CLI (v2.56+)
  • kubectl (v1.21+)
  • Helm (v3.0+)
  • jq (v1.5+)
  • OpenSSL (v3.3+)
  • Krew + CNPG plugin

Azure Requirements

  • Subscription with appropriate quota
  • Permissions: Owner or User Access Administrator
  • Region with Premium v2 support

Before Deployment

  • Change default passwords
  • Verify region support
  • Check subscription quota
  • Update managed identity references
  • Review cost implications

βœ… Deployment Checklist

Before deployment:

  • Prerequisites installed
  • Configuration reviewed
  • Passwords changed
  • Region selected
  • Quota verified

After deployment:

  • Cluster created
  • Pods running (3 PostgreSQL + 3 PgBouncer instances)
  • Storage provisioned
  • Backups to storage
  • Grafana accessible
  • Connection successful (both direct and pooled)

πŸ“ž Support & Troubleshooting

Quick Diagnostics

# Check operator
kubectl logs -n cnpg-system deployment/cnpg-cloudnative-pg

# Check cluster status
kubectl cnpg status pg-primary -n cnpg-database

# Check all pods (PostgreSQL + PgBouncer)
kubectl get pods -n cnpg-database

# Check PgBouncer logs
kubectl logs -n cnpg-database -l cnpg.io/poolerName=pg-primary-pooler

# Check storage
kubectl get pvc -n cnpg-database

# Check backups
az storage blob list --account-name <account> --container-name backups

# Test performance
./scripts/08-test-pgbench.sh

Common Issues

  1. Pods stuck in Init: Check PVC binding and storage quota
  2. WAL archiving fails: Verify managed identity permissions
  3. Operator not deploying: Check Helm repository and CRDs
  4. Premium v2 unavailable: Check region support

See docs/README.md for detailed troubleshooting.


πŸ“š Learning Path

  1. Understand the basics

    • Read: docs/SETUP_COMPLETE.md
    • Review: docs/README.md
  2. Explore configuration

    • Edit: config/deployment-config.json
    • Review: kubernetes/postgresql-cluster.yaml
  3. Deploy to Azure

    • Run: scripts/deploy-postgresql-ha.sh
    • Monitor: kubectl commands
  4. Test operations

    • Connect to database
    • Create backups
    • Test failover
    • Monitor metrics
  5. Advanced topics

    • Scale cluster
    • Update PostgreSQL
    • Performance tuning
    • Backup management

🎯 Success Criteria

Your deployment is successful when:

  • βœ… 3 PostgreSQL pods running
  • βœ… 3 PgBouncer pooler pods running
  • βœ… Primary pod shows "Primary" status
  • βœ… Replica pods show "Standby (sync)"
  • βœ… WAL archiving shows "OK"
  • βœ… Backups present in storage
  • βœ… Can connect via psql (both direct and pooled)
  • βœ… Grafana dashboard accessible
  • βœ… All PVCs bound and sized correctly
  • βœ… PgBouncer metrics showing active connections

πŸ§ͺ Failover Testing

After deployment, validate high availability with comprehensive failover tests:

Quick Start

# Navigate to failover testing
cd scripts/failover-testing

# Set PostgreSQL password
export PGPASSWORD=$(kubectl get secret pg-primary-app -n cnpg-database \
  -o jsonpath='{.data.password}' | base64 -d)

# Run recommended scenario (PgBouncer + Simulated Failure)
./scenario-2b-aks-pooler-simulated.sh

Test Scenarios

Automated AKS Pod Scenarios (ready to run):

  • scenario-1a-aks-direct-manual.sh - Direct PostgreSQL + Manual failover
  • scenario-1b-aks-direct-simulated.sh - Direct PostgreSQL + Simulated failure
  • scenario-2a-aks-pooler-manual.sh - PgBouncer + Manual failover ⭐
  • scenario-2b-aks-pooler-simulated.sh - PgBouncer + Simulated failure ⭐ Recommended

Azure VM External Client Scenarios (requires VM setup):

  • See docs/VM_SETUP_GUIDE.md for Azure VM configuration
  • See scripts/failover-testing/VM_SCENARIOS_REFERENCE.md for external client testing

What Gets Tested

  • βœ… RPO = 0 validation (zero data loss with synchronous replication)
  • βœ… RTO < 10s measurement (recovery time objective)
  • βœ… Connection resilience (Direct vs PgBouncer comparison)
  • βœ… Data consistency (pre/post-failover transaction verification)
  • βœ… Client reconnection (automatic vs manual)
  • βœ… Performance impact (TPS and latency during failover)

Expected Results

  • Target TPS: 4,000-8,000 sustained (payment gateway workload)
  • Failover Duration: <10 seconds (automatic promotion)
  • Data Loss: Zero (RPO=0 with synchronous replication)
  • PgBouncer Advantage: Transparent reconnection, <1% error rate
  • Direct Connection: 5-10% error rate during failover window

Documentation


πŸ”— Important Links


πŸ“ Version Information

Project Version: v1.0.0 (Semantic Versioning)
Release Date: October 2025
AKS Version: 1.32
Kubernetes Version: 1.32
CNPG Operator: 1.27.1
PostgreSQL: 18.0
Status: βœ… Lab & PoC Ready


Ready to deploy? Start with docs/SETUP_COMPLETE.md πŸš€

About

A complete automation framework for deploying a highly available PostgreSQL database on Azure Kubernetes Service with Premium v2 storage, CloudNativePG operator, and PgBouncer connection pooling.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •