Pilot Light Disaster Recovery Solution

Overview

This document describes a Pilot Light disaster recovery (DR) strategy built on AWS services. A minimal version of the environment runs in a secondary region and can be rapidly scaled up during a disaster. The solution covers the essential services (EC2, RDS, S3, and Lambda) and stays cost-effective by keeping only minimal resources running in the DR region.

Architecture Components

Primary Region (Production)

VPC with public and private subnets, security groups, an Internet Gateway, and VPC endpoints for S3
EC2 instances in an Auto Scaling Group (ASG)
RDS MySQL database (primary)
S3 bucket with cross-region replication enabled
Lambda functions with configured event sources
Application Load Balancer
EventBridge rule for scheduled AMI creation
EventBridge rule for automated SSM parameter synchronization

DR Region (Pilot Light)

VPC with public and private subnets
Auto Scaling Group (scaled to 0 instances)
RDS Read Replica
S3 bucket (replication target)
Lambda functions (disabled state)
Application Load Balancer (pre-configured)
EventBridge rule for automated failover and failback
EventBridge rule for automated SSM parameter synchronization (DR backup, disabled)

Automation Components

Lambda functions and Step Functions state machines for failover and failback operations
EventBridge rules for automated failover and failback
SSM Parameter Store for configuration
IAM roles and policies

Normal Operation

Data Synchronization

Database: A cross-region read replica replicates asynchronously from the primary RDS instance
S3: Cross-Region Replication (CRR) with versioning enabled
AMI: Automated creation and copying to DR region
Configuration: SSM Parameter Store synchronization
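
The configuration synchronization can be pictured as a diff between the parameter sets of the two regions. The following is a minimal sketch in Python, assuming the parameters have already been read into plain dictionaries (for example with the SSM `get_parameters_by_path` API). The function name `plan_parameter_sync` and the sample parameter names are hypothetical and are not taken from this repository.

```python
def plan_parameter_sync(primary, dr):
    """Return (to_put, to_delete) so that the DR parameters match the primary.

    Both arguments map parameter name -> value. This is an illustrative
    helper, not the repository's actual synchronization Lambda.
    """
    # Parameters that are missing in DR or whose value has drifted.
    to_put = {name: value for name, value in primary.items() if dr.get(name) != value}
    # Parameters that exist only in DR and would be stale after a failover.
    to_delete = sorted(name for name in dr if name not in primary)
    return to_put, to_delete


# Hypothetical sample data.
primary_params = {"/app/log_level": "info", "/app/api_timeout": "30"}
dr_params = {"/app/log_level": "debug", "/app/legacy_flag": "true"}

to_put, to_delete = plan_parameter_sync(primary_params, dr_params)
print(to_put)
print(to_delete)
```

Computing the plan separately from applying it makes it possible to log or review the changes before anything is written to the DR region.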

Cost Optimization

DR region runs minimal resources:
- ASG scaled to 0 instances
- RDS read replica (necessary cost)
- Pre-provisioned but inactive load balancer
- S3 bucket for replication
- Disabled Lambda functions

Disaster Recovery Process

Automated Failover

Triggered by EventBridge rule or manual intervention:

Database Promotion
- A Lambda function in the DR region promotes the RDS read replica to a standalone primary
- Replication from the original primary stops, and the original primary is no longer used

Compute Resources
- The latest AMI in the DR region is selected
- The ASG launch template is updated with that AMI
- The ASG is scaled up to handle production load

Lambda Functions
- Lambda functions in the DR region are enabled
- Event sources are configured
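
The database and compute steps can be sketched as an ordered sequence of API calls. This is a minimal illustration in Python, not the repository's actual failover Lambda. It assumes the three client arguments behave like boto3 clients for the DR region, that the AMIs are owned by the same account, and that the ASG references the `$Latest` launch template version. The names `fail_over`, `latest_image_id`, and `DryRunClient`, and all identifiers in the usage example, are hypothetical. The Lambda step is left out because the source does not say how the functions are disabled.

```python
def latest_image_id(images):
    """Pick the newest AMI from an EC2 describe_images 'Images' list."""
    # CreationDate is an ISO 8601 UTC string, so string order is time order.
    return max(images, key=lambda image: image["CreationDate"])["ImageId"]


def fail_over(rds, ec2, autoscaling, *, replica_id, template_name, asg_name, capacity):
    """Run the database and compute failover steps in order."""
    # 1. Database promotion: the replica becomes a standalone, writable instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # 2. Compute: point the launch template at the newest AMI owned by this account.
    images = ec2.describe_images(Owners=["self"])["Images"]
    image_id = latest_image_id(images)
    ec2.create_launch_template_version(
        LaunchTemplateName=template_name,
        SourceVersion="$Latest",
        LaunchTemplateData={"ImageId": image_id},
    )

    # 3. Compute: scale the ASG from 0 up to the requested capacity.
    #    Setting all three sizes together is an assumption made for simplicity.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=capacity,
        DesiredCapacity=capacity,
        MaxSize=capacity,
    )
    return image_id


class DryRunClient:
    """Stand-in for a boto3 client that records calls instead of making them."""

    def __init__(self, calls, images=()):
        self.calls = calls
        self.images = list(images)

    def __getattr__(self, method):
        def record(*args, **kwargs):
            self.calls.append(method)
            # Only describe_images needs a meaningful return value here.
            return {"Images": self.images}
        return record


# Dry run with hypothetical identifiers; no AWS credentials are needed.
calls = []
dry = DryRunClient(calls, images=[
    {"ImageId": "ami-0aaa", "CreationDate": "2024-01-01T00:00:00.000Z"},
    {"ImageId": "ami-0bbb", "CreationDate": "2024-03-01T00:00:00.000Z"},
])
chosen = fail_over(dry, dry, dry, replica_id="dr-replica",
                   template_name="dr-template", asg_name="dr-asg", capacity=2)
print(chosen, calls)
```

Promotion is asynchronous, so a real implementation would likely wait for the promoted instance to become available (boto3 offers a `db_instance_available` waiter) before routing traffic to the new instances.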

Manual Failover

Failover can also be triggered manually through the AWS Console or the AWS CLI:

aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:<dr-region>:<account-id>:stateMachine:dev-dr-failover \
  --name <execution-name>
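
For Standard workflows, Step Functions expects the execution name to be unique per account, region, and state machine for 90 days, and limits it to 80 characters. One common way to satisfy this is a timestamp suffix. The sketch below is a hypothetical helper in Python; the name `execution_name` and the `failover` prefix are assumptions, not part of this repository.

```python
from datetime import datetime, timezone


def execution_name(prefix, now):
    """Build an execution name from a prefix and a UTC timestamp."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}-{stamp}"


name = execution_name("failover", datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc))
print(name)  # failover-20240501T120000Z
```

The resulting string can be passed as the `--name` value, which also makes executions easy to sort by start time in the console.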

Recovery Time

RTO (Recovery Time Objective): ~10-15 minutes
- RDS promotion: ~5-10 minutes
- ASG scaling: ~5 minutes
RPO (Recovery Point Objective):
- Database: seconds to minutes (async replication)
- S3: minutes (async replication)
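
The ~10-15 minute RTO equals the two step estimates added together, which suggests the steps are treated as running one after the other. The arithmetic is sketched below in Python; the function names are hypothetical, and the parallel case is included only for comparison, not because the solution is known to run the steps concurrently.

```python
def sequential_rto(steps):
    """Sum (min, max) minute estimates for steps that run one after another."""
    lows = sum(low for low, _ in steps.values())
    highs = sum(high for _, high in steps.values())
    return (lows, highs)


def parallel_rto(steps):
    """Bound for steps that all start together: the slowest step dominates."""
    lows = max(low for low, _ in steps.values())
    highs = max(high for _, high in steps.values())
    return (lows, highs)


# Estimates in minutes, taken from the figures listed above.
estimates = {"rds_promotion": (5, 10), "asg_scaling": (5, 5)}

print(sequential_rto(estimates))  # (10, 15)
print(parallel_rto(estimates))    # (5, 10)
```

Neither figure includes DNS propagation, which is listed under the limitations as a factor that may affect the actual RTO.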

Failback Process

Automated Failback

Triggered by EventBridge rule or manual intervention:

Database Restoration
- Create a snapshot of the DR database
- Copy the snapshot to the primary region
- Restore the primary database from the copied snapshot
- Create a new read replica in the DR region

Compute Resources
- Update the primary region ASG with the latest AMI
- Scale up the primary ASG
- Scale down the DR ASG
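
The database restoration steps can be sketched as an ordered sequence of RDS API calls with waits in between. This is a minimal illustration in Python, not the repository's actual failback state machine. It assumes `rds_dr` and `rds_primary` behave like boto3 RDS clients for the DR and primary regions, and that the chosen primary identifier is free to use (the old primary has been removed or renamed). The names `fail_back` and `DryRunClient` and all identifiers are hypothetical. The compute steps are omitted to keep the sketch short.

```python
def fail_back(rds_dr, rds_primary, *, dr_db_id, dr_region, snapshot_id, snapshot_arn,
              primary_db_id, primary_db_arn, primary_region, replica_id):
    """Run the database failback steps in order.

    Encrypted databases additionally need a KmsKeyId on the snapshot copy
    and on the new replica; that is left out of this sketch.
    """
    # 1. Snapshot the DR database, which is the writable instance after failover.
    rds_dr.create_db_snapshot(DBInstanceIdentifier=dr_db_id,
                              DBSnapshotIdentifier=snapshot_id)
    rds_dr.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snapshot_id)

    # 2. Copy the snapshot. A cross-region copy is requested from the
    #    destination region and refers to the source snapshot by ARN.
    rds_primary.copy_db_snapshot(SourceDBSnapshotIdentifier=snapshot_arn,
                                 TargetDBSnapshotIdentifier=snapshot_id,
                                 SourceRegion=dr_region)
    rds_primary.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snapshot_id)

    # 3. Restore the primary database from the copied snapshot.
    rds_primary.restore_db_instance_from_db_snapshot(DBInstanceIdentifier=primary_db_id,
                                                     DBSnapshotIdentifier=snapshot_id)
    rds_primary.get_waiter("db_instance_available").wait(DBInstanceIdentifier=primary_db_id)

    # 4. Create a new read replica in the DR region. A cross-region replica
    #    refers to its source instance by ARN.
    rds_dr.create_db_instance_read_replica(DBInstanceIdentifier=replica_id,
                                           SourceDBInstanceIdentifier=primary_db_arn,
                                           SourceRegion=primary_region)


class DryRunClient:
    """Stand-in for a boto3 client that records calls instead of making them."""

    def __init__(self, label, calls):
        self.label = label
        self.calls = calls

    def __getattr__(self, method):
        def record(*args, **kwargs):
            self.calls.append(f"{self.label}.{method}")
            return self  # lets get_waiter(...).wait(...) chain
        return record


# Dry run with hypothetical identifiers; no AWS credentials are needed.
calls = []
fail_back(DryRunClient("dr", calls), DryRunClient("primary", calls),
          dr_db_id="dr-db", dr_region="us-west-2", snapshot_id="failback-snap",
          snapshot_arn="arn:aws:rds:us-west-2:111122223333:snapshot:failback-snap",
          primary_db_id="primary-db",
          primary_db_arn="arn:aws:rds:us-east-1:111122223333:db:primary-db",
          primary_region="us-east-1", replica_id="dr-replica")
print(calls)
```

Writes accepted by the DR database after the snapshot is taken are not part of the restored primary, so a real failback would typically stop writes before step 1.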

Manual Failback

Failback can also be triggered manually through the AWS Console or the AWS CLI:

aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:<dr-region>:<account-id>:stateMachine:dev-dr-failback \
  --name <execution-name>

Testing and Maintenance

Regular Testing Schedule

Monthly failover test to DR region
Quarterly full DR simulation
Bi-annual failback testing

Maintenance Tasks

Regular validation of AMI copying and replication
Testing of Lambda functions
Review and update of IAM policies
Regular updates to DR documentation

Cost Considerations

Primary Region

Full production environment costs

DR Region

Load Balancer: Hourly charge, billed even while idle
ASG: No instance costs while scaled to 0
S3: Replication and storage costs
RDS Read Replica: Full instance cost
Lambda: Minimal costs (disabled functions)
EventBridge: Minimal costs
Step Functions: Minimal costs

Security

Data Protection

Encryption at rest for RDS and S3
VPC security groups for access control
IAM roles with minimal required permissions

Network Security

Private subnets for application and database
Security groups for fine-grained access control
VPC endpoints for AWS services

Limitations and Considerations

Asynchronous Replication
- Data written shortly before a failover or failback may be lost
- RPO depends on replication lag, which grows with network latency and write volume

DNS Propagation
- May extend the actual RTO beyond the estimate
- Consider AWS Global Accelerator or Route 53 health checks to redirect traffic

Cost vs. Recovery Time
- Faster recovery requires more running resources
- Balance the two based on business requirements
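
As one possible way to act on the DNS consideration, Route 53 failover routing pairs a PRIMARY and a SECONDARY record for the same name and serves the secondary when the primary target is unhealthy. The Python sketch below only builds the change batch that would be passed to the Route 53 `change_resource_record_sets` API. It is an assumption about how this could be wired up, not something this repository is known to do, and the function name, domain, DNS names, and zone IDs are all placeholders.

```python
def failover_record(name, role, alb_dns_name, alb_zone_id):
    """Build one alias A record for Route 53 failover routing.

    role must be "PRIMARY" or "SECONDARY".
    """
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"unsupported failover role: {role}")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-alb",
            "Failover": role,
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,
                "DNSName": alb_dns_name,
                # Let Route 53 use the load balancer's health to decide failover.
                "EvaluateTargetHealth": True,
            },
        },
    }


# Placeholder values for illustration only.
change_batch = {
    "Changes": [
        failover_record("app.example.com.", "PRIMARY",
                        "primary-alb.example.com.", "ZPRIMARYPLACEHOLDER"),
        failover_record("app.example.com.", "SECONDARY",
                        "dr-alb.example.com.", "ZDRPLACEHOLDER"),
    ]
}
print(change_batch["Changes"][0]["ResourceRecordSet"]["SetIdentifier"])
```

With records like these, clients follow the DR load balancer once the primary is reported unhealthy, subject to resolver caching.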

Repository: dansarpong/pilot-light-dr-project
