This document describes the implementation of a Pilot Light disaster recovery (DR) strategy using AWS services. The solution keeps a minimal copy of the environment in a secondary region that can be scaled up rapidly during a disaster, and it covers the essential services: EC2, RDS, S3, and Lambda. To keep costs low, the DR region runs only the resources needed for replication and fast recovery.
Primary region components:
- VPC
- Public and private subnets
- Security groups
- Internet Gateway and VPC endpoints for S3
- EC2 instances in Auto Scaling Group (ASG)
- RDS MySQL database (primary)
- S3 bucket with cross-region replication enabled
- Lambda functions with configured event sources
- Application Load Balancer
- EventBridge rule for scheduled AMI creation
- EventBridge rule for automated SSM parameter synchronization
DR region components:
- VPC with public and private subnets
- Auto Scaling Group (scaled to 0 instances)
- RDS Read Replica
- S3 bucket (replication target)
- Lambda functions (disabled state)
- Application Load Balancer (pre-configured)
- EventBridge rule for automated failover and failback
- EventBridge rule for automated SSM parameter synchronization (DR backup, disabled)
- Lambda functions and Step Functions state machines for failover and failback operations
- EventBridge rules for automated failover and failback
- SSM Parameter Store for configuration
- IAM roles and policies
- Database: Cross-region read replica maintains synchronization with primary RDS
- S3: Cross-Region Replication (CRR) with versioning enabled
- AMI: Automated creation and copying to DR region
- Configuration: SSM Parameter Store synchronization
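
The configuration synchronization in the last item can be sketched as follows. This is a minimal illustration, not the project's actual Lambda code: `src_ssm` and `dst_ssm` are assumed to be boto3 SSM clients for the primary and DR regions (for example `boto3.client("ssm", region_name=...)`), and the parameter path is a hypothetical placeholder.

```python
def sync_parameters(src_ssm, dst_ssm, path):
    """Copy every parameter under `path` from the source region to the target region.

    Both arguments are assumed to be boto3 SSM clients. Returns the names copied.
    """
    copied = []
    paginator = src_ssm.get_paginator("get_parameters_by_path")
    # WithDecryption=True returns SecureString values in plain text so they can
    # be re-encrypted by the target region when written back.
    for page in paginator.paginate(Path=path, Recursive=True, WithDecryption=True):
        for param in page["Parameters"]:
            dst_ssm.put_parameter(
                Name=param["Name"],
                Value=param["Value"],
                Type=param["Type"],
                Overwrite=True,  # keep the DR copy in step with the primary
            )
            copied.append(param["Name"])
    return copied
```

A scheduled Lambda could call `sync_parameters(primary_client, dr_client, "/dev/app/")`, where `/dev/app/` stands in for whatever prefix the environment really uses.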
- DR region runs minimal resources:
  - ASG scaled to 0 instances
  - RDS read replica (necessary cost)
  - Pre-provisioned but inactive load balancer
  - S3 bucket for replication
  - Disabled Lambda functions
Failover is triggered by an EventBridge rule or by manual intervention:
- Database Promotion
  - DR Lambda promotes the RDS read replica to a standalone primary
  - The original primary is treated as unavailable; promotion permanently stops replication from it
- Compute Resources
  - Latest AMI is selected in DR region
  - ASG launch template is updated
  - ASG scaled up to handle production load
- Lambda Functions
  - Enable Lambda functions in DR region
  - Configure event sources
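
The failover steps above could look roughly like this in a Lambda or script. It is a sketch under several assumptions: the four client arguments are boto3 clients for RDS, EC2, Auto Scaling, and Lambda in the DR region; all identifiers and capacities are supplied by the caller; and "enabling" a Lambda function is taken to mean enabling its event source mappings, which may not match how the real functions are wired.

```python
def latest_image_id(images):
    """Return the ImageId of the most recently created AMI.

    EC2 reports CreationDate as an ISO 8601 UTC string, so plain string
    comparison orders images chronologically.
    """
    if not images:
        raise ValueError("no AMI found in the DR region")
    return max(images, key=lambda image: image["CreationDate"])["ImageId"]


def fail_over(rds, ec2, autoscaling, lambda_client, *, replica_id, template_id,
              asg_name, function_names, min_size, max_size, desired_capacity):
    """Run the three failover steps in order and return the AMI that was used."""
    # 1. Database promotion. The waiter can return early if the instance has
    #    not yet left the "available" state, so a production version should
    #    also confirm that the promotion actually took effect.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)

    # 2. Compute resources. Owners=["self"] lists every AMI in the account;
    #    a real implementation would filter by name or tag.
    images = ec2.describe_images(Owners=["self"])["Images"]
    image_id = latest_image_id(images)
    version = ec2.create_launch_template_version(
        LaunchTemplateId=template_id,
        SourceVersion="$Latest",
        LaunchTemplateData={"ImageId": image_id},
    )["LaunchTemplateVersion"]["VersionNumber"]
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        LaunchTemplate={"LaunchTemplateId": template_id, "Version": str(version)},
        MinSize=min_size,
        MaxSize=max_size,
        DesiredCapacity=desired_capacity,
    )

    # 3. Lambda functions: turn their event source mappings back on.
    for name in function_names:
        mappings = lambda_client.list_event_source_mappings(FunctionName=name)
        for mapping in mappings["EventSourceMappings"]:
            lambda_client.update_event_source_mapping(UUID=mapping["UUID"], Enabled=True)
    return image_id
```

Passing the clients in keeps the sequence easy to exercise with stubs before running it against a real account.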
Failover can be triggered manually through the AWS Console or the CLI:
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:<dr-region>:<account-id>:stateMachine:dev-dr-failover \
  --name <execution-name>
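
The same trigger can be issued from Python. The sketch below assumes `sfn` is a boto3 Step Functions client in the DR region. The timestamped execution name is one way to keep names unique per state machine, and the `reason` input field is hypothetical, since the state machine's input schema is not described here.

```python
import json
from datetime import datetime, timezone


def execution_name(prefix, now=None):
    """Build an execution name such as 'failover-20240102T030405Z'."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}-{now.strftime('%Y%m%dT%H%M%SZ')}"


def start_failover(sfn, state_machine_arn, reason, now=None):
    """Start the failover state machine and return the execution ARN."""
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        name=execution_name("failover", now),
        # Hypothetical input payload; adjust to what the state machine expects.
        input=json.dumps({"reason": reason}),
    )
    return response["executionArn"]
```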
- RTO (Recovery Time Objective): ~10-15 minutes
  - RDS promotion: ~5-10 minutes
  - ASG scaling: ~5 minutes
- RPO (Recovery Point Objective):
  - Database: seconds to minutes (async replication)
  - S3: minutes (async replication)
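
The database RPO above is bounded by how far the read replica trails the primary at the moment of failure. One way to observe this is the `ReplicaLag` metric that RDS publishes to CloudWatch in seconds. The sketch assumes `cloudwatch` is a boto3 CloudWatch client in the DR region; the one-hour window is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone


def worst_replica_lag_seconds(cloudwatch, replica_id, window_minutes=60, now=None):
    """Return the highest ReplicaLag seen in the window, or None if no data."""
    now = now or datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        StartTime=now - timedelta(minutes=window_minutes),
        EndTime=now,
        Period=60,  # one datapoint per minute
        Statistics=["Maximum"],
    )
    return max((point["Maximum"] for point in stats["Datapoints"]), default=None)
```

Comparing this value with the RPO target during regular checks shows whether "seconds to minutes" still holds under current write load.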
Failback is triggered by an EventBridge rule or by manual intervention:
- Database Restoration
  - Create snapshot of DR database
  - Copy snapshot to primary region
  - Restore primary database
  - Create new read replica in DR region
- Compute Resources
  - Update primary region ASG with latest AMI
  - Scale up primary ASG
  - Scale down DR ASG
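
The database restoration steps above can be sketched as follows. Assumptions: `dr_rds` and `primary_rds` are boto3 RDS clients for the DR and primary regions, and every identifier is chosen by the caller. Encrypted snapshots would additionally need a `KmsKeyId` valid in the target region, which is omitted here.

```python
def restore_primary_from_dr(dr_rds, primary_rds, *, dr_db_id, dr_region,
                            primary_region, snapshot_id, new_primary_id,
                            new_replica_id):
    """Rebuild the primary database from the DR database, then re-create the replica."""
    # 1. Snapshot the database that is currently serving traffic in DR.
    snapshot = dr_rds.create_db_snapshot(
        DBInstanceIdentifier=dr_db_id, DBSnapshotIdentifier=snapshot_id
    )["DBSnapshot"]
    dr_rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snapshot_id)

    # 2. Copy it across regions; the copy is requested from the target region.
    primary_rds.copy_db_snapshot(
        SourceDBSnapshotIdentifier=snapshot["DBSnapshotArn"],
        TargetDBSnapshotIdentifier=snapshot_id,
        SourceRegion=dr_region,
    )
    primary_rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snapshot_id)

    # 3. Restore a new primary instance from the copied snapshot.
    instance = primary_rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=new_primary_id, DBSnapshotIdentifier=snapshot_id
    )["DBInstance"]
    primary_rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=new_primary_id)

    # 4. Re-establish the pilot light: a fresh cross-region replica in DR.
    dr_rds.create_db_instance_read_replica(
        DBInstanceIdentifier=new_replica_id,
        SourceDBInstanceIdentifier=instance["DBInstanceArn"],
        SourceRegion=primary_region,
    )
    return instance["DBInstanceArn"]
```

Writes that reach the DR database after the snapshot is taken are not carried over, so a failback along these lines usually needs a write freeze or a maintenance window.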
Failback can be triggered manually through the AWS Console or the CLI:
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:<dr-region>:<account-id>:stateMachine:dev-dr-failback \
  --name <execution-name>
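
After starting an execution it is useful to wait for its outcome. The helper below is a sketch that polls `describe_execution` until the execution leaves the running state. `sfn` is assumed to be a boto3 Step Functions client, and the polling interval and limit are arbitrary defaults.

```python
import time

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(sfn, execution_arn, poll_seconds=15, max_polls=120,
                       sleep=time.sleep):
    """Poll until the execution finishes and return its final status."""
    for _ in range(max_polls):
        status = sfn.describe_execution(executionArn=execution_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)  # injectable so tests do not have to wait
    raise TimeoutError(f"execution still running after {max_polls} polls")
```

The execution ARN is the `executionArn` value returned by `start-execution`.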
- Monthly failover test to DR region
- Quarterly full DR simulation
- Bi-annual failback testing
- Regular validation of AMI copying and replication
- Testing of Lambda functions
- Review and update of IAM policies
- Regular updates to DR documentation
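
Validation of AMI copying, listed above, can be partly automated by checking how old the newest AMI in the DR region is. The sketch works on the `Images` list returned by EC2 `describe_images`. The 24-hour threshold is a hypothetical value, since the AMI creation schedule is not stated here.

```python
from datetime import datetime, timedelta, timezone


def parse_creation_date(value):
    """Parse an EC2 CreationDate such as '2024-03-01T12:00:00.000Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def newest_ami_age(images, now=None):
    """Return the age of the most recent AMI, or None if there are no AMIs."""
    if not images:
        return None
    now = now or datetime.now(timezone.utc)
    newest = max(parse_creation_date(image["CreationDate"]) for image in images)
    return now - newest


def ami_copy_is_fresh(images, max_age=timedelta(hours=24), now=None):
    """True when the DR region holds an AMI no older than `max_age`."""
    age = newest_ami_age(images, now)
    return age is not None and age <= max_age
```

A scheduled check could feed it `ec2.describe_images(Owners=["self"])["Images"]` from the DR region and raise an alert when it returns `False`.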
- Primary region: full production environment costs
- DR region:
  - Load Balancer: hourly cost
  - ASG: minimal costs (0 instances)
  - S3: replication and storage costs
  - RDS Read Replica: full instance cost
  - Lambda: minimal costs (disabled functions)
- EventBridge: Minimal costs
- Step Functions: Minimal costs
- Encryption at rest for RDS and S3
- VPC security groups for access control
- IAM roles with minimal required permissions
- Private subnets for application and database
- Security groups for fine-grained access control
- VPC endpoints for AWS services
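
As an illustration of "IAM roles with minimal required permissions", the sketch below builds a policy document that lets a failover function promote one specific replica and resize one specific Auto Scaling group. It covers only those two steps, and the region, account, and resource names are placeholders passed in by the caller rather than values from this project.

```python
import json


def failover_policy(region, account_id, replica_id, asg_name):
    """Return an IAM policy document scoped to one replica and one ASG."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PromoteReplica",
                "Effect": "Allow",
                "Action": ["rds:PromoteReadReplica"],
                "Resource": f"arn:aws:rds:{region}:{account_id}:db:{replica_id}",
            },
            {
                "Sid": "ScaleAsg",
                "Effect": "Allow",
                "Action": ["autoscaling:UpdateAutoScalingGroup"],
                "Resource": (
                    f"arn:aws:autoscaling:{region}:{account_id}:autoScalingGroup:*"
                    f":autoScalingGroupName/{asg_name}"
                ),
            },
            {
                # Describe calls are read-only lookups, so "*" is used here.
                "Sid": "ReadOnlyLookups",
                "Effect": "Allow",
                "Action": [
                    "rds:DescribeDBInstances",
                    "ec2:DescribeImages",
                    "autoscaling:DescribeAutoScalingGroups",
                ],
                "Resource": "*",
            },
        ],
    }


def failover_policy_json(region, account_id, replica_id, asg_name):
    """Serialize the policy for use with IAM APIs or templates."""
    return json.dumps(failover_policy(region, account_id, replica_id, asg_name), indent=2)
```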
- Asynchronous Replication
  - Potential for data loss during failover and failback
  - RPO depends on replication lag at the time of failure, which grows with network latency and write volume
- DNS Propagation
  - May affect actual RTO
  - Consider AWS Global Accelerator or Route 53 health checks to shorten the cutover
- Cost vs. Recovery Time
  - Faster recovery requires more running resources
  - Balance based on business requirements
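
For the DNS consideration above, one option is a pair of Route 53 failover alias records that point at the two load balancers. The sketch only builds the `ChangeBatch` structure. The record name, the ALB DNS names and hosted zone IDs, and the health check ID are all placeholders supplied by the caller.

```python
def failover_record(name, role, alb_dns_name, alb_zone_id, health_check_id=None):
    """Build one UPSERT change for a failover alias record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-alb",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,  # the ALB's own hosted zone ID
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_alb, dr_alb, health_check_id=None):
    """Build a ChangeBatch with a PRIMARY and a SECONDARY record.

    `primary_alb` and `dr_alb` are (dns_name, hosted_zone_id) tuples.
    """
    return {
        "Comment": "Pilot light failover routing",
        "Changes": [
            failover_record(name, "PRIMARY", *primary_alb, health_check_id=health_check_id),
            failover_record(name, "SECONDARY", *dr_alb),
        ],
    }
```

The result can be passed to a boto3 Route 53 client as `change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`. With this routing in place, Route 53 answers with the DR load balancer once the primary is reported unhealthy.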