Commit f8f00f4

Merge pull request #147 from scribd/COREINF-7967-data-warehouse-backup-post
COREINF-7967: Blog post - Building a Scalable Data Warehouse Backup System with AWS
2 parents ba9592a + 122d198 commit f8f00f4

@@ -0,0 +1,136 @@

---
layout: post
title: "Building a Scalable Data Warehouse Backup System with AWS"
tags:
- Data-warehouse
- Terraform
- AWS
- Deltalake
- Backup
team: Core Infrastructure
author: Oleh Motrunych
---

We designed and implemented a scalable, cost-optimized backup system for S3 data warehouses that runs automatically on a monthly schedule. The system handles petabytes of data across multiple databases and uses a hybrid approach: AWS Lambda for small workloads and ECS Fargate for larger ones.

At its core, the pipeline performs incremental backups — copying only new or changed parquet files while always preserving delta logs — dramatically reducing costs and runtime compared to full backups. Data is validated through S3 Inventory manifests, processed in parallel, and stored in Glacier for long-term retention.

To avoid data loss and reduce storage costs, we also implemented a safe deletion workflow. Files that are older than 90 days, successfully backed up, and no longer present in the source are tagged for lifecycle-based cleanup instead of being deleted immediately.

This approach ensures reliability, efficiency, and safety: backups scale seamlessly from small to massive datasets, compute resources are right-sized, and storage is continuously optimized.

![Open Data Warehouse Backup System diagram](../files/backup_system_diagram.png)

---

### Our old approach had problems:

- Copying the same files over and over again – wasteful from a cost perspective
- Timeouts when manifests were too large for Lambda
- Redundant backups inflating storage costs
- Orphaned files piling up without a clean deletion path

---

### We needed a systematic, automated, and cost-effective way to:

- Run monthly backups across all databases
- Scale from small jobs to massive datasets
- Handle incremental changes instead of full copies
- Safely clean up old data without risk of data loss

---

### The Design at a Glance

We built a hybrid backup architecture on AWS primitives:

- Step Functions – orchestrates the workflow
- Lambda – lightweight jobs for small manifests
- ECS Fargate – heavy jobs with no timeout constraints
- S3 + S3 Batch Ops – storage and bulk copy/delete operations
- EventBridge – monthly scheduler
- Glue, CloudWatch, Secrets Manager – reporting, monitoring, secure keys
- IAM – access and roles

The core idea: never re-copy files that are already in the backup, always copy the delta logs, and route by size – small manifests run in Lambda, big ones in ECS.
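
As a rough illustration of the scheduling piece (the rule name, ARNs, and cron expression below are placeholders, not our actual resources), a monthly EventBridge rule can kick off the Step Functions state machine like this:

```python
import boto3

events = boto3.client("events")

# Example schedule: 03:00 UTC on the 1st of every month.
events.put_rule(
    Name="monthly-dwh-backup",
    ScheduleExpression="cron(0 3 1 * ? *)",
    State="ENABLED",
)

# Point the rule at the Step Functions state machine that orchestrates the backup.
events.put_targets(
    Rule="monthly-dwh-backup",
    Targets=[
        {
            "Id": "dwh-backup-state-machine",
            "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:dwh-backup",
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-sfn",
        }
    ],
)
```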

---

### How It Works

1. **Database Discovery**

   - Parse S3 Inventory manifests
   - Identify database prefixes
   - Queue for processing (up to 40 in parallel)
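
   A simplified sketch of the discovery step, assuming the standard S3 Inventory `manifest.json` layout, gzipped CSV data files, and a `database/table/...` key convention (bucket and helper names are illustrative):

   ```python
   import csv
   import gzip
   import io
   import json

   import boto3

   s3 = boto3.client("s3")

   def discover_databases(inventory_bucket: str, manifest_key: str) -> set:
       """Read an S3 Inventory manifest and collect top-level database prefixes."""
       manifest = json.loads(
           s3.get_object(Bucket=inventory_bucket, Key=manifest_key)["Body"].read()
       )
       databases = set()
       for part in manifest["files"]:  # each part is a gzipped CSV listing objects
           body = s3.get_object(Bucket=inventory_bucket, Key=part["key"])["Body"].read()
           with gzip.open(io.BytesIO(body), mode="rt", newline="") as fh:
               for bucket, key, *_ in csv.reader(fh):
                   databases.add(key.split("/", 1)[0])  # e.g. "sales_db/orders/part-0.parquet"
       return databases
   ```

   Each discovered prefix is then queued for processing, with concurrency capped at 40 databases at a time.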

2. **Manifest Validation**

   Before we touch data, we validate:

   - JSON structure
   - All CSV parts present
   - File counts + checksums match

   If incomplete → wait up to 30 minutes before retry
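
   A minimal sketch of these checks, assuming a standard S3 Inventory `manifest.json` whose `files` entries carry `key`, `size`, and `MD5checksum` (the helper name is illustrative):

   ```python
   import hashlib
   import json

   import boto3
   from botocore.exceptions import ClientError

   s3 = boto3.client("s3")

   def validate_manifest(bucket: str, manifest_key: str) -> bool:
       """Return True when the inventory manifest is complete and consistent."""
       try:
           manifest = json.loads(
               s3.get_object(Bucket=bucket, Key=manifest_key)["Body"].read()
           )                                                           # JSON structure parses
       except (ClientError, json.JSONDecodeError):
           return False

       for part in manifest.get("files", []):
           try:
               head = s3.head_object(Bucket=bucket, Key=part["key"])   # CSV part exists
           except ClientError:
               return False
           if head["ContentLength"] != part["size"]:                   # size matches
               return False
           body = s3.get_object(Bucket=bucket, Key=part["key"])["Body"].read()
           if hashlib.md5(body).hexdigest() != part["MD5checksum"]:    # checksum matches
               return False
       return True
   ```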

3. **Routing by Size**

   - ≤25 files → Lambda (15 minutes, 5 GB)
   - >25 files → ECS Fargate (16 GB RAM, 4 vCPUs, unlimited runtime)
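
   The routing decision itself is just a threshold check; the sketch below shows its shape (the helper name and the payload consumed by a Step Functions Choice state are illustrative):

   ```python
   FILE_THRESHOLD = 25

   def route_by_size(file_count: int) -> dict:
       """Return a payload a Step Functions Choice state can branch on."""
       route = "lambda" if file_count <= FILE_THRESHOLD else "ecs_fargate"
       return {"file_count": file_count, "route": route}
   ```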

4. **Incremental Backup Logic**

   - Load the exclusion set from the last backup
   - Always include delta logs
   - Only back up parquet files not yet in the backup
   - Ignore non-STANDARD storage classes (we use Intelligent-Tiering; over time files can go to Glacier and we don’t want to touch them)
   - Process CSVs in parallel (20 workers)
   - Emit a new manifest + checksum for integrity
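
   A condensed sketch of the selection step (the `_delta_log/` path convention follows Delta Lake; the inventory CSV column order and helper names are assumptions for illustration):

   ```python
   from concurrent.futures import ThreadPoolExecutor

   CSV_WORKERS = 20  # parallel workers per manifest

   def select_for_backup(rows, already_backed_up: set) -> list:
       """Pick keys from one inventory CSV part that still need copying."""
       keys = []
       for bucket, key, size, last_modified, storage_class, *_ in rows:
           if "/_delta_log/" in key:
               keys.append(key)                  # delta logs are always included
           elif storage_class != "STANDARD":
               continue                          # leave tiered-out objects alone
           elif key.endswith(".parquet") and key not in already_backed_up:
               keys.append(key)                  # new or changed parquet file
       return keys

   def process_manifest(csv_parts, already_backed_up: set) -> list:
       """Run the selection over all CSV parts with a bounded worker pool."""
       with ThreadPoolExecutor(max_workers=CSV_WORKERS) as pool:
           chunks = pool.map(lambda rows: select_for_backup(rows, already_backed_up), csv_parts)
       return [key for chunk in chunks for key in chunk]
   ```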

5. **Copying Files**

   - Feed manifests into S3 Batch Operations
   - Copy objects into Glacier storage
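
   A sketch of handing one generated manifest to S3 Batch Operations (the account ID, bucket names, role name, and report settings are placeholders):

   ```python
   import boto3

   s3control = boto3.client("s3control")

   def start_copy_job(account_id: str, manifest_arn: str, manifest_etag: str) -> str:
       """Create an S3 Batch Operations job that copies the listed objects into Glacier."""
       response = s3control.create_job(
           AccountId=account_id,
           ConfirmationRequired=False,
           Priority=10,
           RoleArn=f"arn:aws:iam::{account_id}:role/backup-batch-ops",  # placeholder role
           Operation={
               "S3PutObjectCopy": {
                   "TargetResource": "arn:aws:s3:::dwh-backup-bucket",  # placeholder backup bucket
                   "StorageClass": "GLACIER",
               }
           },
           Manifest={
               "Spec": {
                   "Format": "S3BatchOperations_CSV_20180820",
                   "Fields": ["Bucket", "Key"],
               },
               "Location": {"ObjectArn": manifest_arn, "ETag": manifest_etag},
           },
           Report={
               "Bucket": "arn:aws:s3:::dwh-backup-reports",             # placeholder report bucket
               "Format": "Report_CSV_20180820",
               "Enabled": True,
               "ReportScope": "FailedTasksOnly",
           },
       )
       return response["JobId"]
   ```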

6. **Safe Deletion**

   - Compare the current inventory vs. incremental manifests
   - Identify parquet files that:
     - Were backed up successfully
     - No longer exist in the source
     - Are older than 90 days
   - Tag them for deletion instead of deleting immediately
   - The deletion itself is performed by an S3 lifecycle configuration, keeping it cost-optimized
   - Tags include timestamps for rollback + audit
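
   A sketch of the tag-then-expire mechanism (the tag keys, values, and lifecycle rule parameters are illustrative; note that `put_bucket_lifecycle_configuration` replaces the bucket's existing configuration):

   ```python
   from datetime import datetime, timezone

   import boto3

   s3 = boto3.client("s3")

   def tag_for_deletion(bucket: str, key: str) -> None:
       """Mark an object for lifecycle-based cleanup instead of deleting it now."""
       s3.put_object_tagging(
           Bucket=bucket,
           Key=key,
           Tagging={
               "TagSet": [
                   {"Key": "scheduled-deletion", "Value": "true"},
                   # Timestamp supports audit and rollback before the rule fires.
                   {"Key": "tagged-at", "Value": datetime.now(timezone.utc).isoformat()},
               ]
           },
       )

   def install_cleanup_rule(bucket: str) -> None:
       """Expire only the objects that carry the deletion tag."""
       s3.put_bucket_lifecycle_configuration(
           Bucket=bucket,
           LifecycleConfiguration={
               "Rules": [
                   {
                       "ID": "expire-tagged-objects",
                       "Status": "Enabled",
                       "Filter": {"Tag": {"Key": "scheduled-deletion", "Value": "true"}},
                       "Expiration": {"Days": 30},  # illustrative grace period
                   }
               ]
           },
       )
   ```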

---

### Error Handling & Resilience

- Retries with exponential backoff + jitter (sketched below)
- Strict validation before deletes
- Exclusion lists ensure delta logs are never deleted
- ECS tasks run in private subnets with VPC endpoints
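
The retry wrapper is plain exponential backoff with full jitter; a generic sketch (the attempt counts and delays here are illustrative, not our production values):

```python
import random
import time

def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0, cap: float = 60.0):
    """Retry a callable with exponential backoff plus full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to the exponentially growing cap.
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
```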

---

### Cost & Performance Gains

- Incremental logic = no redundant transfers
- Lifecycle rules = backups → Glacier, old ones cleaned
- Size-based routing = Lambda for cheap jobs, ECS for heavy jobs
- Parallelism = 20 CSV workers per manifest, 40 DBs at once

---

### Lessons Learned

- Always validate manifests before processing
- Never delete immediately → tagging first saved us money
- Thresholds matter: 25 files was our sweet spot
- CloudWatch + Slack reports gave us visibility we didn’t have before

---

### Conclusion

By combining Lambda, ECS Fargate, and S3 Batch Ops, we’ve built a resilient backup system that scales from small to massive datasets. Instead of repeatedly copying the same files, the system now performs truly incremental backups — capturing only new or changed parquet files while always preserving delta logs. This not only minimizes costs but also dramatically reduces runtime.

Our safe deletion workflow ensures that stale data is removed without risk, using lifecycle-based cleanup rather than immediate deletion. Together, these design choices give us reliable backups, efficient scaling, and continuous optimization of storage. What used to be expensive, error-prone, and manual is now automated, predictable, and cost-effective.

files/backup_system_diagram.png (1.05 MB)