Lazarus is a production-grade Kubernetes operator that automatically validates backup recovery by creating isolated test restores, running health checks, and measuring recovery metrics (RTO/RPO). It is built with Kopf and designed to integrate with Velero.
Organizations invest heavily in backup infrastructure, but most backups are never tested until disaster strikes. When that happens, teams discover:
- Backups are corrupted or incomplete
- Recovery procedures don't work as documented
- RTO/RPO SLAs are wildly inaccurate
- No one knows how to actually restore
Result: Extended downtime, data loss, angry customers, and career-limiting events.
Lazarus automatically tests every backup by:
- Detecting Velero backup completion
- Creating isolated test restores in temporary namespaces
- Validating resources with configurable health checks (database queries, HTTP endpoints, custom tests)
- Measuring actual RTO/RPO metrics
- Alerting on failures via Slack/PagerDuty
- Cleaning up test resources automatically
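The first two steps above (detecting a completed backup and creating an isolated restore) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Lazarus's actual implementation: it assumes the backup lists its namespaces explicitly in `spec.includedNamespaces`, and the `lazarus-test-` naming scheme and both function names are hypothetical. The Velero fields it relies on (`status.phase`, `spec.backupName`, `spec.namespaceMapping`) belong to Velero's Backup and Restore resources.

```python
from datetime import datetime, timezone
from typing import Any


def is_backup_completed(backup: dict[str, Any]) -> bool:
    """Return True when a Velero Backup object reports phase 'Completed'."""
    return backup.get("status", {}).get("phase") == "Completed"


def build_test_restore(backup: dict[str, Any], now: datetime) -> dict[str, Any]:
    """Build a Velero Restore manifest that remaps each namespace to a temporary one."""
    name = backup["metadata"]["name"]
    suffix = now.strftime("%Y%m%d%H%M%S")
    # Assumption: namespaces are listed explicitly; the "*" wildcard is skipped.
    namespaces = [
        ns for ns in backup.get("spec", {}).get("includedNamespaces", []) if ns != "*"
    ]
    # Hypothetical naming scheme: lazarus-test-<namespace>-<timestamp>.
    mapping = {ns: f"lazarus-test-{ns}-{suffix}" for ns in namespaces}
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Restore",
        "metadata": {
            "name": f"lazarus-{name}-{suffix}",
            "namespace": backup["metadata"]["namespace"],
        },
        "spec": {
            "backupName": name,
            "includedNamespaces": namespaces,
            "namespaceMapping": mapping,
        },
    }


if __name__ == "__main__":
    example = {
        "metadata": {"name": "my-app-backup-20251231", "namespace": "velero"},
        "spec": {"includedNamespaces": ["my-app"]},
        "status": {"phase": "Completed"},
    }
    if is_backup_completed(example):
        print(build_test_restore(example, datetime.now(timezone.utc)))
```

Mapping every source namespace to a fresh target keeps the test restore from touching the live workload, which is the point of the isolation step.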
- Fully Automated - Zero manual intervention required
- Velero Integration - Native support for Velero backups and restores
- Health Checks - Database, HTTP, and custom validation
- Prometheus Metrics - Production-ready observability
- Smart Notifications - Slack alerts on failures
- Isolated Testing - Test namespaces with configurable TTL
- Fast & Efficient - Parallel health checks, async operations
- Secure - Non-root containers, RBAC policies, secret handling
- Helm Chart - Production-ready deployment
- Portfolio-Ready - Clean code, comprehensive tests, excellent documentation
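To illustrate the "parallel health checks, async operations" item in the list above, here is a small standard-library sketch that runs several HTTP checks concurrently. It is an assumption about how such checks could be structured, not the operator's code: `HttpCheck`, `probe_status`, `run_check`, and `run_checks` are hypothetical names, and the fields mirror the `name`, `url`, and `expectedStatus` keys of the HTTP health check spec.

```python
import asyncio
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HttpCheck:
    name: str
    url: str
    expected_status: int = 200


def probe_status(url: str, timeout: float = 5.0) -> int:
    """Blocking GET that returns the HTTP status code, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises on error statuses; the code is still a valid result.
        return exc.code


async def run_check(check: HttpCheck, probe: Callable[[str], int]) -> tuple[str, bool]:
    """Run one check in a worker thread; connection errors count as a failure."""
    try:
        status = await asyncio.to_thread(probe, check.url)
    except Exception:
        return check.name, False
    return check.name, status == check.expected_status


async def run_checks(
    checks: list[HttpCheck], probe: Callable[[str], int] = probe_status
) -> dict[str, bool]:
    """Run all checks concurrently and map each check name to pass/fail."""
    results = await asyncio.gather(*(run_check(check, probe) for check in checks))
    return dict(results)
```

A call such as `asyncio.run(run_checks([HttpCheck("app-health", "http://my-app:8080/health")]))` returns a dictionary like `{"app-health": True}` when the endpoint answers with the expected status. Passing the probe as a parameter keeps the concurrency logic testable without a network.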
- Kubernetes 1.25+
- Velero installed and configured
- Helm 3.x
# Add Helm repository
helm repo add lazarus https://yourusername.github.io/lazarus-operator
helm repo update
# Install operator
helm install lazarus lazarus/lazarus \
--namespace lazarus-system \
--create-namespace \
--set config.velero.namespace=velero
# Verify installation
kubectl get pods -n lazarus-system
kubectl get crds | grep lazarus
# Create a simple restore test
cat <<EOF | kubectl apply -f -
apiVersion: lazarus.io/v1alpha1
kind: LazarusRestoreTest
metadata:
  name: my-first-test
  namespace: lazarus-system
spec:
  backupName: my-app-backup-20251231
  healthChecks:
    enabled: true
    http:
      enabled: true
      endpoints:
        - name: app-health
          url: http://my-app:8080/health
          expectedStatus: 200
EOF
# Watch progress
kubectl get lazarusrestoretests -n lazarus-system -w
kubectl describe lazarusrestoretest my-first-test -n lazarus-system
- Installation Guide - Detailed setup instructions
- Usage Guide - Creating and managing restore tests
- Configuration Reference - All configuration options
- Health Checks - Database, HTTP, and custom checks
- Metrics & Monitoring - Prometheus integration
- Troubleshooting - Common issues and solutions
- Architecture - Design and implementation details
- Development - Contributing guide
┌─────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                    │
└─────────────────────────────────────────────────────────┘
                             │
                 ┌───────────┴───────────┐
                 ▼                       ▼
        ┌──────────────────┐    ┌──────────────────┐
        │      Velero      │    │     Lazarus      │
        │    (Backups)     │    │    (Operator)    │
        └────────┬─────────┘    └────────┬─────────┘
                 │                       │
         Backup Completed          Creates Test
                 │                       │
                 └───────────┬───────────┘
                             │
          ┌──────────────────┼──────────────────┐
          ▼                  ▼                  ▼
       Restore         Health Checks         Metrics
      Resources      (DB/HTTP/Custom)     (Prometheus)
Lazarus exposes production-ready Prometheus metrics:
# Test success rate (last 24h)
sum(rate(lazarus_restore_tests_total{result="success"}[24h]))
/ sum(rate(lazarus_restore_tests_total[24h]))
# Average RTO by backup
avg(lazarus_recovery_time_objective_seconds) by (backup_name)
# Failed tests in the last 24h that require attention
increase(lazarus_restore_tests_total{result="failure"}[24h]) > 0
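As a rough illustration of what the recovery-time metric above could represent, the sketch below derives RTO and RPO values from timestamps. The definitions are assumptions, since this README does not spell out how Lazarus measures them: RTO is taken as the time from restore start until all health checks pass, and RPO is approximated as the age of the backup when the test restore starts. The function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a UTC timestamp in the Kubernetes style, e.g. '2025-12-31T02:00:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def measured_rto_seconds(restore_started: str, checks_passed: str) -> float:
    """Assumed RTO: seconds from restore start until all health checks pass."""
    delta = parse_k8s_time(checks_passed) - parse_k8s_time(restore_started)
    return delta.total_seconds()


def measured_rpo_seconds(backup_completed: str, restore_started: str) -> float:
    """Assumed RPO: age of the backup, in seconds, when the test restore starts."""
    delta = parse_k8s_time(restore_started) - parse_k8s_time(backup_completed)
    return delta.total_seconds()
```

For example, a restore that starts at `2025-12-31T02:00:00Z` and passes its checks at `2025-12-31T02:07:30Z` yields a measured RTO of 450 seconds, which can then be compared against the SLA target.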
Test every backup immediately after creation to catch corruption early.
Demonstrate backup recoverability for SOC2, HIPAA, PCI-DSS audits.
Measure actual recovery times vs. SLA commitments.
Run weekly/monthly DR drills automatically without manual effort.
Test backups before promoting to production.
- Python 3.11+ - Modern Python with type hints
- Kopf - Kubernetes operator framework
- Kubernetes Client - Official Python client
- Prometheus Client - Metrics and monitoring
- AsyncIO - Concurrent operations
- Pydantic - Configuration validation
- StructLog - Structured logging
- Poetry - Dependency management
- Pytest - Comprehensive testing
- Black/Ruff - Code formatting and linting
- MyPy - Static type checking
# Run unit tests
make test
# Run with coverage
make test-coverage
# Run linters
make lint
# Format code
make format
# Run all checks
make test lint
- Core operator functionality
- Database health checks (PostgreSQL, MySQL, MongoDB)
- HTTP endpoint validation
- Prometheus metrics
- Slack notifications
- Helm chart
- Custom pod-based health checks
- Policy-based scheduled testing
- Multi-cluster support
- Advanced data validation (checksums, row counts)
- PagerDuty integration
- Grafana dashboard templates
- Cost tracking per test
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Velero - Kubernetes backup and restore
- Kopf - Kubernetes operator framework
- Inspired by the need for reliable disaster recovery in production systems
- Author: Rajesh Ramesh
- Email: rramesh17993@gmail.com
- GitHub: @rramesh17993
- LinkedIn: Your Profile
⭐ Star this repo if you find it useful! ⭐
Built with ❤️ for SREs, Platform Engineers, and anyone who cares about reliable backups.
