⚠️ PROOF OF CONCEPT - NOT PRODUCTION READY
This is a proof-of-concept system generated by Claude AI and is UNTESTED. Do not deploy this in production environments without thorough testing, security review, and validation. Use at your own risk.
A comprehensive early warning system for OpenStack cloud infrastructure that monitors dataplane health across multiple datacenters and availability zones.
The OpenStack Canary system provides proactive monitoring of your OpenStack infrastructure through:
- Multi-AZ Deployment: Distributes canary instances across availability zones
- Synthetic Traffic Generation: Creates realistic workload patterns between instances
- Comprehensive Metrics: Tracks latency, throughput, system health, and application performance
- Early Warning Detection: Identifies dataplane issues before they impact production workloads
- Datadog Integration: Real-time monitoring, alerting, and dashboards
┌───────────────────────────────────────────────────────────────────┐
│                  OpenStack Cloud Infrastructure                   │
├─────────────────┬─────────────────┬───────────────────────────────┤
│  Datacenter 1   │  Datacenter 2   │         Datacenter N          │
├─────────────────┼─────────────────┼───────────────────────────────┤
│ ┌─────────────┐ │ ┌─────────────┐ │ ┌─────────────┬─────────────┐ │
│ │    AZ-A     │ │ │    AZ-A     │ │ │    AZ-A     │    AZ-B     │ │
│ │ ┌─────────┐ │ │ │ ┌─────────┐ │ │ │ ┌─────────┐ │ ┌─────────┐ │ │
│ │ │ Canary  │ │ │ │ │ Canary  │ │ │ │ │ Canary  │ │ │ Canary  │ │ │
│ │ │Instance │ │ │ │ │Instance │ │ │ │ │Instance │ │ │Instance │ │ │
│ │ └─────────┘ │ │ │ └─────────┘ │ │ │ └─────────┘ │ └─────────┘ │ │
│ └─────────────┘ │ └─────────────┘ │ └─────────────┴─────────────┘ │
└─────────────────┴─────────────────┴───────────────────────────────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │     Datadog     │
                         │   Monitoring    │
                         │   & Alerting    │
                         └─────────────────┘
- Canary Application (`app.py`): Core web service with health endpoints
- Traffic Generator (`traffic_generator.py`): Synthetic workload creation
- System Monitor (`system_monitor.py`): OS-level monitoring and alerts
- Datadog Integration (`datadog_config.py`): Metrics, dashboards, and alerting
- Deployment Automation: Heat templates and Docker deployment scripts
1. Clone and configure:
   git clone <repository>
   cd openstack-canary
2. Set up environment:
   cp .env.example .env
   # Edit .env with your configuration
3. Deploy:
   ./docker-deploy.sh start --datacenter dc1 --az zone-a
4. Verify health:
   curl http://localhost:8080/health
1. Configure deployment:
   cp deploy-config.env.example deploy-config.env
   # Edit with your OpenStack configuration
2. Deploy to OpenStack:
   ./deploy.sh deploy --datacenter dc1 --azs "nova,zone-a,zone-b"
3. Set up monitoring:
   ./deploy.sh setup-datadog --dd-api-key YOUR_KEY --dd-app-key YOUR_APP_KEY
Variable | Description | Default |
---|---|---|
CANARY_ID | Unique identifier for canary instance | canary-{hostname} |
DATACENTER | Datacenter name for tagging | unknown |
AVAILABILITY_ZONE | AZ name for tagging | unknown |
DD_API_KEY | Datadog API key | - |
DD_APP_KEY | Datadog Application key | - |
PEER_ENDPOINTS | Comma-separated peer endpoints | - |
TRAFFIC_INTERVAL | Traffic generation interval (seconds) | 10 |
MONITORING_INTERVAL | System monitoring interval (seconds) | 30 |
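
As an illustration of how these variables and their defaults might be read at startup, here is a minimal sketch. The `CanaryConfig` class and `load_config` function are hypothetical names, not taken from `app.py`; only the variable names and defaults come from the table above.

```python
import os
import socket
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class CanaryConfig:
    """Hypothetical container for the settings in the table above."""
    canary_id: str
    datacenter: str
    availability_zone: str
    dd_api_key: Optional[str]
    dd_app_key: Optional[str]
    peer_endpoints: List[str]
    traffic_interval: int      # seconds
    monitoring_interval: int   # seconds


def load_config(env: Mapping[str, str] = os.environ) -> CanaryConfig:
    """Read settings from an environment mapping, applying the documented defaults."""
    # PEER_ENDPOINTS is comma-separated; drop empty entries and stray whitespace.
    peers = [p.strip() for p in env.get("PEER_ENDPOINTS", "").split(",") if p.strip()]
    return CanaryConfig(
        canary_id=env.get("CANARY_ID") or f"canary-{socket.gethostname()}",
        datacenter=env.get("DATACENTER", "unknown"),
        availability_zone=env.get("AVAILABILITY_ZONE", "unknown"),
        dd_api_key=env.get("DD_API_KEY"),
        dd_app_key=env.get("DD_APP_KEY"),
        peer_endpoints=peers,
        traffic_interval=int(env.get("TRAFFIC_INTERVAL", "10")),
        monitoring_interval=int(env.get("MONITORING_INTERVAL", "30")),
    )


if __name__ == "__main__":
    print(load_config())
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dictionary instead of the real process environment.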
Metric | Warning | Critical |
---|---|---|
CPU Usage | 70% | 90% |
Memory Usage | 80% | 95% |
Disk Usage | 85% | 95% |
Load Average | 5.0 | 10.0 |
Error Rate | 5% | 10% |
Peer Connectivity | 70% | 50% |
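
The thresholds above could be applied roughly as follows. This is a sketch under assumptions: the metric keys and the `classify` function are hypothetical, and treating a value exactly at a threshold as a breach is a choice made here, not something the table specifies. Note that Peer Connectivity is inverted relative to the other rows: lower values are worse.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    warning: float
    critical: float
    lower_is_worse: bool = False


# Values copied from the table above; the dictionary keys are hypothetical names.
THRESHOLDS = {
    "cpu_percent": Threshold(70, 90),
    "memory_percent": Threshold(80, 95),
    "disk_percent": Threshold(85, 95),
    "load_average": Threshold(5.0, 10.0),
    "error_rate_percent": Threshold(5, 10),
    "peer_connectivity_percent": Threshold(70, 50, lower_is_worse=True),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a single reading."""
    t = THRESHOLDS[metric]
    if t.lower_is_worse:
        # Connectivity degrades as the percentage of reachable peers drops.
        if value <= t.critical:
            return "critical"
        if value <= t.warning:
            return "warning"
        return "ok"
    if value >= t.critical:
        return "critical"
    if value >= t.warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for name, reading in [("cpu_percent", 75), ("peer_connectivity_percent", 40)]:
        print(name, reading, classify(name, reading))
```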
- `GET /health` - Basic health status
- `GET /health/detailed` - Detailed system metrics
- `GET /connectivity` - Test connectivity to peer instances
- `GET /load-test` - Generate synthetic load
- `GET /metrics` - Prometheus-style metrics
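
Since `/metrics` is described as Prometheus-style, its output can be filtered in Python much like `grep canary_` would. The sketch below is a simplified reader for the Prometheus text format: it assumes samples carry no trailing timestamp, and the metric names in `SAMPLE` are made up for illustration rather than taken from the application.

```python
from typing import Dict

# Hypothetical scrape output; the real metric names may differ.
SAMPLE = """\
# HELP canary_requests_total Total requests handled
# TYPE canary_requests_total counter
canary_requests_total{endpoint="/health"} 42
canary_peer_latency_ms 12.5
process_cpu_seconds_total 3.2
"""


def parse_metrics(text: str, prefix: str = "canary_") -> Dict[str, float]:
    """Map each series (name plus labels) starting with `prefix` to its value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last whitespace-separated field on the line.
        series, _, value = line.rpartition(" ")
        if series.startswith(prefix):
            samples[series] = float(value)
    return samples


if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```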
{
"status": "healthy",
"canary_id": "canary-dc1-zone-a-001",
"datacenter": "dc1",
"availability_zone": "zone-a",
"timestamp": "2024-01-15T10:30:00Z",
"uptime_seconds": 3600,
"system_metrics": {
"cpu_percent": 15.2,
"memory": {
"percent": 45.8,
"available": 2147483648
},
"disk": {
"percent": 25.3,
"free": 8589934592
}
}
}
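
A client consuming this response might reduce it to a few fields, as in the sketch below. The `parse_health` function is a hypothetical helper, and interpreting `available` and `free` as byte counts is an assumption based on the magnitudes in the sample.

```python
import json
from datetime import datetime

GIB = 2 ** 30

SAMPLE = """
{
  "status": "healthy",
  "canary_id": "canary-dc1-zone-a-001",
  "datacenter": "dc1",
  "availability_zone": "zone-a",
  "timestamp": "2024-01-15T10:30:00Z",
  "uptime_seconds": 3600,
  "system_metrics": {
    "cpu_percent": 15.2,
    "memory": {"percent": 45.8, "available": 2147483648},
    "disk": {"percent": 25.3, "free": 8589934592}
  }
}
"""


def parse_health(body: str) -> dict:
    """Extract the fields a dashboard or alert rule is likely to need."""
    data = json.loads(body)
    metrics = data.get("system_metrics", {})
    memory = metrics.get("memory", {})
    disk = metrics.get("disk", {})
    return {
        "healthy": data.get("status") == "healthy",
        "canary_id": data["canary_id"],
        "location": f'{data["datacenter"]}/{data["availability_zone"]}',
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        "timestamp": datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00")),
        "cpu_percent": metrics.get("cpu_percent"),
        "memory_available_gib": memory.get("available", 0) / GIB,  # assumes bytes
        "disk_free_gib": disk.get("free", 0) / GIB,                # assumes bytes
    }


if __name__ == "__main__":
    print(parse_health(SAMPLE))
```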
The system automatically creates:
- Dashboard: Comprehensive overview of all canary instances
- Alerts: Proactive notifications for infrastructure issues
- SLOs: Service level objectives for availability tracking
- `canary.health_check` - Health check frequency
- `canary.peer_latency` - Inter-instance latency
- `canary.traffic_gen.success_rate` - Traffic generation success rate
- `canary.system.cpu_percent` - CPU utilization
- `canary.system.memory_percent` - Memory utilization
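
One way such metrics can reach Datadog is as DogStatsD gauge datagrams sent over UDP to a local Datadog Agent. The sketch below uses only the standard library and is not necessarily how `datadog_config.py` submits metrics. The helper names and the tag keys `datacenter` and `availability_zone` are assumptions; port 8125 is the DogStatsD default.

```python
import socket
from typing import Iterable, Optional


def format_gauge(name: str, value: float, tags: Optional[Iterable[str]] = None) -> str:
    """Build a DogStatsD gauge datagram: <name>:<value>|g|#<tag>,<tag>."""
    datagram = f"{name}:{value}|g"
    tag_list = list(tags or [])
    if tag_list:
        datagram += "|#" + ",".join(tag_list)
    return datagram


def send_gauge(name: str, value: float, tags: Optional[Iterable[str]] = None,
               host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one gauge to a DogStatsD listener (fire and forget over UDP)."""
    payload = format_gauge(name, value, tags).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Tag keys here are illustrative; use whatever tagging scheme the deployment defines.
    send_gauge("canary.system.cpu_percent", 15.2,
               ["datacenter:dc1", "availability_zone:zone-a"])
```

Because UDP is connectionless, `send_gauge` does not report whether an agent actually received the datagram.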
- High Error Rate: Error rate > 10% for 5 minutes
- Instance Down: No health checks for 10 minutes
- High Latency: Inter-DC latency > 1000ms for 10 minutes
- Resource Exhaustion: CPU > 90% or Memory > 95% for 15 minutes
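
Each alert above pairs a threshold with a duration, so a single spike should not fire it. The sketch below shows that "sustained breach" idea locally with a hypothetical `SustainedCondition` class. It is only an illustration of the logic: Datadog monitors evaluate their own queries over time windows and may behave differently.

```python
from typing import Optional


class SustainedCondition:
    """Fires once a value has stayed above `threshold` for `duration_s` seconds."""

    def __init__(self, threshold: float, duration_s: float) -> None:
        self.threshold = threshold
        self.duration_s = duration_s
        self._breach_start: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one sample; return True if the condition is currently firing."""
        if value <= self.threshold:
            self._breach_start = None  # any recovery resets the clock
            return False
        if self._breach_start is None:
            self._breach_start = timestamp
        return timestamp - self._breach_start >= self.duration_s


if __name__ == "__main__":
    # "High Error Rate: Error rate > 10% for 5 minutes"
    high_error_rate = SustainedCondition(threshold=10.0, duration_s=5 * 60)
    for ts, rate in [(0, 12.0), (150, 11.0), (300, 15.0), (360, 5.0)]:
        print(ts, rate, high_error_rate.observe(ts, rate))
```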
- Pros: Easy setup, consistent environment, quick development
- Cons: Limited OS-level monitoring, single-host deployment
# Start all services
./docker-deploy.sh start
# View logs
./docker-deploy.sh logs canary-app
# Scale services
./docker-deploy.sh scale canary-app=3
# Health check
./docker-deploy.sh health
- Pros: Multi-AZ deployment, native OpenStack integration, scalable
- Cons: Requires OpenStack environment, more complex setup
# Deploy across multiple AZs
./deploy.sh deploy -d dc1 -a "nova,zone-a,zone-b" -c 2
# Update deployment
./deploy.sh update -n canary-prod
# Check status
./deploy.sh status -n canary-prod
# View logs
./deploy.sh logs -n canary-prod
For custom environments:
# Install dependencies
pip install -r requirements.txt
# Start canary application
gunicorn --bind 0.0.0.0:8080 app:app
# Start traffic generator (separate terminal)
python traffic_generator.py
# Start system monitor (separate terminal)
python system_monitor.py
# Unit tests
python -m pytest tests/
# Integration tests
python -m pytest tests/integration/
# Load tests
python -m pytest tests/load/
# Build Docker image
docker build -t canary:latest .
# Build with custom tag
docker build -t canary:v1.2.3 .
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
# Check logs
./docker-deploy.sh logs canary-app
# Check container status
docker ps -a | grep canary
# Restart services
./docker-deploy.sh restart
# Check system resources
curl localhost:8080/health/detailed
# View container stats
docker stats canary-app
# Test connectivity
curl localhost:8080/connectivity
# Check network configuration
docker network ls
docker network inspect openstack-canary_canary-network
# Verify the API key is set (prints only its first 8 characters)
echo "$DD_API_KEY" | cut -c1-8
# Check agent status
docker exec datadog-agent agent status
# Restart agent
docker restart datadog-agent
- Docker: `docker logs <container_name>`
- OpenStack: `/var/log/canary/`
- System: `/var/log/syslog`
# Increase worker processes
export GUNICORN_WORKERS=4
# Adjust traffic intervals
export TRAFFIC_INTERVAL=30
# Scale horizontally
./deploy.sh scale 5
# Monitor resource usage
curl localhost:8080/metrics | grep canary_
# Adjust monitoring intervals
export MONITORING_INTERVAL=60
- Network Security: Use security groups to restrict access
- API Keys: Store Datadog keys securely and supply them through environment variables
- Container Security: Run containers as non-root user
- TLS: Enable HTTPS for production deployments
- Monitoring: Monitor for unusual traffic patterns
- Weekly: Review error rates and performance metrics
- Monthly: Update Docker images and dependencies
- Quarterly: Review and update alert thresholds
# Backup configuration and data
./docker-deploy.sh backup
# Restore from backup
./docker-deploy.sh restore /path/to/backup
# Update Docker deployment
./docker-deploy.sh update
# Update OpenStack deployment
./deploy.sh update -n canary-prod
For issues and questions:
- Check the Troubleshooting section
- Review logs for error messages
- Check Datadog dashboard for system health
- Contact the infrastructure team
This project is licensed under the MIT License - see the LICENSE file for details.
- Initial release
- Multi-AZ deployment support
- Datadog integration
- Docker containerization
- OpenStack Heat templates
- Comprehensive monitoring and alerting