Author: Shaid T
This project demonstrates a comprehensive observability stack integrating:
- 11 microservices (Google's microservices-demo)
- Full monitoring (Prometheus, Grafana, Loki, Jaeger)
- Chaos engineering (Chaos Mesh)
- Multi-channel alerting (Slack, AlertManager)
- AI-powered incident management (DrDroid platform)
- Database persistence (PostgreSQL)
| Feature | Description | Status |
|---|---|---|
| Microservices Demo | 11-service e-commerce application | ✅ Production |
| Prometheus Metrics | Real-time metrics collection & alerting | ✅ Active |
| Grafana Dashboards | Business & technical metrics visualization | ✅ Live |
| Loki Log Aggregation | Centralized logging with Promtail | ✅ Streaming |
| Jaeger Tracing | Distributed request tracing | ✅ Bonus Feature |
| PostgreSQL Database | Persistent order data storage | ✅ Integrated |
| Chaos Engineering | 4 fault injection scenarios | ✅ Active |
| Multi-Channel Alerts | Slack + DrDroid integration | ✅ Connected |
| DrDroid AI Platform | Intelligent incident analysis | ✅ Integrated |

Architecture:

```
k3d Cluster (3 nodes)
│
├─ Microservices Layer
│  ├─ frontend
│  ├─ cartservice
│  ├─ checkoutservice
│  ├─ productcatalogservice
│  └─ 7 more services...
│
├─ Data Layer
│  └─ PostgreSQL (Order persistence)
│
├─ Observability Stack
│  ├─ Prometheus → Metrics & Alerting
│  ├─ Grafana → Dashboards & Visualization
│  ├─ Loki → Log Aggregation
│  ├─ Jaeger → Distributed Tracing
│  └─ AlertManager → Alert Routing
│
└─ Chaos Engineering
   └─ Chaos Mesh → Fault Injection
```
Alert flow:

```
Chaos Experiment
      ↓
Metrics Spike (CPU/Memory/Errors)
      ↓
Prometheus Scrapes (every 15s)
      ↓
Alert Rule Evaluates (2min threshold)
      ↓
AlertManager Routes Alert
      ↓
  ┌───┴────┐
  ↓        ↓
Slack    DrDroid
(Team)   (AI Analysis)
```
| Alert | Condition | Severity | Action |
|---|---|---|---|
| HighPodCPU | CPU > 80% for 2min | Warning | Slack notification |
| HighPodMemory | Memory > 500MB | Warning | Slack notification |
| PodNotRunning | Pod not in Running state | Critical | Slack + Investigation |
| PodFrequentRestarts | >3 restarts in 5min | Warning | Auto-remediation trigger |
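
These rules are applied from `manifests/alerting/prometheus-rules-patch.yaml` (step 8 of the setup below). As a rough sketch, a PrometheusRule covering the first two alerts could look like the following; the metric expressions, the `release: kube-prometheus` label, and the rule/group names are assumptions based on the table, not the repository's actual file:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-health-alerts          # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus       # assumed: default ruleSelector matches the Helm release label
spec:
  groups:
    - name: pod-health
      rules:
        - alert: HighPodCPU
          expr: sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod) > 0.8
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} CPU above 80% for 2 minutes"
        - alert: HighPodMemory
          expr: sum(container_memory_usage_bytes{container!=""}) by (namespace, pod) > 500 * 1024 * 1024
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} memory above 500MB"
```

Once loaded, the rules show up under Status → Rules in the Prometheus UI.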
Chaos experiments:

Purpose: Test high resource utilization handling
Target: Frontend service
Expected Behavior:
- CPU spikes above 80%
- Prometheus alert fires after 2 minutes
- Slack notification sent
- DrDroid correlates with metrics
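
The experiment is defined in `manifests/chaos/cpu-stress-chaos.yaml`. A minimal Chaos Mesh StressChaos sketch along these lines would reproduce the behavior above; the label selector, worker count, load, and duration are assumptions rather than the repo's actual values:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
  namespace: default
spec:
  mode: one                       # stress a single matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: frontend               # assumed label on the frontend deployment
  stressors:
    cpu:
      workers: 2
      load: 90                    # percent CPU load per worker
  duration: "5m"
```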
kubectl apply -f manifests/chaos/cpu-stress-chaos.yaml

Purpose: Test Kubernetes self-healing
Target: Cart service
Expected Behavior:
- Pod terminated
- Kubernetes restarts pod automatically
- Brief service disruption
- Alert fires for pod downtime
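
A plausible sketch of `manifests/chaos/pod-kill-chaos.yaml` as a Chaos Mesh PodChaos resource (the label selector and mode are assumptions):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: cartservice-pod-kill
  namespace: default
spec:
  action: pod-kill
  mode: one                       # kill one matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: cartservice            # assumed label on the cart service
```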
kubectl apply -f manifests/chaos/pod-kill-chaos.yaml

Purpose: Test degraded network performance
Target: Checkout → Payment communication
Expected Behavior:
- 500ms latency injected
- Request timeouts increase
- User experience degrades
- Tracing shows bottleneck
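
`manifests/chaos/network-chaos.yaml` would be a NetworkChaos resource along these lines, injecting 500ms of delay from checkout to payment; the label selectors, jitter, and duration are assumptions:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-payment-delay
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: checkoutservice        # assumed source label
  direction: to
  target:
    mode: all
    selector:
      labelSelectors:
        app: paymentservice       # assumed destination label
  delay:
    latency: "500ms"
    jitter: "50ms"
  duration: "5m"
```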
kubectl apply -f manifests/chaos/network-chaos.yaml

Purpose: Test error handling & logging
Target: Product catalog service
Expected Behavior:
- HTTP 500 errors injected
- Error rate spikes in metrics
- Logs capture exceptions
- Alert fires for high error rate
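
A hedged sketch of what `manifests/chaos/http-chaos.yaml` might look like as an HTTPChaos resource that rewrites responses to HTTP 500. The port, path, and label selector are assumptions, and this approach only applies if the targeted endpoint speaks plain HTTP rather than gRPC:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: productcatalog-errors
  namespace: default
spec:
  mode: all
  selector:
    labelSelectors:
      app: productcatalogservice  # assumed label
  target: Response                # tamper with responses, not requests
  port: 3550                      # assumed service port
  path: "*"
  replace:
    code: 500                     # rewrite the response status code to 500
  duration: "5m"
```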
kubectl apply -f manifests/chaos/http-chaos.yaml

Stop any chaos experiment:
kubectl delete -f manifests/chaos/<chaos-file>.yaml
# Or delete all
kubectl delete podchaos,networkchaos,stresschaos,httpchaos --all -n default

Prerequisites:

- Docker
- kubectl
- helm
- k3d
# 1. Clone the repository
git clone https://github.com/OpShaid/drdroid-observability-stack.git
cd drdroid-observability-stack
# 2. Run setup script (installs dependencies)
./setup.sh
# 3. Deploy everything
./s.sh
# 4. Wait for all pods to be ready (2-3 minutes)
kubectl get pods --all-namespaces -w

Manual setup:

# 1. Create k3d cluster
k3d cluster create drdroid-demo --agents 2
# 2. Deploy microservices
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml
# 3. Install monitoring stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set grafana.adminPassword=drdroid2024
# 4. Install Loki for logs
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n monitoring \
--set grafana.enabled=false \
--set promtail.enabled=true
# 5. Deploy Jaeger for tracing
kubectl apply -f manifests/tracing/jaeger-all-in-one.yaml
# 6. Install Chaos Mesh
curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | bash
# 7. Deploy PostgreSQL database
kubectl apply -f manifests/database/postgres.yaml
# 8. Apply Prometheus alert rules
kubectl apply -f manifests/alerting/prometheus-rules-patch.yaml
# 9. Configure AlertManager for Slack
kubectl apply -f manifests/alerting/alertmanager-config.yaml

Access the stack:

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / drdroid2024 |
| Prometheus | http://localhost:9090 | - |
| AlertManager | http://localhost:9093 | - |
| Jaeger | http://localhost:16686 | - |
| Microservices Frontend | http://localhost:8080 | - |
Port-forward commands:
# Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-grafana 3000:80 &
# Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-prometheus 9090:9090 &
# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093 &
# Jaeger
kubectl port-forward -n default svc/jaeger-query 16686:16686 &
# Frontend
kubectl port-forward -n default svc/frontend 8080:80 &

# Expose Grafana externally
ngrok http 3000
# Expose Prometheus
ngrok http 9090
# Use these URLs in DrDroid integrations

| Integration | Status | URL/Configuration |
|---|---|---|
| Kubernetes | 🟢 Active | Agent deployed via proxy token |
| Grafana | 🟢 Active | https://xxx.ngrok-free.app |
| Prometheus | 🟢 Active | http://xxx.ngrok-free.app |
| Slack | 🟢 Active | #drdroid-alerts channel |
| GitHub | 🟢 Active | Repository connected |
Kubernetes Agent:
cd drd-vpc-agent
./deploy_k8s.sh <PROXY_TOKEN>

Grafana + Prometheus:
- Use ngrok URLs or IP-based endpoints
- Add in DrDroid platform under Integrations
Slack:
- Webhook URL configured in AlertManager
- Channel: #drdroid-alerts
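
For reference, the Slack side of the alert routing (what `manifests/alerting/alertmanager-config.yaml` configures in step 9 of the setup) boils down to a route plus a Slack receiver in the AlertManager configuration. A minimal sketch, with a placeholder webhook URL and assumed grouping intervals:

```yaml
global:
  resolve_timeout: 5m
route:
  receiver: slack-notifications
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        channel: '#drdroid-alerts'
        send_resolved: true
        title: '{{ .CommonLabels.alertname }} ({{ .CommonLabels.severity }})'
        text: '{{ .CommonAnnotations.summary }}'
```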
Grafana dashboards:

- Kubernetes Cluster Overview
  - CPU, Memory, Network across all nodes
  - Pod count and status
  - Resource utilization trends
- Microservices Performance
  - Request rate per service
  - Latency percentiles (p50, p95, p99)
  - Error rates
- Business Metrics (Custom)
  - Total orders processed
  - Order success rate
  - Revenue per hour
  - Checkout conversion funnel
- Alert Dashboard
  - Active alerts by severity
  - Alert frequency over time
  - MTTD and MTTR metrics
Key PromQL queries:

# CPU Usage
rate(container_cpu_usage_seconds_total[5m])
# Memory Usage
container_memory_usage_bytes
# Request Rate
rate(http_requests_total[5m])
# Error Rate
rate(http_requests_total{status=~"5.."}[5m])
# Pod Restarts
kube_pod_container_status_restarts_total
Connection Details:
- Host: postgres-service.default.svc.cluster.local
- Port: 5432
- Database: orders
- User: postgres
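
The database is deployed from `manifests/database/postgres.yaml` (step 7 of the setup). A stripped-down sketch of what that manifest might contain; the Service and database names match the connection details above, while the image tag, secret name, and absence of persistent storage are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16                 # assumed image tag
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              value: orders
            - name: POSTGRES_USER
              value: postgres
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials # assumed secret name
                  key: password
---
apiVersion: v1
kind: Service
metadata:
  name: postgres-service
  namespace: default
spec:
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
```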
Schema:
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id VARCHAR(255),
order_total DECIMAL(10,2),
items JSONB,
created_at TIMESTAMP DEFAULT NOW()
);

Query Orders:
kubectl exec -it <postgres-pod> -n default -- psql -U postgres -d orders -c "SELECT * FROM orders LIMIT 10;"

End-to-end demo workflow:

# 1. Trigger chaos
kubectl apply -f manifests/chaos/cpu-stress-chaos.yaml
# 2. Monitor in Grafana
# Open: http://localhost:3000
# Navigate to: Kubernetes / Compute Resources / Cluster
# 3. Wait for alert (2-3 minutes)
# Check: http://localhost:9090/alerts
# 4. Verify Slack notification
# Check #drdroid-alerts channel
# 5. Check DrDroid incident
# Open: https://aiops.drdroid.io/incidents
# 6. Clean up
kubectl delete -f manifests/chaos/cpu-stress-chaos.yaml

Production-readiness checklist:

✅ High availability deployments
✅ Resource limits and requests configured
✅ Health checks and readiness probes
✅ Structured logging with correlation IDs
✅ Metrics instrumentation
✅ Alert rules with proper thresholds
Future enhancements:

- Persistent Storage: Thanos for long-term Prometheus metrics, S3 for Loki
- High Availability: Multi-replica AlertManager, Grafana, Prometheus
- Security: Vault for secrets, RBAC policies, network policies, mTLS
- Disaster Recovery: Velero for cluster backups, cross-region replication
- Cost Optimization: OpenCost integration, resource right-sizing
- Distributed Tracing: Full service instrumentation with OpenTelemetry
- Incident Management: PagerDuty/Opsgenie integration with on-call rotations
- CI/CD: ArgoCD for GitOps deployments
- Service Mesh: Istio for advanced traffic management and security
Troubleshooting:

Pods not starting:
kubectl get pods --all-namespaces
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>

Grafana not accessible:
kubectl port-forward -n monitoring svc/kube-prometheus-grafana 3000:80
# Access: http://localhost:3000

Alerts not firing:
# Check Prometheus targets
kubectl port-forward -n monitoring svc/kube-prometheus-prometheus 9090:9090
# Open: http://localhost:9090/targets
# Check AlertManager
kubectl logs -n monitoring alertmanager-kube-prometheus-kube-prome-alertmanager-0

Slack notifications not working:
# Verify webhook URL
kubectl get secret -n monitoring alertmanager-kube-prometheus-alertmanager -o yaml
# Test webhook manually
curl -X POST -H 'Content-type: application/json' \
--data '{"text":"Test alert"}' \
https://hooks.slack.com/services/YOUR/WEBHOOK/URL

Acknowledgements:

- Google Cloud Platform - Microservices demo application
- Prometheus Community - Monitoring ecosystem
- Grafana Labs - Visualization platform
- Chaos Mesh - Chaos engineering platform
- DrDroid - AI-powered incident management