
πŸš€ DrDroid Observability Stack

Author: Shaid T


🎯 Project Overview

This project demonstrates a comprehensive observability stack integrating:

  • 11 microservices (Google's microservices-demo)
  • Full monitoring (Prometheus, Grafana, Loki, Jaeger)
  • Chaos engineering (Chaos Mesh)
  • Multi-channel alerting (Slack, AlertManager)
  • AI-powered incident management (DrDroid platform)
  • Database persistence (PostgreSQL)

✨ Features

| Feature | Description | Status |
|---|---|---|
| πŸ“Š Microservices Demo | 11-service e-commerce application | βœ… Production |
| πŸ” Prometheus Metrics | Real-time metrics collection & alerting | βœ… Active |
| πŸ“ˆ Grafana Dashboards | Business & technical metrics visualization | βœ… Live |
| πŸ“ Loki Log Aggregation | Centralized logging with Promtail | βœ… Streaming |
| 🎯 Jaeger Tracing | Distributed request tracing | βœ… Bonus Feature |
| πŸ’Ύ PostgreSQL Database | Persistent order data storage | βœ… Integrated |
| πŸŒͺ️ Chaos Engineering | 4 fault injection scenarios | βœ… Active |
| 🚨 Multi-Channel Alerts | Slack + DrDroid integration | βœ… Connected |
| πŸ€– DrDroid AI Platform | Intelligent incident analysis | βœ… Integrated |

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              k3d Cluster (3 nodes)                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                         β”‚
β”‚  Microservices Layer                                    β”‚
β”‚  β”œβ”€ frontend                                            β”‚
β”‚  β”œβ”€ cartservice                                         β”‚
β”‚  β”œβ”€ checkoutservice                                     β”‚
β”‚  β”œβ”€ productcatalogservice                               β”‚
β”‚  └─ 7 more services...                                  β”‚
β”‚                                                         β”‚
β”‚  Data Layer                                             β”‚
β”‚  └─ PostgreSQL (Order persistence)                      β”‚
β”‚                                                         β”‚
β”‚  Observability Stack                                    β”‚
β”‚  β”œβ”€ Prometheus β†’ Metrics & Alerting                     β”‚
β”‚  β”œβ”€ Grafana β†’ Dashboards & Visualization                β”‚
β”‚  β”œβ”€ Loki β†’ Log Aggregation                              β”‚
β”‚  β”œβ”€ Jaeger β†’ Distributed Tracing                        β”‚
β”‚  └─ AlertManager β†’ Alert Routing                        β”‚
β”‚                                                         β”‚
β”‚  Chaos Engineering                                      β”‚
β”‚  └─ Chaos Mesh β†’ Fault Injection                        β”‚
β”‚                                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

🚨 Alert Pipeline

Chaos Experiment
      ↓
Metrics Spike (CPU/Memory/Errors)
      ↓
Prometheus Scrapes (every 15s)
      ↓
Alert Rule Evaluates (2min threshold)
      ↓
AlertManager Routes Alert
      ↓
  β”Œβ”€β”€β”€β”΄β”€β”€β”€β”€β”
  ↓        ↓
Slack   DrDroid
(Team)  (AI Analysis)

Alert Rules

| Alert | Condition | Severity | Action |
|---|---|---|---|
| HighPodCPU | CPU > 80% for 2min | Warning | Slack notification |
| HighPodMemory | Memory > 500MB | Warning | Slack notification |
| PodNotRunning | Pod not in Running state | Critical | Slack + Investigation |
| PodFrequentRestarts | >3 restarts in 5min | Warning | Auto-remediation trigger |
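
These rules ship in manifests/alerting/prometheus-rules-patch.yaml. As a rough sketch of the shape, here is what the HighPodCPU rule could look like as a PrometheusRule resource picked up by the kube-prometheus-stack operator. The resource name, release label, and exact expression are illustrative assumptions, not copied from the repo:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-alert-rules            # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus       # lets the operator's rule selector find it
spec:
  groups:
    - name: pod.rules
      rules:
        - alert: HighPodCPU
          # roughly 80% of one core, sustained for 2 minutes
          expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) > 0.8
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} CPU above 80% for 2 minutes"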

πŸŒͺ️ Chaos Engineering Scenarios

1. CPU Stress Test

Purpose: Test how services behave under sustained high CPU utilization
Target: Frontend service
Expected Behavior:

  • CPU spikes above 80%
  • Prometheus alert fires after 2 minutes
  • Slack notification sent
  • DrDroid correlates with metrics
kubectl apply -f manifests/chaos/cpu-stress-chaos.yaml
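
A plausible shape for cpu-stress-chaos.yaml, as a Chaos Mesh StressChaos resource. The label selector, worker count, and duration here are assumptions; check the file in manifests/chaos/ for the real values:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: frontend-cpu-stress        # hypothetical name
  namespace: default
spec:
  mode: one                        # target a single matching pod
  selector:
    labelSelectors:
      app: frontend                # assumed pod label
  stressors:
    cpu:
      workers: 2                   # busy-loop workers
      load: 100                    # each pinned at full load
  duration: "5m"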

2. Pod Kill Test

Purpose: Test Kubernetes self-healing
Target: Cart service
Expected Behavior:

  • Pod terminated
  • Kubernetes restarts pod automatically
  • Brief service disruption
  • Alert fires for pod downtime
kubectl apply -f manifests/chaos/pod-kill-chaos.yaml
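
pod-kill-chaos.yaml is presumably a PodChaos resource along these lines (resource name and selector assumed):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: cart-pod-kill              # hypothetical name
  namespace: default
spec:
  action: pod-kill                 # terminate the pod; the Deployment recreates it
  mode: one
  selector:
    labelSelectors:
      app: cartservice             # assumed pod label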

3. Network Latency Test

Purpose: Test resilience to degraded network performance
Target: Checkout β†’ Payment communication
Expected Behavior:

  • 500ms latency injected
  • Request timeouts increase
  • User experience degrades
  • Tracing shows bottleneck
kubectl apply -f manifests/chaos/network-chaos.yaml
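
network-chaos.yaml likely uses a NetworkChaos delay action scoped to checkout β†’ payment traffic; a sketch with assumed names and labels:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-payment-delay     # hypothetical name
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: checkoutservice         # assumed pod label
  direction: to                    # only traffic toward the target
  target:
    mode: all
    selector:
      labelSelectors:
        app: paymentservice        # assumed pod label
  delay:
    latency: "500ms"               # matches the scenario above
  duration: "5m"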

4. HTTP Error Injection

Purpose: Test error handling & logging
Target: Product catalog service
Expected Behavior:

  • HTTP 500 errors injected
  • Error rate spikes in metrics
  • Logs capture exceptions
  • Alert fires for high error rate
kubectl apply -f manifests/chaos/http-chaos.yaml
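
http-chaos.yaml is presumably an HTTPChaos resource that rewrites response codes. Note that in the upstream microservices-demo, productcatalogservice speaks gRPC, so the real manifest may intercept traffic differently; this sketch assumes a plain-HTTP port and labels:

apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: catalog-http-errors        # hypothetical name
  namespace: default
spec:
  mode: all
  selector:
    labelSelectors:
      app: productcatalogservice   # assumed pod label
  target: Response                 # act on responses, not requests
  port: 3550                       # assumed container port
  replace:
    code: 500                      # rewrite responses to HTTP 500
  duration: "5m"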

Stop any chaos experiment:

kubectl delete -f manifests/chaos/<chaos-file>.yaml
# Or delete all
kubectl delete podchaos,networkchaos,stresschaos,httpchaos --all -n default

πŸš€ Quick Start

Prerequisites

  • Docker
  • kubectl
  • helm
  • k3d

Installation

# 1. Clone the repository
git clone https://github.com/OpShaid/drdroid-observability-stack.git
cd drdroid-observability-stack

# 2. Run setup script (installs dependencies)
./setup.sh

# 3. Deploy everything
./s.sh

# 4. Wait for all pods to be ready (2-3 minutes)
kubectl get pods --all-namespaces -w

Manual Setup

# 1. Create k3d cluster
k3d cluster create drdroid-demo --agents 2

# 2. Deploy microservices
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml

# 3. Install monitoring stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword=drdroid2024

# 4. Install Loki for logs
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n monitoring \
  --set grafana.enabled=false \
  --set promtail.enabled=true

# 5. Deploy Jaeger for tracing
kubectl apply -f manifests/tracing/jaeger-all-in-one.yaml

# 6. Install Chaos Mesh
curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | bash

# 7. Deploy PostgreSQL database
kubectl apply -f manifests/database/postgres.yaml

# 8. Apply Prometheus alert rules
kubectl apply -f manifests/alerting/prometheus-rules-patch.yaml

# 9. Configure AlertManager for Slack
kubectl apply -f manifests/alerting/alertmanager-config.yaml
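
For reference, alertmanager-config.yaml plausibly takes the shape of an AlertmanagerConfig resource that routes everything to Slack. The resource name and the Secret holding the webhook URL are assumptions:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: slack-routing              # hypothetical name
  namespace: monitoring
spec:
  route:
    receiver: slack
    groupBy: ["alertname", "namespace"]
  receivers:
    - name: slack
      slackConfigs:
        - channel: "#drdroid-alerts"
          sendResolved: true
          apiURL:
            name: slack-webhook    # hypothetical Secret name
            key: url               # key containing the webhook URL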

🌐 Access Services

Local Access

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / drdroid2024 |
| Prometheus | http://localhost:9090 | - |
| AlertManager | http://localhost:9093 | - |
| Jaeger | http://localhost:16686 | - |
| Microservices Frontend | http://localhost:8080 | - |

Port-forward commands:

# Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-grafana 3000:80 &

# Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-prometheus 9090:9090 &

# AlertManager
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093 &

# Jaeger
kubectl port-forward -n default svc/jaeger-query 16686:16686 &

# Frontend
kubectl port-forward -n default svc/frontend 8080:80 &

External Access (via ngrok)

# Expose Grafana externally
ngrok http 3000

# Expose Prometheus
ngrok http 9090

# Use these URLs in DrDroid integrations

πŸ”— DrDroid Integrations

Connected Integrations

| Integration | Status | URL/Configuration |
|---|---|---|
| Kubernetes | 🟒 Active | Agent deployed via proxy token |
| Grafana | 🟒 Active | https://xxx.ngrok-free.app |
| Prometheus | 🟒 Active | http://xxx.ngrok-free.app |
| Slack | 🟒 Active | #drdroid-alerts channel |
| GitHub | 🟒 Active | Repository connected |

Integration Setup

Kubernetes Agent:

cd drd-vpc-agent
./deploy_k8s.sh <PROXY_TOKEN>

Grafana + Prometheus:

  • Use ngrok URLs or IP-based endpoints
  • Add in DrDroid platform under Integrations

Slack:

  • Webhook URL configured in AlertManager
  • Channel: #drdroid-alerts

πŸ“Š Monitoring & Dashboards

Pre-configured Dashboards

  1. Kubernetes Cluster Overview

    • CPU, Memory, Network across all nodes
    • Pod count and status
    • Resource utilization trends
  2. Microservices Performance

    • Request rate per service
    • Latency percentiles (p50, p95, p99)
    • Error rates
  3. Business Metrics (Custom)

    • Total orders processed
    • Order success rate
    • Revenue per hour
    • Checkout conversion funnel
  4. Alert Dashboard

    • Active alerts by severity
    • Alert frequency over time
    • MTTD and MTTR metrics

Key Metrics

# CPU Usage
rate(container_cpu_usage_seconds_total[5m])

# Memory Usage
container_memory_usage_bytes

# Request Rate
rate(http_requests_total[5m])

# Error Rate
rate(http_requests_total{status=~"5.."}[5m])

# Pod Restarts
kube_pod_container_status_restarts_total

πŸ’Ύ Database Integration

PostgreSQL Setup

Connection Details:

  • Host: postgres-service.default.svc.cluster.local
  • Port: 5432
  • Database: orders
  • User: postgres

Schema:

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(255),
    order_total DECIMAL(10,2),
    items JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

Query Orders:

kubectl exec -it <postgres-pod> -n default -- psql -U postgres -d orders -c "SELECT * FROM orders LIMIT 10;"

πŸ§ͺ Testing Scenarios

End-to-End Test

# 1. Trigger chaos
kubectl apply -f manifests/chaos/cpu-stress-chaos.yaml

# 2. Monitor in Grafana
# Open: http://localhost:3000
# Navigate to: Kubernetes / Compute Resources / Cluster

# 3. Wait for alert (2-3 minutes)
# Check: http://localhost:9090/alerts

# 4. Verify Slack notification
# Check #drdroid-alerts channel

# 5. Check DrDroid incident
# Open: https://aiops.drdroid.io/incidents

# 6. Clean up
kubectl delete -f manifests/chaos/cpu-stress-chaos.yaml

πŸ“ˆ Production Considerations

What's Production-Ready

βœ… High availability deployments
βœ… Resource limits and requests configured
βœ… Health checks and readiness probes
βœ… Structured logging with correlation IDs
βœ… Metrics instrumentation
βœ… Alert rules with proper thresholds

What Would Be Added for Production

  • Persistent Storage: Thanos for long-term Prometheus metrics, S3 for Loki
  • High Availability: Multi-replica AlertManager, Grafana, Prometheus
  • Security: Vault for secrets, RBAC policies, network policies, mTLS
  • Disaster Recovery: Velero for cluster backups, cross-region replication
  • Cost Optimization: OpenCost integration, resource right-sizing
  • Distributed Tracing: Full service instrumentation with OpenTelemetry
  • Incident Management: PagerDuty/Opsgenie integration with on-call rotations
  • CI/CD: ArgoCD for GitOps deployments
  • Service Mesh: Istio for advanced traffic management and security

πŸ› οΈ Troubleshooting

Common Issues

Pods not starting:

kubectl get pods --all-namespaces
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>

Grafana not accessible:

kubectl port-forward -n monitoring svc/kube-prometheus-grafana 3000:80
# Access: http://localhost:3000

Alerts not firing:

# Check Prometheus targets
kubectl port-forward -n monitoring svc/kube-prometheus-prometheus 9090:9090
# Open: http://localhost:9090/targets

# Check AlertManager
kubectl logs -n monitoring alertmanager-kube-prometheus-kube-prome-alertmanager-0

Slack notifications not working:

# Verify webhook URL
kubectl get secret -n monitoring alertmanager-kube-prometheus-alertmanager -o yaml

# Test webhook manually
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Test alert"}' \
  https://hooks.slack.com/services/YOUR/WEBHOOK/URL


πŸ™ Acknowledgments

