Intelligent Kubernetes log analysis with AI-powered recommendations
Automatically detect and diagnose errors in your Kubernetes pods using pattern matching and OpenAI. Get instant fix recommendations for OutOfMemory errors, connection timeouts, database failures, and 15+ other common issues.
- 🔍 17 Pre-configured Error Patterns - Detects OOM, connection timeouts, database errors, HTTP 5xx, disk issues, and more
- 🤖 AI-Powered Recommendations - OpenAI GPT-4o-mini provides actionable fix suggestions
- ⚡ Two Modes - CLI for ad-hoc debugging, continuous monitor for real-time alerting
- 🎯 Multi-Namespace Support - Monitor multiple namespaces simultaneously
- 🔔 Smart Alerting - Alert deduplication and severity-based filtering (CRITICAL, HIGH, MEDIUM, LOW)
- 💰 Cost Tracking - Shows token usage and estimated cost per analysis
- 🚀 Production Ready - Incremental log reading, proper logging, configurable intervals
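The alert deduplication and severity filtering described above could work roughly as follows. This is a minimal sketch, not the code in `monitor.py`: the `AlertGate` class, its method names, and the choice of `(namespace, pod, pattern)` as the dedupe key are all assumptions made for illustration.

```python
import time

# Severity levels named in the feature list, ranked from least to most urgent.
SEVERITY_ORDER = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


class AlertGate:
    """Hypothetical helper that decides whether an alert should be emitted."""

    def __init__(self, min_severity="LOW", dedupe_seconds=60, clock=time.monotonic):
        self.min_rank = SEVERITY_ORDER[min_severity.upper()]
        self.dedupe_seconds = dedupe_seconds
        self.clock = clock  # injectable so the window can be tested without sleeping
        self._last_sent = {}  # (namespace, pod, pattern) -> time of last emitted alert

    def should_alert(self, namespace, pod, pattern, severity):
        # Severity filter: drop anything below the configured threshold.
        if SEVERITY_ORDER[severity.upper()] < self.min_rank:
            return False

        # Deduplication: suppress repeats of the same key inside the window.
        key = (namespace, pod, pattern)
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.dedupe_seconds:
            return False

        self._last_sent[key] = now
        return True
```

The same keyed-timestamp idea would also fit a per pod+error cooldown on LLM calls, using a longer window.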
k8s-log-monitor/
├── cli/ # CLI tool for ad-hoc debugging
│ ├── debug-logs.py
│ ├── requirements-cli.txt
│ └── CLI-USAGE.md
├── k8s/ # Kubernetes manifests
│ ├── configmap.yaml
│ ├── rbac.yaml
│ └── deployment.yaml
├── docker/ # Continuous monitor
│ ├── Dockerfile
│ ├── monitor.py
│ └── requirements.txt
└── README.md
cd cli
python3 -m venv venv
source venv/bin/activate
pip install -r requirements-cli.txt
export OPENAI_API_KEY="your-key"
./debug-logs.py <pod-name> -n <namespace>
# Build
cd docker
docker build -t log-monitor:latest .
# Deploy
cd ../k8s
kubectl create namespace monitoring
kubectl apply -f configmap.yaml
kubectl apply -f rbac.yaml
kubectl apply -f deployment.yaml
# View logs
kubectl logs -f -n monitoring deployment/log-monitor
Edit k8s/configmap.yaml to add/modify patterns:
{
"name": "OutOfMemory",
"regex": "OutOfMemoryError|OOMKilled|out of memory",
"severity": "critical"
}
- TARGET_NAMESPACES - Comma-separated namespaces to monitor (default: "default")
- ENABLE_LLM_RECOMMENDATIONS - Enable AI recommendations (default: "false")
- OPENAI_API_KEY - Your OpenAI API key
- POLL_INTERVAL_SECONDS - Log polling interval (default: 30)
- ALERT_DEDUPE_SECONDS - Alert deduplication window (default: 60)
- LLM_COOLDOWN_SECONDS - LLM call cooldown per pod+error (default: 300)
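The environment variables above might be read along these lines. This is a sketch under stated assumptions, not the monitor's actual loader: `MonitorConfig` and `load_config` are hypothetical names, and the check that an API key is present when LLM recommendations are enabled is an assumption rather than documented behavior. The defaults are the ones listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorConfig:
    """Hypothetical container for the documented settings."""
    target_namespaces: list
    enable_llm: bool
    openai_api_key: str
    poll_interval_seconds: int
    alert_dedupe_seconds: int
    llm_cooldown_seconds: int


def load_config(env=None):
    """Build a MonitorConfig from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    # Split the comma-separated list and ignore stray whitespace or empty items.
    raw_namespaces = env.get("TARGET_NAMESPACES", "default")
    namespaces = [ns.strip() for ns in raw_namespaces.split(",") if ns.strip()]

    enable_llm = env.get("ENABLE_LLM_RECOMMENDATIONS", "false").strip().lower() == "true"
    api_key = env.get("OPENAI_API_KEY", "")
    # Assumption: fail fast instead of discovering the missing key on the first LLM call.
    if enable_llm and not api_key:
        raise ValueError("OPENAI_API_KEY must be set when ENABLE_LLM_RECOMMENDATIONS is true")

    return MonitorConfig(
        target_namespaces=namespaces,
        enable_llm=enable_llm,
        openai_api_key=api_key,
        poll_interval_seconds=int(env.get("POLL_INTERVAL_SECONDS", "30")),
        alert_dedupe_seconds=int(env.get("ALERT_DEDUPE_SECONDS", "60")),
        llm_cooldown_seconds=int(env.get("LLM_COOLDOWN_SECONDS", "300")),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the loader easy to exercise in tests.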
- Incident Response: Quickly diagnose production issues with AI recommendations
- Development: Catch errors early during local testing with Minikube
- Monitoring: Continuous alerting for critical errors across multiple namespaces
- Cost Optimization: Use the CLI for pay-per-use analysis when continuous monitoring is not needed
OutOfMemory • ConnectionTimeout • NetworkError • DatabaseError • AuthenticationFailure • AuthorizationFailure • HTTP5xx • HTTP4xx • DiskFull • ReadOnlyFilesystem • CrashLoopBackOff • ImagePullError • ProbeFailures • TLS/SSL Errors • ThreadDeadlock • Application Exceptions
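To illustrate how pattern entries of this kind can be applied to log lines, here is a minimal sketch. The OutOfMemory entry is the one shown for k8s/configmap.yaml; the HTTP5xx regex and its severity, the helper names `compile_patterns` and `scan`, and the case-insensitive matching are all assumptions for illustration and may differ from what the tool does.

```python
import re

PATTERNS = [
    # Entry as shown in the configmap example.
    {
        "name": "OutOfMemory",
        "regex": "OutOfMemoryError|OOMKilled|out of memory",
        "severity": "critical",
    },
    # Hypothetical entry: the real HTTP5xx regex and severity are not shown in this README.
    {
        "name": "HTTP5xx",
        "regex": r"\b5\d{2} (?:Internal Server Error|Bad Gateway|Service Unavailable|Gateway Timeout)",
        "severity": "high",
    },
]


def compile_patterns(patterns):
    """Compile each entry once so the regexes are reused across log lines."""
    return [
        (p["name"], p["severity"].upper(), re.compile(p["regex"], re.IGNORECASE))
        for p in patterns
    ]


def scan(lines, compiled):
    """Return one match record per (line, pattern) hit, with 1-based line numbers."""
    matches = []
    for line_no, line in enumerate(lines, start=1):
        for name, severity, rx in compiled:
            if rx.search(line):
                matches.append(
                    {"line": line_no, "pattern": name, "severity": severity, "text": line.rstrip()}
                )
    return matches
```

For example, calling `scan` on a list of log lines with `compile_patterns(PATTERNS)` yields records that an alerting or recommendation step could consume.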
Contributions welcome! Add new error patterns, improve AI prompts, or enhance the monitoring logic.
License: MIT