A production-ready multi-agent AI system that analyzes incidents using 6 specialized AI agents with autonomous tool calling, powered by OpenAI GPT-4o and Claude 3.5.
DevOps Incident Response Team is an intelligent multi-agent system that mimics how a real DevOps team investigates production incidents. When you report an incident, a Commander Agent deploys specialist agents (Log Analyzer, Database Expert, Network Specialist, Memory, Security, Performance) to investigate. Each agent autonomously calls diagnostic tools to gather real data, then provides expert analysis. The Commander synthesizes their findings into an actionable incident report.
- 🤖 6 Specialist Agents - Memory, Security, Performance, Log Analyzer, Database, Network experts
- 🛠️ Autonomous Tool Calling - Agents execute diagnostic tools to gather real production data
- 🧠 Intelligent Orchestration - Commander Agent automatically deploys relevant specialists
- 💰 Cost-Optimized - GPT-4o-mini for specialists, GPT-4o for synthesis (dual OpenAI/Claude support)
- 🌐 REST API + WebSocket - Production FastAPI backend with real-time updates
- 📊 Real-Time Insights - Beautiful terminal UI and WebSocket streaming
- 💡 Actionable Reports - Root cause analysis with prioritized action items
- 🐳 Docker Ready - Full containerization with docker-compose orchestration
Incident Reported (REST API or CLI)
↓
┌──────────────────┐
│ Commander Agent │ ← Orchestrates the team
│ (GPT-4o) │
└──────────────────┘
↓
Analyzes incident
Deploys specialists
↓
┌─────────────────────────────────────────────────────────┐
│ 6 Specialist Agents (GPT-4o-mini) │
│ │
│ ┌──────────────┐ ┌───────────────┐ ┌─────────────┐ │
│ │ Log Analyzer │ │ Database Agent│ │Network Agent│ │
│ │ + Tools │ │ + Tools │ │ + Tools │ │
│ └──────────────┘ └───────────────┘ └─────────────┘ │
│ │
│ ┌──────────────┐ ┌───────────────┐ ┌─────────────┐ │
│ │Memory Agent │ │Security Agent │ │Performance │ │
│ │ (ChromaDB) │ │ + Tools │ │ Agent │ │
│ └──────────────┘ └───────────────┘ └─────────────┘ │
│ │
│ Each agent can call tools autonomously: │
│ - get_metrics, fetch_logs, query_database │
│ - check_service_status, search_documentation │
└─────────────────────────────────────────────────────────┘
↓
Results collected via WebSocket
↓
┌──────────────────┐
│ Synthesizer │ ← Combines findings
│ (GPT-4o) │
└──────────────────┘
↓
Incident Report (JSON + Real-time streaming)
- Root Cause
- Action Items
- Confidence Score
- Cost & Metrics
| Agent | Expertise | Model | Tools | Purpose |
|---|---|---|---|---|
| Commander | Orchestration | GPT-4o | None | Routes incident to specialists, coordinates investigation |
| Log Analyzer | Logs & Stack Traces | GPT-4o-mini | fetch_logs | Parses error messages, identifies patterns |
| Database Agent | Database Issues | GPT-4o-mini | query_database | Connection pools, queries, deadlocks |
| Network Agent | Connectivity | GPT-4o-mini | check_service_status | DNS, SSL, timeouts, HTTP errors |
| Memory Agent | Past Incidents | GPT-4o-mini | ChromaDB | Finds similar incidents, learns from history |
| Security Agent | Vulnerabilities | GPT-4o-mini | secret_scanner | Exposed secrets, OWASP Top 10, CVEs |
| Performance Agent | Bottlenecks | GPT-4o-mini | get_metrics | CPU, memory, latency, throughput analysis |
| Synthesizer | Root Cause Analysis | GPT-4o | None | Combines all findings into actionable report |
- Python 3.12+
- OpenAI API key - Primary provider
- Anthropic API key - Optional secondary provider
# Clone repository
git clone https://github.com/yourusername/devops-agent-team.git
cd devops-agent-team
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY (and optionally ANTHROPIC_API_KEY)

# Run the CLI
python main.py

# Start the FastAPI server
uvicorn api.app:app --host 0.0.0.0 --port 8000
# Or use Docker
docker-compose up --build

Access the API:
- Swagger docs: http://localhost:8000/docs
- Health check: http://localhost:8000/health
- WebSocket: ws://localhost:8000/ws/{incident_id}
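For reference, here is a hedged sketch of reporting an incident from Python; the endpoint path and payload fields are assumptions, so check the Swagger docs at /docs for the actual schema:

```python
# Sketch only: the endpoint path and field names are assumptions, not the documented API.
import requests

incident = {
    "description": "API endpoint returning 500 errors",
    "error_message": "psycopg2.pool.PoolExhaustedError: connection pool exhausted",
}

resp = requests.post("http://localhost:8000/incidents", json=incident, timeout=30)
resp.raise_for_status()
print(resp.json())  # expect an incident_id you can then follow at ws://localhost:8000/ws/{incident_id}
```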
# Terminal 1: Start backend
uvicorn api.app:app --host 0.0.0.0 --port 8000
# Terminal 2: Start frontend
cd frontend
npm install
npm run dev

Access the dashboard: http://localhost:3000
Description: API endpoint returning 500 errors
Error Message:
[ERROR] Database connection failed
psycopg2.pool.PoolExhaustedError: connection pool exhausted (size: 10, max: 10)
[WARN] Queue backing up: 145 pending requests
🚨 INCIDENT DETECTED
Description: API endpoint returning 500 errors
Time: 2025-01-15 10:23:45
🤖 Deploying 2 specialist agents
┌─ Log Analyzer ──────────────────────────────┐
│ ✅ Analysis Complete │
│ Confidence: 90% | Cost: $0.0023 | Time: 1.2s│
│ Suggestions: 3 │
└─────────────────────────────────────────────┘
┌─ Database Agent ────────────────────────────┐
│ ✅ Analysis Complete │
│ Confidence: 95% | Cost: $0.0019 | Time: 1.5s│
│ Suggestions: 4 │
└─────────────────────────────────────────────┘
🧠 Synthesizing findings from all agents...
📋 INCIDENT REPORT
Root Cause:
Database connection pool is undersized for current load.
The pool is configured with only 10 connections, which is
being exhausted during peak traffic, causing API requests
to fail.
Recommendations:
1. Immediately increase pool size from 10 to 50 connections
2. Configure pool timeout to 30 seconds with proper error handling
3. Add connection pool monitoring and alerting
4. Review slow queries that might be holding connections
5. Implement connection retry logic with exponential backoff
Total Cost: $0.0158 | Investigation Time: 4.2s
Typical investigation costs:
| Scenario | Agents Deployed | Total Cost | Time |
|---|---|---|---|
| Simple log error | 1 (Log Analyzer) | $0.003 | 1-2s |
| Database issue | 2 (Log + Database) | $0.008 | 2-3s |
| Complex multi-layer | 3-4 agents + Synthesis | $0.015-0.025 | 4-6s |
Per incident: ~$0.01-0.03 (1-3 cents)
Monthly estimates:
- Light usage (50 incidents): ~$0.50-1.50/month
- Medium usage (200 incidents): ~$2-6/month
- Heavy usage (1000 incidents): ~$10-30/month
Compare this to downtime costs: $14,056/minute for medium companies!
devops-agent-team/
├── agents/
│ ├── base_agent.py # Base class with OpenAI/Anthropic support
│ ├── commander_agent.py # Orchestrator & router
│ ├── log_analyzer_agent.py # Log analysis specialist
│ ├── database_agent.py # Database specialist
│ ├── network_agent.py # Network specialist
│ ├── memory_agent.py # Past incidents (ChromaDB)
│ ├── security_agent.py # Security & vulnerabilities
│ ├── performance_agent.py # Performance bottlenecks
│ └── __init__.py
├── tools/
│ └── base_tools.py # 5 production tools + execution engine
├── api/
│ ├── app.py # FastAPI application
│ ├── models.py # Pydantic request/response models
│ ├── database.py # SQLAlchemy models
│ ├── websocket.py # WebSocket manager
│ └── __init__.py
├── frontend/ # React dashboard
│ ├── src/
│ │ ├── components/
│ │ │ ├── Dashboard.jsx # Main dashboard
│ │ │ ├── IncidentForm.jsx # Incident submission
│ │ │ └── AgentCard.jsx # Agent status cards
│ │ ├── api.js # API client
│ │ ├── App.jsx
│ │ └── main.jsx
│ ├── package.json
│ └── vite.config.js
├── core/
│ ├── config.py # Configuration management
│ ├── logger.py # Rich logging setup
│ └── __init__.py
├── chroma_db/ # Vector database for Memory Agent
├── main.py # CLI interface
├── Dockerfile # Production container
├── docker-compose.yml # Orchestration
├── requirements.txt # Python dependencies
├── .env.example # Environment template
└── README.md # This file
Agents don't just analyze text - they gather real data by calling diagnostic tools:
Available Tools:
- `get_metrics(service, metric_type)` - CPU, memory, latency, error rates
- `fetch_logs(service, level)` - Recent error/warning logs
- `query_database(query_type)` - Connection pools, slow queries, locks
- `check_service_status(service)` - Health checks, uptime
- `search_documentation(query)` - Internal docs and runbooks
How it works:
# Performance Agent autonomously decides to gather metrics
Agent: "I need to check CPU and memory before diagnosing"
Tool Call: get_metrics("api-service", "cpu")
Result: {"current": "87.3%", "threshold": "80%", "status": "WARNING"}
Agent: "High CPU detected. Let me fetch recent logs..."
Tool Call: fetch_logs("api-service", "error")
# Agent uses tool results to make an informed diagnosis
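Under the hood, each tool is exposed to the model as a function schema. A minimal sketch of what the `get_metrics` declaration could look like in the OpenAI function-calling format (illustrative; the real definitions live in tools/base_tools.py):

```python
# Illustrative only -- see tools/base_tools.py for the actual schemas.
GET_METRICS_TOOL = {
    "type": "function",
    "function": {
        "name": "get_metrics",
        "description": "Fetch current metrics for a service",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {"type": "string", "description": "Service name, e.g. api-service"},
                "metric_type": {"type": "string", "enum": ["cpu", "memory", "latency", "error_rate"]},
            },
            "required": ["service", "metric_type"],
        },
    },
}

# Passed to the chat API so the specialist can call it autonomously, e.g.:
# client.chat.completions.create(model="gpt-4o-mini", messages=..., tools=[GET_METRICS_TOOL])
```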
The Commander Agent automatically determines which specialists to deploy based on incident keywords:

# Memory keywords trigger Memory Agent
"similar", "seen before", "past incident"
# Security keywords trigger Security Agent
"exposed", "leaked", "vulnerability", "CVE"
# Performance keywords trigger Performance Agent
"slow", "timeout", "high CPU", "memory leak"
# Database keywords trigger Database Agent
"connection pool", "deadlock", "query timeout"
# Network keywords trigger Network Agent
"SSL error", "DNS", "connection refused", "timeout"
# Always deploys Log Analyzer if error logs present

Switch between OpenAI and Anthropic models:
# .env configuration
DEFAULT_MODEL=gpt-4o # Commander & Synthesizer
SPECIALIST_MODEL=gpt-4o-mini # Specialist agents
# Or use Claude
DEFAULT_MODEL=claude-3-5-sonnet-20241022
SPECIALIST_MODEL=claude-3-haiku-20240307

- GPT-4o-mini for specialists ($0.15/$0.60 per million tokens) - Pattern matching, log parsing
- GPT-4o for synthesis ($2.50/$10.00 per million tokens) - Complex reasoning, root cause analysis
- Prompt caching - 50-90% cost reduction on repeated system prompts
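Prompt caching is provider-specific: OpenAI applies it automatically to long, repeated prompt prefixes, while Anthropic requires marking the cacheable block explicitly. A hedged sketch of the Anthropic variant (not necessarily how this project wires it up):

```python
# Sketch: cache a long, stable system prompt with Anthropic prompt caching.
# The model name and prompt text are placeholders.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a database incident specialist...",  # long, stable system prompt
            "cache_control": {"type": "ephemeral"},  # cached and reused across calls
        }
    ],
    messages=[{"role": "user", "content": "Diagnose: connection pool exhausted"}],
)
print(response.content[0].text)
```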
Watch investigations unfold in real-time:
const ws = new WebSocket('ws://localhost:8000/ws/incident_123');
ws.onmessage = (event) => {
const update = JSON.parse(event.data);
console.log(`${update.agent}: ${update.message}`);
// "Memory Agent: Found 3 similar incidents"
// "Performance Agent: Calling get_metrics tool..."
// "Commander: Investigation complete!"
};
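On the server side, these updates are fanned out by a per-incident connection manager. A minimal sketch of the pattern (assumed; see api/websocket.py for the actual implementation):

```python
# Sketch of a per-incident WebSocket broadcast manager (illustrative, not api/websocket.py).
from collections import defaultdict
from typing import DefaultDict, List

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

class ConnectionManager:
    def __init__(self) -> None:
        # incident_id -> sockets currently watching that incident
        self.connections: DefaultDict[str, List[WebSocket]] = defaultdict(list)

    async def connect(self, incident_id: str, ws: WebSocket) -> None:
        await ws.accept()
        self.connections[incident_id].append(ws)

    def disconnect(self, incident_id: str, ws: WebSocket) -> None:
        self.connections[incident_id].remove(ws)

    async def broadcast(self, incident_id: str, payload: dict) -> None:
        # Called as agents report progress
        for ws in self.connections[incident_id]:
            await ws.send_json(payload)

manager = ConnectionManager()

@app.websocket("/ws/{incident_id}")
async def incident_ws(ws: WebSocket, incident_id: str):
    await manager.connect(incident_id, ws)
    try:
        while True:
            await ws.receive_text()  # keep the connection alive
    except WebSocketDisconnect:
        manager.disconnect(incident_id, ws)
```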
- Real-time agent status updates
- Color-coded results (green = success, red = failure)
- Progress indicators and spinners
- Beautiful tables and panels
- Cost breakdown per agent
- ✅ Multi-agent architecture
- ✅ Commander orchestration
- ✅ 6 specialist agents (Log, Database, Network, Memory, Security, Performance)
- ✅ CLI interface
- ✅ Cost tracking
- ✅ Memory Agent with ChromaDB vector search
- ✅ Security Agent with secret scanning
- ✅ Performance Agent with bottleneck detection
- ✅ Autonomous tool calling framework (5 production tools)
- ✅ Dual OpenAI/Anthropic provider support
- ✅ FastAPI REST API with async processing
- ✅ WebSocket real-time updates
- ✅ SQLite database persistence
- ✅ Background task processing
- ✅ Incident history and analytics
- ✅ Docker containerization
- ✅ docker-compose orchestration
- ✅ React dashboard with Vite + TailwindCSS
- ✅ Real-time agent collaboration visualization
- ✅ WebSocket connection for live updates
- ✅ Tool call visualization in agent cards
- ✅ Incident submission form with presets
- ✅ Activity log with color-coded messages
- ✅ Final report display with recommendations
- Incident history browser and search
- Team collaboration features
- Integration with monitoring tools (DataDog, New Relic, Sentry)
- Slack/Discord bot integration
- Automated incident detection
- Preventive suggestions based on ML patterns
- Multi-tenant support
Better than a single LLM:
- Each agent has specialized expertise and context
- Parallel processing (future: run agents concurrently)
- Cost optimization (cheap models for simple tasks)
- Easier to extend (add new specialist agents)
- More explainable (see each agent's reasoning)
Real-world analogy:
Single LLM = One person doing everything
Multi-Agent = Specialized team working together
(How real DevOps teams work!)
- Incident Input → Commander receives incident data
- Analysis → Commander determines required specialists
- Deployment → Specialist agents analyze in their domain
- Collection → Commander gathers all findings
- Synthesis → Synthesizer creates unified report
- Output → Actionable incident report
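A self-contained sketch of that flow with stubbed specialists; the function names, keyword rules, and return values below are illustrative, not the project's actual code:

```python
# Stubbed end-to-end flow: route -> analyze -> collect -> synthesize.
from dataclasses import dataclass

@dataclass
class Finding:
    agent: str
    summary: str
    confidence: float

# Simplified stand-in for the Commander's keyword routing
ROUTING = {
    "connection pool": "Database Agent",
    "deadlock": "Database Agent",
    "dns": "Network Agent",
    "ssl": "Network Agent",
    "high cpu": "Performance Agent",
    "leaked": "Security Agent",
}

def commander(description: str) -> list[str]:
    """Steps 1-2: receive the incident and pick specialists."""
    text = description.lower()
    agents = {agent for keyword, agent in ROUTING.items() if keyword in text}
    agents.add("Log Analyzer")  # always deployed when error logs are present
    return sorted(agents)

def run_specialist(agent: str, description: str) -> Finding:
    """Step 3: a real specialist would call its tools and an LLM here."""
    return Finding(agent, f"{agent}: analyzed '{description[:40]}...'", 0.9)

def synthesize(findings: list[Finding]) -> dict:
    """Step 5: combine all findings into a single report."""
    return {
        "root_cause": " | ".join(f.summary for f in findings),
        "confidence": min(f.confidence for f in findings),
    }

incident = "API returning 500s: connection pool exhausted, high CPU on api-service"
findings = [run_specialist(a, incident) for a in commander(incident)]  # step 4: collection
print(synthesize(findings))  # step 6: actionable report
```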
# GPT-4o (Commander & Synthesizer)
Input: $2.50 / million tokens
Output: $10.00 / million tokens
# GPT-4o-mini (Specialist Agents)
Input: $0.15 / million tokens
Output: $0.60 / million tokens
# Typical incident:
3 specialist agents (2000 input, 500 output each):
- Input: 3 × 2000 × $0.15/1M = $0.0009
- Output: 3 × 500 × $0.60/1M = $0.0009
1 Commander + 1 Synthesizer (3000 input, 800 output each):
- Input: 2 × 3000 × $2.50/1M = $0.015
- Output: 2 × 800 × $10.00/1M = $0.016
Total = ~$0.033 per investigation (3.3 cents)

# Build and run with docker-compose
docker-compose up --build
# Run in background
docker-compose up -d
# View logs
docker-compose logs -f api
# Stop
docker-compose down
# Health check
curl http://localhost:8000/health

Production considerations:
- Mount persistent volumes for chroma_db/ and incidents.db
- Set API keys via environment variables
- Use reverse proxy (nginx) for SSL termination
- Scale with Kubernetes or Docker Swarm if needed
pytest tests/
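As a starting point, a minimal API test could look like this (a sketch assuming pytest and FastAPI's TestClient; adjust the import path to match api/app.py):

```python
# tests/test_health.py -- illustrative; assumes the /health endpoint listed above.
from fastapi.testclient import TestClient

from api.app import app

client = TestClient(app)

def test_health_check():
    response = client.get("/health")
    assert response.status_code == 200
```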
To add a new specialist agent, subclass BaseAgent:

from typing import Dict, List

from agents.base_agent import BaseAgent

class MyNewAgent(BaseAgent):
    def __init__(self):
        super().__init__(name="My Agent", model="claude-3-haiku-20240307")

    def get_system_prompt(self) -> str:
        return """You are an expert in..."""

    def get_tools(self) -> List[Dict]:
        return []  # Tools your agent can use

Register in commander_agent.py:
self.my_agent = MyNewAgent()

MIT License - see LICENSE file
Said - Full-Stack & AI Developer
- 🎯 Purpose: Full-stack portfolio project showcasing production-ready multi-agent AI systems
- 🛠️ Skills Demonstrated:
- AI/ML: Multi-agent architecture, autonomous tool calling, LLM orchestration
- Backend: FastAPI, WebSocket real-time updates, SQLAlchemy ORM, async Python
- Frontend: React 18, Vite, TailwindCSS, WebSocket client, real-time state management
- APIs: OpenAI GPT-4o, Anthropic Claude 3.5, function calling for both providers
- Databases: ChromaDB vector database, SQLite persistence
- DevOps: Docker containerization, docker-compose orchestration
- Architecture: Clean code, separation of concerns, RESTful APIs
- Cost Optimization: Model selection, prompt caching, efficient token usage
- Real-world Problem Solving: Production incident response workflows
- OpenAI - GPT-4o and GPT-4o-mini models
- Anthropic - Claude 3.5 Sonnet & Claude 3 Haiku
- FastAPI - Modern async web framework
- ChromaDB - Vector database for embeddings
- Rich - Beautiful terminal UI
- Pydantic - Data validation
- SQLAlchemy - SQL toolkit and ORM
Built with ❤️ by Said
Turning AI agents into a DevOps dream team