
DevOps Incident Response Team 🚀

A production-ready multi-agent AI system that analyzes incidents using 6 specialized AI agents with autonomous tool calling, powered by OpenAI GPT-4o and Claude 3.5.


🎯 Overview

DevOps Incident Response Team is an intelligent multi-agent system that mimics how a real DevOps team investigates production incidents. When you report an incident, a Commander Agent deploys specialist agents (Log Analyzer, Database, Network, Memory, Security, Performance) to investigate. Each agent autonomously calls diagnostic tools to gather real data, then provides expert analysis. The Commander synthesizes their findings into an actionable incident report.

Key Features

  • 🤖 6 Specialist Agents - Memory, Security, Performance, Log Analyzer, Database, Network experts
  • 🛠️ Autonomous Tool Calling - Agents execute diagnostic tools to gather real production data
  • 🧠 Intelligent Orchestration - Commander Agent automatically deploys relevant specialists
  • 💰 Cost-Optimized - GPT-4o-mini for specialists, GPT-4o for synthesis (dual OpenAI/Claude support)
  • 🌐 REST API + WebSocket - Production FastAPI backend with real-time updates
  • 📊 Real-Time Insights - Beautiful terminal UI and WebSocket streaming
  • 💡 Actionable Reports - Root cause analysis with prioritized action items
  • 🐳 Docker Ready - Full containerization with docker-compose orchestration

🏗️ Architecture

Incident Reported (REST API or CLI)
       ↓
┌──────────────────┐
│ Commander Agent  │  ← Orchestrates the team
│  (GPT-4o)        │
└──────────────────┘
       ↓
   Analyzes incident
   Deploys specialists
       ↓
┌─────────────────────────────────────────────────────────┐
│            6 Specialist Agents (GPT-4o-mini)            │
│                                                         │
│  ┌──────────────┐  ┌───────────────┐  ┌─────────────┐ │
│  │ Log Analyzer │  │ Database Agent│  │Network Agent│ │
│  │  + Tools     │  │   + Tools     │  │  + Tools    │ │
│  └──────────────┘  └───────────────┘  └─────────────┘ │
│                                                         │
│  ┌──────────────┐  ┌───────────────┐  ┌─────────────┐ │
│  │Memory Agent  │  │Security Agent │  │Performance  │ │
│  │ (ChromaDB)   │  │  + Tools      │  │   Agent     │ │
│  └──────────────┘  └───────────────┘  └─────────────┘ │
│                                                         │
│     Each agent can call tools autonomously:            │
│     - get_metrics, fetch_logs, query_database          │
│     - check_service_status, search_documentation       │
└─────────────────────────────────────────────────────────┘
       ↓
   Results collected via WebSocket
       ↓
┌──────────────────┐
│   Synthesizer    │  ← Combines findings
│  (GPT-4o)        │
└──────────────────┘
       ↓
  Incident Report (JSON + Real-time streaming)
  - Root Cause
  - Action Items
  - Confidence Score
  - Cost & Metrics

Agent Roles

| Agent | Expertise | Model | Tools | Purpose |
|---|---|---|---|---|
| Commander | Orchestration | GPT-4o | None | Routes incident to specialists, coordinates investigation |
| Log Analyzer | Logs & stack traces | GPT-4o-mini | fetch_logs | Parses error messages, identifies patterns |
| Database Agent | Database issues | GPT-4o-mini | query_database | Connection pools, queries, deadlocks |
| Network Agent | Connectivity | GPT-4o-mini | check_service_status | DNS, SSL, timeouts, HTTP errors |
| Memory Agent | Past incidents | GPT-4o-mini | ChromaDB | Finds similar incidents, learns from history |
| Security Agent | Vulnerabilities | GPT-4o-mini | secret_scanner | Exposed secrets, OWASP Top 10, CVEs |
| Performance Agent | Bottlenecks | GPT-4o-mini | get_metrics | CPU, memory, latency, throughput analysis |
| Synthesizer | Root cause analysis | GPT-4o | None | Combines all findings into an actionable report |

🚀 Quick Start

Prerequisites

  • Python 3.12+
  • OpenAI API key (from platform.openai.com) - Primary provider
  • Anthropic API key (from console.anthropic.com) - Optional secondary provider

Installation

# Clone repository
git clone https://github.com/yourusername/devops-agent-team.git
cd devops-agent-team

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY (and optionally ANTHROPIC_API_KEY)

Option 1: Run via CLI

python main.py

Option 2: Run via REST API

# Start the FastAPI server
uvicorn api.app:app --host 0.0.0.0 --port 8000

# Or use Docker
docker-compose up --build

Access the API: interactive docs at http://localhost:8000/docs (Swagger UI) or http://localhost:8000/redoc.

Option 3: Run with React Frontend

# Terminal 1: Start backend
uvicorn api.app:app --host 0.0.0.0 --port 8000

# Terminal 2: Start frontend
cd frontend
npm install
npm run dev

Access the dashboard: http://localhost:3000
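Whichever interface you use, an incident submission is essentially a small structured payload. Here is a hedged sketch using a stdlib dataclass; the authoritative request schema is the Pydantic model in api/models.py, and the field names below are assumptions:

```python
from dataclasses import dataclass, field

# Illustrative payload shape only -- see api/models.py for the real schema.
@dataclass
class IncidentReport:
    description: str
    error_message: str = ""
    service: str = ""
    tags: list[str] = field(default_factory=list)

incident = IncidentReport(
    description="API endpoint returning 500 errors",
    error_message="psycopg2.pool.PoolExhaustedError: connection pool exhausted",
)
```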


📖 Example Investigation

Input: Database Connection Pool Exhaustion

Description: API endpoint returning 500 errors

Error Message:
[ERROR] Database connection failed
psycopg2.pool.PoolExhaustedError: connection pool exhausted (size: 10, max: 10)
[WARN] Queue backing up: 145 pending requests

Output: AI Agent Investigation

🚨 INCIDENT DETECTED
Description: API endpoint returning 500 errors
Time: 2025-01-15 10:23:45

🤖 Deploying 2 specialist agents

┌─ Log Analyzer ──────────────────────────────┐
│ ✅ Analysis Complete                        │
│ Confidence: 90% | Cost: $0.0023 | Time: 1.2s│
│ Suggestions: 3                              │
└─────────────────────────────────────────────┘

┌─ Database Agent ────────────────────────────┐
│ ✅ Analysis Complete                        │
│ Confidence: 95% | Cost: $0.0019 | Time: 1.5s│
│ Suggestions: 4                              │
└─────────────────────────────────────────────┘

🧠 Synthesizing findings from all agents...

📋 INCIDENT REPORT

Root Cause:
Database connection pool is undersized for current load.
The pool is configured with only 10 connections, which is
being exhausted during peak traffic, causing API requests
to fail.

Recommendations:
1. Immediately increase pool size from 10 to 50 connections
2. Configure pool timeout to 30 seconds with proper error handling
3. Add connection pool monitoring and alerting
4. Review slow queries that might be holding connections
5. Implement connection retry logic with exponential backoff

Total Cost: $0.0158 | Investigation Time: 4.2s
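Recommendation 5 above, retry with exponential backoff, can be sketched like this; the attempt count, base delay, and jitter range are illustrative values, not part of the project:

```python
import random
import time

def retry_with_backoff(operation, max_attempts: int = 5, base_delay: float = 0.5):
    """Call `operation`, retrying failures with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # 0.5s, 1s, 2s, 4s ... plus jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```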

💰 Cost Analysis

Typical investigation costs:

| Scenario | Agents deployed | Total cost | Time |
|---|---|---|---|
| Simple log error | 1 (Log Analyzer) | $0.003 | 1-2s |
| Database issue | 2 (Log + Database) | $0.008 | 2-3s |
| Complex multi-layer | 3-4 agents + synthesis | $0.015-0.025 | 4-6s |

Per incident: ~$0.01-0.03 (1-3 cents)

Monthly estimates:

  • Light usage (50 incidents): ~$0.50-1.50/month
  • Medium usage (200 incidents): ~$2-6/month
  • Heavy usage (1000 incidents): ~$10-30/month

Compare this to downtime costs, which some industry estimates put at $14,056 per minute for mid-sized companies.


🛠️ Project Structure

devops-agent-team/
├── agents/
│   ├── base_agent.py              # Base class with OpenAI/Anthropic support
│   ├── commander_agent.py         # Orchestrator & router
│   ├── log_analyzer_agent.py      # Log analysis specialist
│   ├── database_agent.py          # Database specialist
│   ├── network_agent.py           # Network specialist
│   ├── memory_agent.py            # Past incidents (ChromaDB)
│   ├── security_agent.py          # Security & vulnerabilities
│   ├── performance_agent.py       # Performance bottlenecks
│   └── __init__.py
├── tools/
│   └── base_tools.py              # 5 production tools + execution engine
├── api/
│   ├── app.py                     # FastAPI application
│   ├── models.py                  # Pydantic request/response models
│   ├── database.py                # SQLAlchemy models
│   ├── websocket.py               # WebSocket manager
│   └── __init__.py
├── frontend/                      # React dashboard
│   ├── src/
│   │   ├── components/
│   │   │   ├── Dashboard.jsx      # Main dashboard
│   │   │   ├── IncidentForm.jsx   # Incident submission
│   │   │   └── AgentCard.jsx      # Agent status cards
│   │   ├── api.js                 # API client
│   │   ├── App.jsx
│   │   └── main.jsx
│   ├── package.json
│   └── vite.config.js
├── core/
│   ├── config.py                  # Configuration management
│   ├── logger.py                  # Rich logging setup
│   └── __init__.py
├── chroma_db/                     # Vector database for Memory Agent
├── main.py                        # CLI interface
├── Dockerfile                     # Production container
├── docker-compose.yml             # Orchestration
├── requirements.txt               # Python dependencies
├── .env.example                   # Environment template
└── README.md                      # This file

🎨 Features in Detail

1. Autonomous Tool Calling

Agents don't just analyze text - they gather real data by calling diagnostic tools:

Available Tools:

  • get_metrics(service, metric_type) - CPU, memory, latency, error rates
  • fetch_logs(service, level) - Recent error/warning logs
  • query_database(query_type) - Connection pools, slow queries, locks
  • check_service_status(service) - Health checks, uptime
  • search_documentation(query) - Internal docs and runbooks
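For reference, a tool like get_metrics could be declared in the OpenAI function-calling schema format roughly as below. The authoritative definitions live in tools/base_tools.py; the descriptions and enum values here are assumptions drawn from the list above:

```python
# Illustrative sketch of a tool declaration in OpenAI function-calling format.
GET_METRICS_TOOL = {
    "type": "function",
    "function": {
        "name": "get_metrics",
        "description": "Fetch current metrics for a service (CPU, memory, latency, error rate).",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {
                    "type": "string",
                    "description": "Service name, e.g. 'api-service'",
                },
                "metric_type": {
                    "type": "string",
                    "enum": ["cpu", "memory", "latency", "error_rate"],
                },
            },
            "required": ["service", "metric_type"],
        },
    },
}
```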

How it works:

# Performance Agent autonomously decides to gather metrics
Agent: "I need to check CPU and memory before diagnosing"
Tool Call: get_metrics("api-service", "cpu")
Result: {"current": "87.3%", "threshold": "80%", "status": "WARNING"}
Agent: "High CPU detected. Let me fetch recent logs..."
Tool Call: fetch_logs("api-service", "error")
# Agent uses tool results to make informed diagnosis

2. Intelligent Agent Selection

The Commander Agent automatically determines which specialists to deploy based on incident keywords:

# Memory keywords trigger Memory Agent
"similar", "seen before", "past incident"

# Security keywords trigger Security Agent
"exposed", "leaked", "vulnerability", "CVE"

# Performance keywords trigger Performance Agent
"slow", "timeout", "high CPU", "memory leak"

# Database keywords trigger Database Agent
"connection pool", "deadlock", "query timeout"

# Network keywords trigger Network Agent
"SSL error", "DNS", "connection refused", "timeout"

# Always deploys Log Analyzer if error logs present
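The routing above can be sketched as a simple keyword lookup. Function and agent names here are illustrative, not the actual commander_agent.py implementation:

```python
# Keyword-to-agent map mirroring the lists above (illustrative sketch).
KEYWORD_MAP = {
    "memory_agent": ["similar", "seen before", "past incident"],
    "security_agent": ["exposed", "leaked", "vulnerability", "cve"],
    "performance_agent": ["slow", "high cpu", "memory leak"],
    "database_agent": ["connection pool", "deadlock", "query timeout"],
    "network_agent": ["ssl error", "dns", "connection refused", "timeout"],
}

def select_specialists(description: str, has_error_logs: bool = False) -> list[str]:
    """Return the specialist agents to deploy for an incident description."""
    text = description.lower()
    selected = [
        agent for agent, keywords in KEYWORD_MAP.items()
        if any(kw in text for kw in keywords)
    ]
    # The Log Analyzer is always deployed when error logs are attached.
    if has_error_logs and "log_analyzer_agent" not in selected:
        selected.append("log_analyzer_agent")
    return selected
```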

3. Dual Provider Support

Switch between OpenAI and Anthropic models:

# .env configuration
DEFAULT_MODEL=gpt-4o              # Commander & Synthesizer
SPECIALIST_MODEL=gpt-4o-mini      # Specialist agents

# Or use Claude
DEFAULT_MODEL=claude-3-5-sonnet-20241022
SPECIALIST_MODEL=claude-3-haiku-20240307
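Under the hood, the base agent presumably dispatches on the model name. A minimal sketch of such a check follows; the actual logic lives in agents/base_agent.py and may differ:

```python
# Assumption: Anthropic model IDs start with "claude"; everything else
# is routed to OpenAI. This mirrors the .env examples above.
def provider_for(model: str) -> str:
    """Pick the API provider for a given model identifier."""
    return "anthropic" if model.startswith("claude") else "openai"
```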

4. Cost Optimization

  • GPT-4o-mini for specialists ($0.15/$0.60 per million tokens) - Pattern matching, log parsing
  • GPT-4o for synthesis ($2.50/$10.00 per million tokens) - Complex reasoning, root cause analysis
  • Prompt caching - 50-90% cost reduction on repeated system prompts

5. Real-Time WebSocket Updates

Watch investigations unfold in real-time:

const ws = new WebSocket('ws://localhost:8000/ws/incident_123');
ws.onmessage = (event) => {
  const update = JSON.parse(event.data);
  console.log(`${update.agent}: ${update.message}`);
  // "Memory Agent: Found 3 similar incidents"
  // "Performance Agent: Calling get_metrics tool..."
  // "Commander: Investigation complete!"
};

6. Rich Terminal UI

  • Real-time agent status updates
  • Color-coded results (green = success, red = failure)
  • Progress indicators and spinners
  • Beautiful tables and panels
  • Cost breakdown per agent

🔜 Roadmap

✅ Phase 1: Core System (COMPLETE)

  • ✅ Multi-agent architecture
  • ✅ Commander orchestration
  • ✅ 6 specialist agents (Log, Database, Network, Memory, Security, Performance)
  • ✅ CLI interface
  • ✅ Cost tracking

✅ Phase 2: Enhanced Agents (COMPLETE)

  • ✅ Memory Agent with ChromaDB vector search
  • ✅ Security Agent with secret scanning
  • ✅ Performance Agent with bottleneck detection
  • ✅ Autonomous tool calling framework (5 production tools)
  • ✅ Dual OpenAI/Anthropic provider support

✅ Phase 3: Backend (COMPLETE)

  • ✅ FastAPI REST API with async processing
  • ✅ WebSocket real-time updates
  • ✅ SQLite database persistence
  • ✅ Background task processing
  • ✅ Incident history and analytics
  • ✅ Docker containerization
  • ✅ docker-compose orchestration

✅ Phase 4: Frontend (COMPLETE)

  • ✅ React dashboard with Vite + TailwindCSS
  • ✅ Real-time agent collaboration visualization
  • ✅ WebSocket connection for live updates
  • ✅ Tool call visualization in agent cards
  • ✅ Incident submission form with presets
  • ✅ Activity log with color-coded messages
  • ✅ Final report display with recommendations

Phase 5: Integrations & Advanced Features (Future)

  • Incident history browser and search
  • Team collaboration features
  • Integration with monitoring tools (DataDog, New Relic, Sentry)
  • Slack/Discord bot integration
  • Automated incident detection
  • Preventive suggestions based on ML patterns
  • Multi-tenant support

📚 Technical Deep Dive

Why Multi-Agent Architecture?

Better than single LLM:

  • Each agent has specialized expertise and context
  • Parallel processing (future: run agents concurrently)
  • Cost optimization (cheap models for simple tasks)
  • Easier to extend (add new specialist agents)
  • More explainable (see each agent's reasoning)

Real-world analogy:

Single LLM  = One person doing everything
Multi-Agent = Specialized team working together
              (How real DevOps teams work!)

Agent Communication Flow

  1. Incident Input → Commander receives incident data
  2. Analysis → Commander determines required specialists
  3. Deployment → Specialist agents analyze in their domain
  4. Collection → Commander gathers all findings
  5. Synthesis → Synthesizer creates unified report
  6. Output → Actionable incident report
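The six steps can be sketched as a plain function with stubbed agents. The real implementation is async and streams progress over WebSocket; everything named here is illustrative:

```python
def investigate(incident: str, commander, specialists: dict, synthesizer) -> str:
    """Steps 1-6 above: route, deploy, collect findings, synthesize."""
    chosen = commander(incident)                    # steps 1-2: analysis & routing
    findings = {name: specialists[name](incident)   # step 3: specialists investigate
                for name in chosen}
    return synthesizer(incident, findings)          # steps 4-6: collect & synthesize

# Usage with trivial stand-in agents:
report = investigate(
    "connection pool exhausted",
    commander=lambda i: ["database_agent"],
    specialists={"database_agent": lambda i: "pool undersized"},
    synthesizer=lambda i, f: f"Root cause: {f['database_agent']}",
)
print(report)  # → Root cause: pool undersized
```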

Cost Calculation

# GPT-4o (Commander & Synthesizer)
Input:  $2.50 / million tokens
Output: $10.00 / million tokens

# GPT-4o-mini (Specialist Agents)
Input:  $0.15 / million tokens
Output: $0.60 / million tokens

# Typical incident:
3 specialist agents (2000 input, 500 output each):
  - Input:  3 × 2000 × $0.15/1M  = $0.0009
  - Output: 3 × 500  × $0.60/1M  = $0.0009

1 Commander + 1 Synthesizer (3000 input, 800 output each):
  - Input:  2 × 3000 × $2.50/1M  = $0.015
  - Output: 2 × 800  × $10.00/1M = $0.016

Total = ~$0.033 per investigation (3.3 cents)
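The arithmetic above can be reproduced with a small helper (prices are per million tokens, as listed):

```python
# Per-1M-token prices from the breakdown above.
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one LLM call at the listed per-million-token prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 3 specialists (2000 in / 500 out each) + Commander and Synthesizer
# (3000 in / 800 out each), matching the worked example above.
total = 3 * call_cost("gpt-4o-mini", 2000, 500) + 2 * call_cost("gpt-4o", 3000, 800)
print(f"${total:.4f}")  # → $0.0328
```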

Docker Deployment

# Build and run with docker-compose
docker-compose up --build

# Run in background
docker-compose up -d

# View logs
docker-compose logs -f api

# Stop
docker-compose down

# Health check
curl http://localhost:8000/health

Production considerations:

  • Mount persistent volumes for chroma_db/ and incidents.db
  • Set API keys via environment variables
  • Use reverse proxy (nginx) for SSL termination
  • Scale with Kubernetes or Docker Swarm if needed
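The persistence points above might look like this in a docker-compose override; the service name and container paths are assumptions, so adapt them to the repository's actual docker-compose.yml:

```yaml
# Illustrative override: persist the vector store and incident DB,
# and inject the API key from the host environment.
services:
  api:
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./chroma_db:/app/chroma_db
      - ./incidents.db:/app/incidents.db
```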

🧪 Development

Running Tests

pytest tests/

Adding a New Specialist Agent

from typing import Dict, List

from agents.base_agent import BaseAgent

class MyNewAgent(BaseAgent):
    def __init__(self):
        super().__init__(name="My Agent", model="claude-3-haiku-20240307")

    def get_system_prompt(self) -> str:
        return """You are an expert in..."""

    def get_tools(self) -> List[Dict]:
        return []  # Tools this agent can call autonomously

Register in commander_agent.py:

self.my_agent = MyNewAgent()

📄 License

MIT License - see LICENSE file


👤 Author

Said - Full-Stack & AI Developer

  • 🎯 Purpose: Full-stack portfolio project showcasing production-ready multi-agent AI systems
  • 🛠️ Skills Demonstrated:
    • AI/ML: Multi-agent architecture, autonomous tool calling, LLM orchestration
    • Backend: FastAPI, WebSocket real-time updates, SQLAlchemy ORM, async Python
    • Frontend: React 18, Vite, TailwindCSS, WebSocket client, real-time state management
    • APIs: OpenAI GPT-4o, Anthropic Claude 3.5, function calling for both providers
    • Databases: ChromaDB vector database, SQLite persistence
    • DevOps: Docker containerization, docker-compose orchestration
    • Architecture: Clean code, separation of concerns, RESTful APIs
    • Cost Optimization: Model selection, prompt caching, efficient token usage
    • Real-world Problem Solving: Production incident response workflows

🙏 Acknowledgments


Built with ❤️ by Said

Turning AI agents into a DevOps dream team
