
DevOps Incident Response Team 🚀

A production-ready multi-agent AI system that analyzes incidents using 6 specialized AI agents with autonomous tool calling, powered by OpenAI GPT-4o and Claude 3.5.


🎯 Overview

DevOps Incident Response Team is an intelligent multi-agent system that mimics how a real DevOps team investigates production incidents. When you report an incident, a Commander Agent deploys specialist agents (Log Analyzer, Database, Network, Memory, Security, Performance) to investigate. Each agent autonomously calls diagnostic tools to gather real data, then provides expert analysis. The Commander synthesizes their findings into an actionable incident report.

Key Features

  • 🤖 6 Specialist Agents - Memory, Security, Performance, Log Analyzer, Database, Network experts
  • 🛠️ Autonomous Tool Calling - Agents execute diagnostic tools to gather real production data
  • 🧠 Intelligent Orchestration - Commander Agent automatically deploys relevant specialists
  • 💰 Cost-Optimized - GPT-4o-mini for specialists, GPT-4o for synthesis (dual OpenAI/Claude support)
  • 🌐 REST API + WebSocket - Production FastAPI backend with real-time updates
  • 📊 Real-Time Insights - Beautiful terminal UI and WebSocket streaming
  • 💡 Actionable Reports - Root cause analysis with prioritized action items
  • 🐳 Docker Ready - Full containerization with docker-compose orchestration

🏗️ Architecture

Incident Reported (REST API or CLI)
       ↓
┌──────────────────┐
│ Commander Agent  │  ← Orchestrates the team
│  (GPT-4o)        │
└──────────────────┘
       ↓
   Analyzes incident
   Deploys specialists
       ↓
┌─────────────────────────────────────────────────────────┐
│            6 Specialist Agents (GPT-4o-mini)            │
│                                                         │
│  ┌──────────────┐  ┌───────────────┐  ┌─────────────┐ │
│  │ Log Analyzer │  │ Database Agent│  │Network Agent│ │
│  │  + Tools     │  │   + Tools     │  │  + Tools    │ │
│  └──────────────┘  └───────────────┘  └─────────────┘ │
│                                                         │
│  ┌──────────────┐  ┌───────────────┐  ┌─────────────┐ │
│  │Memory Agent  │  │Security Agent │  │Performance  │ │
│  │ (ChromaDB)   │  │  + Tools      │  │   Agent     │ │
│  └──────────────┘  └───────────────┘  └─────────────┘ │
│                                                         │
│     Each agent can call tools autonomously:            │
│     - get_metrics, fetch_logs, query_database          │
│     - check_service_status, search_documentation       │
└─────────────────────────────────────────────────────────┘
       ↓
   Results collected via WebSocket
       ↓
┌──────────────────┐
│   Synthesizer    │  ← Combines findings
│  (GPT-4o)        │
└──────────────────┘
       ↓
  Incident Report (JSON + Real-time streaming)
  - Root Cause
  - Action Items
  - Confidence Score
  - Cost & Metrics

Agent Roles

| Agent | Expertise | Model | Tools | Purpose |
|---|---|---|---|---|
| Commander | Orchestration | GPT-4o | None | Routes incident to specialists, coordinates investigation |
| Log Analyzer | Logs & stack traces | GPT-4o-mini | fetch_logs | Parses error messages, identifies patterns |
| Database Agent | Database issues | GPT-4o-mini | query_database | Connection pools, queries, deadlocks |
| Network Agent | Connectivity | GPT-4o-mini | check_service_status | DNS, SSL, timeouts, HTTP errors |
| Memory Agent | Past incidents | GPT-4o-mini | ChromaDB | Finds similar incidents, learns from history |
| Security Agent | Vulnerabilities | GPT-4o-mini | secret_scanner | Exposed secrets, OWASP Top 10, CVEs |
| Performance Agent | Bottlenecks | GPT-4o-mini | get_metrics | CPU, memory, latency, throughput analysis |
| Synthesizer | Root cause analysis | GPT-4o | None | Combines all findings into an actionable report |

🚀 Quick Start

Prerequisites

  • Python 3.12+
  • OpenAI API key (from platform.openai.com) - Primary provider
  • Anthropic API key (from console.anthropic.com) - Optional secondary provider

Installation

# Clone repository
git clone https://github.com/yourusername/devops-agent-team.git
cd devops-agent-team

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY (and optionally ANTHROPIC_API_KEY)

Option 1: Run via CLI

python main.py

Option 2: Run via REST API

# Start the FastAPI server
uvicorn api.app:app --host 0.0.0.0 --port 8000

# Or use Docker
docker-compose up --build

Access the API: interactive docs at http://localhost:8000/docs (Swagger UI) or http://localhost:8000/redoc.

Option 3: Run with React Frontend

# Terminal 1: Start backend
uvicorn api.app:app --host 0.0.0.0 --port 8000

# Terminal 2: Start frontend
cd frontend
npm install
npm run dev

Access the dashboard: http://localhost:3000
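Whichever interface you use, an incident submission is essentially a small structured payload. Here is a hedged sketch using a stdlib dataclass; the authoritative request schema is the Pydantic model in api/models.py, and the field names below are assumptions:

```python
from dataclasses import dataclass, field

# Illustrative payload shape only -- see api/models.py for the real schema.
@dataclass
class IncidentReport:
    description: str
    error_message: str = ""
    service: str = ""
    tags: list[str] = field(default_factory=list)

incident = IncidentReport(
    description="API endpoint returning 500 errors",
    error_message="psycopg2.pool.PoolExhaustedError: connection pool exhausted",
)
```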


📖 Example Investigation

Input: Database Connection Pool Exhaustion

Description: API endpoint returning 500 errors

Error Message:
[ERROR] Database connection failed
psycopg2.pool.PoolExhaustedError: connection pool exhausted (size: 10, max: 10)
[WARN] Queue backing up: 145 pending requests

Output: AI Agent Investigation

🚨 INCIDENT DETECTED
Description: API endpoint returning 500 errors
Time: 2025-01-15 10:23:45

🤖 Deploying 2 specialist agents

┌─ Log Analyzer ──────────────────────────────┐
│ ✅ Analysis Complete                        │
│ Confidence: 90% | Cost: $0.0023 | Time: 1.2s│
│ Suggestions: 3                              │
└─────────────────────────────────────────────┘

┌─ Database Agent ────────────────────────────┐
│ ✅ Analysis Complete                        │
│ Confidence: 95% | Cost: $0.0019 | Time: 1.5s│
│ Suggestions: 4                              │
└─────────────────────────────────────────────┘

🧠 Synthesizing findings from all agents...

📋 INCIDENT REPORT

Root Cause:
Database connection pool is undersized for current load.
The pool is configured with only 10 connections, which is
being exhausted during peak traffic, causing API requests
to fail.

Recommendations:
1. Immediately increase pool size from 10 to 50 connections
2. Configure pool timeout to 30 seconds with proper error handling
3. Add connection pool monitoring and alerting
4. Review slow queries that might be holding connections
5. Implement connection retry logic with exponential backoff

Total Cost: $0.0158 | Investigation Time: 4.2s
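Recommendation 5 above, retry with exponential backoff, can be sketched like this; the attempt count, base delay, and jitter range are illustrative values, not part of the project:

```python
import random
import time

def retry_with_backoff(operation, max_attempts: int = 5, base_delay: float = 0.5):
    """Call `operation`, retrying failures with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # 0.5s, 1s, 2s, 4s ... plus jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```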

💰 Cost Analysis

Typical investigation costs:

| Scenario | Agents deployed | Total cost | Time |
|---|---|---|---|
| Simple log error | 1 (Log Analyzer) | $0.003 | 1-2s |
| Database issue | 2 (Log + Database) | $0.008 | 2-3s |
| Complex multi-layer | 3-4 agents + synthesis | $0.015-0.025 | 4-6s |

Per incident: ~$0.01-0.03 (1-3 cents)

Monthly estimates:

  • Light usage (50 incidents): ~$0.50-1.50/month
  • Medium usage (200 incidents): ~$2-6/month
  • Heavy usage (1000 incidents): ~$10-30/month

Compare this to downtime costs, which some industry estimates put at $14,056 per minute for mid-sized companies.


🛠️ Project Structure

devops-agent-team/
├── agents/
│   ├── base_agent.py              # Base class with OpenAI/Anthropic support
│   ├── commander_agent.py         # Orchestrator & router
│   ├── log_analyzer_agent.py      # Log analysis specialist
│   ├── database_agent.py          # Database specialist
│   ├── network_agent.py           # Network specialist
│   ├── memory_agent.py            # Past incidents (ChromaDB)
│   ├── security_agent.py          # Security & vulnerabilities
│   ├── performance_agent.py       # Performance bottlenecks
│   └── __init__.py
├── tools/
│   └── base_tools.py              # 5 production tools + execution engine
├── api/
│   ├── app.py                     # FastAPI application
│   ├── models.py                  # Pydantic request/response models
│   ├── database.py                # SQLAlchemy models
│   ├── websocket.py               # WebSocket manager
│   └── __init__.py
├── frontend/                      # React dashboard
│   ├── src/
│   │   ├── components/
│   │   │   ├── Dashboard.jsx      # Main dashboard
│   │   │   ├── IncidentForm.jsx   # Incident submission
│   │   │   └── AgentCard.jsx      # Agent status cards
│   │   ├── api.js                 # API client
│   │   ├── App.jsx
│   │   └── main.jsx
│   ├── package.json
│   └── vite.config.js
├── core/
│   ├── config.py                  # Configuration management
│   ├── logger.py                  # Rich logging setup
│   └── __init__.py
├── chroma_db/                     # Vector database for Memory Agent
├── main.py                        # CLI interface
├── Dockerfile                     # Production container
├── docker-compose.yml             # Orchestration
├── requirements.txt               # Python dependencies
├── .env.example                   # Environment template
└── README.md                      # This file

🎨 Features in Detail

1. Autonomous Tool Calling

Agents don't just analyze text - they gather real data by calling diagnostic tools:

Available Tools:

  • get_metrics(service, metric_type) - CPU, memory, latency, error rates
  • fetch_logs(service, level) - Recent error/warning logs
  • query_database(query_type) - Connection pools, slow queries, locks
  • check_service_status(service) - Health checks, uptime
  • search_documentation(query) - Internal docs and runbooks
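For reference, a tool like get_metrics could be declared in the OpenAI function-calling schema format roughly as below. The authoritative definitions live in tools/base_tools.py; the descriptions and enum values here are assumptions drawn from the list above:

```python
# Illustrative sketch of a tool declaration in OpenAI function-calling format.
GET_METRICS_TOOL = {
    "type": "function",
    "function": {
        "name": "get_metrics",
        "description": "Fetch current metrics for a service (CPU, memory, latency, error rate).",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {
                    "type": "string",
                    "description": "Service name, e.g. 'api-service'",
                },
                "metric_type": {
                    "type": "string",
                    "enum": ["cpu", "memory", "latency", "error_rate"],
                },
            },
            "required": ["service", "metric_type"],
        },
    },
}
```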

How it works:

# Performance Agent autonomously decides to gather metrics
Agent: "I need to check CPU and memory before diagnosing"
Tool Call: get_metrics("api-service", "cpu")
Result: {"current": "87.3%", "threshold": "80%", "status": "WARNING"}
Agent: "High CPU detected. Let me fetch recent logs..."
Tool Call: fetch_logs("api-service", "error")
# Agent uses tool results to make informed diagnosis

2. Intelligent Agent Selection

The Commander Agent automatically determines which specialists to deploy based on incident keywords:

# Memory keywords trigger Memory Agent
"similar", "seen before", "past incident"

# Security keywords trigger Security Agent
"exposed", "leaked", "vulnerability", "CVE"

# Performance keywords trigger Performance Agent
"slow", "timeout", "high CPU", "memory leak"

# Database keywords trigger Database Agent
"connection pool", "deadlock", "query timeout"

# Network keywords trigger Network Agent
"SSL error", "DNS", "connection refused", "timeout"

# Always deploys Log Analyzer if error logs present
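The routing above can be sketched as a simple keyword lookup. Function and agent names here are illustrative, not the actual commander_agent.py implementation:

```python
# Keyword-to-agent map mirroring the lists above (illustrative sketch).
KEYWORD_MAP = {
    "memory_agent": ["similar", "seen before", "past incident"],
    "security_agent": ["exposed", "leaked", "vulnerability", "cve"],
    "performance_agent": ["slow", "high cpu", "memory leak"],
    "database_agent": ["connection pool", "deadlock", "query timeout"],
    "network_agent": ["ssl error", "dns", "connection refused", "timeout"],
}

def select_specialists(description: str, has_error_logs: bool = False) -> list[str]:
    """Return the specialist agents to deploy for an incident description."""
    text = description.lower()
    selected = [
        agent for agent, keywords in KEYWORD_MAP.items()
        if any(kw in text for kw in keywords)
    ]
    # The Log Analyzer is always deployed when error logs are attached.
    if has_error_logs and "log_analyzer_agent" not in selected:
        selected.append("log_analyzer_agent")
    return selected
```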

3. Dual Provider Support

Switch between OpenAI and Anthropic models:

# .env configuration
DEFAULT_MODEL=gpt-4o              # Commander & Synthesizer
SPECIALIST_MODEL=gpt-4o-mini      # Specialist agents

# Or use Claude
DEFAULT_MODEL=claude-3-5-sonnet-20241022
SPECIALIST_MODEL=claude-3-haiku-20240307
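Under the hood, the base agent presumably dispatches on the model name. A minimal sketch of such a check follows; the actual logic lives in agents/base_agent.py and may differ:

```python
# Assumption: Anthropic model IDs start with "claude"; everything else
# is routed to OpenAI. This mirrors the .env examples above.
def provider_for(model: str) -> str:
    """Pick the API provider for a given model identifier."""
    return "anthropic" if model.startswith("claude") else "openai"
```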

4. Cost Optimization

  • GPT-4o-mini for specialists ($0.15/$0.60 per million tokens) - Pattern matching, log parsing
  • GPT-4o for synthesis ($2.50/$10.00 per million tokens) - Complex reasoning, root cause analysis
  • Prompt caching - 50-90% cost reduction on repeated system prompts

5. Real-Time WebSocket Updates

Watch investigations unfold in real-time:

const ws = new WebSocket('ws://localhost:8000/ws/incident_123');
ws.onmessage = (event) => {
  const update = JSON.parse(event.data);
  console.log(`${update.agent}: ${update.message}`);
  // "Memory Agent: Found 3 similar incidents"
  // "Performance Agent: Calling get_metrics tool..."
  // "Commander: Investigation complete!"
};

6. Rich Terminal UI

  • Real-time agent status updates
  • Color-coded results (green = success, red = failure)
  • Progress indicators and spinners
  • Beautiful tables and panels
  • Cost breakdown per agent

🔜 Roadmap

✅ Phase 1: Core System (COMPLETE)

  • ✅ Multi-agent architecture
  • ✅ Commander orchestration
  • ✅ 6 specialist agents (Log, Database, Network, Memory, Security, Performance)
  • ✅ CLI interface
  • ✅ Cost tracking

✅ Phase 2: Enhanced Agents (COMPLETE)

  • ✅ Memory Agent with ChromaDB vector search
  • ✅ Security Agent with secret scanning
  • ✅ Performance Agent with bottleneck detection
  • ✅ Autonomous tool calling framework (5 production tools)
  • ✅ Dual OpenAI/Anthropic provider support

✅ Phase 3: Backend (COMPLETE)

  • ✅ FastAPI REST API with async processing
  • ✅ WebSocket real-time updates
  • ✅ SQLite database persistence
  • ✅ Background task processing
  • ✅ Incident history and analytics
  • ✅ Docker containerization
  • ✅ docker-compose orchestration

✅ Phase 4: Frontend (COMPLETE)

  • ✅ React dashboard with Vite + TailwindCSS
  • ✅ Real-time agent collaboration visualization
  • ✅ WebSocket connection for live updates
  • ✅ Tool call visualization in agent cards
  • ✅ Incident submission form with presets
  • ✅ Activity log with color-coded messages
  • ✅ Final report display with recommendations

Phase 5: Integrations & Advanced Features (Future)

  • Incident history browser and search
  • Team collaboration features
  • Integration with monitoring tools (DataDog, New Relic, Sentry)
  • Slack/Discord bot integration
  • Automated incident detection
  • Preventive suggestions based on ML patterns
  • Multi-tenant support

📚 Technical Deep Dive

Why Multi-Agent Architecture?

Better than single LLM:

  • Each agent has specialized expertise and context
  • Parallel processing (future: run agents concurrently)
  • Cost optimization (cheap models for simple tasks)
  • Easier to extend (add new specialist agents)
  • More explainable (see each agent's reasoning)

Real-world analogy:

Single LLM  = One person doing everything
Multi-Agent = Specialized team working together
              (How real DevOps teams work!)

Agent Communication Flow

  1. Incident Input → Commander receives incident data
  2. Analysis → Commander determines required specialists
  3. Deployment → Specialist agents analyze in their domain
  4. Collection → Commander gathers all findings
  5. Synthesis → Synthesizer creates unified report
  6. Output → Actionable incident report
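The six steps can be sketched as a plain function with stubbed agents. The real implementation is async and streams progress over WebSocket; everything named here is illustrative:

```python
def investigate(incident: str, commander, specialists: dict, synthesizer) -> str:
    """Steps 1-6 above: route, deploy, collect findings, synthesize."""
    chosen = commander(incident)                    # steps 1-2: analysis & routing
    findings = {name: specialists[name](incident)   # step 3: specialists investigate
                for name in chosen}
    return synthesizer(incident, findings)          # steps 4-6: collect & synthesize

# Usage with trivial stand-in agents:
report = investigate(
    "connection pool exhausted",
    commander=lambda i: ["database_agent"],
    specialists={"database_agent": lambda i: "pool undersized"},
    synthesizer=lambda i, f: f"Root cause: {f['database_agent']}",
)
print(report)  # → Root cause: pool undersized
```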

Cost Calculation

# GPT-4o (Commander & Synthesizer)
Input:  $2.50 / million tokens
Output: $10.00 / million tokens

# GPT-4o-mini (Specialist Agents)
Input:  $0.15 / million tokens
Output: $0.60 / million tokens

# Typical incident:
3 specialist agents (2000 input, 500 output each):
  - Input:  3 × 2000 × $0.15/1M  = $0.0009
  - Output: 3 × 500  × $0.60/1M  = $0.0009

1 Commander + 1 Synthesizer (3000 input, 800 output each):
  - Input:  2 × 3000 × $2.50/1M  = $0.015
  - Output: 2 × 800  × $10.00/1M = $0.016

Total = ~$0.033 per investigation (3.3 cents)
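The arithmetic above can be reproduced with a small helper (prices are per million tokens, as listed):

```python
# Per-1M-token prices from the breakdown above.
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one LLM call at the listed per-million-token prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 3 specialists (2000 in / 500 out each) + Commander and Synthesizer
# (3000 in / 800 out each), matching the worked example above.
total = 3 * call_cost("gpt-4o-mini", 2000, 500) + 2 * call_cost("gpt-4o", 3000, 800)
print(f"${total:.4f}")  # → $0.0328
```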

Docker Deployment

# Build and run with docker-compose
docker-compose up --build

# Run in background
docker-compose up -d

# View logs
docker-compose logs -f api

# Stop
docker-compose down

# Health check
curl http://localhost:8000/health

Production considerations:

  • Mount persistent volumes for chroma_db/ and incidents.db
  • Set API keys via environment variables
  • Use reverse proxy (nginx) for SSL termination
  • Scale with Kubernetes or Docker Swarm if needed
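The persistence points above might look like this in a docker-compose override; the service name and container paths are assumptions, so adapt them to the repository's actual docker-compose.yml:

```yaml
# Illustrative override: persist the vector store and incident DB,
# and inject the API key from the host environment.
services:
  api:
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./chroma_db:/app/chroma_db
      - ./incidents.db:/app/incidents.db
```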

🧪 Development

Running Tests

pytest tests/

Adding a New Specialist Agent

from typing import Dict, List

from agents.base_agent import BaseAgent

class MyNewAgent(BaseAgent):
    def __init__(self):
        super().__init__(name="My Agent", model="claude-3-haiku-20240307")

    def get_system_prompt(self) -> str:
        return """You are an expert in..."""

    def get_tools(self) -> List[Dict]:
        return []  # Tools this agent can call autonomously

Register in commander_agent.py:

self.my_agent = MyNewAgent()

📄 License

MIT License - see LICENSE file


👤 Author

Said - Full-Stack & AI Developer

  • 🎯 Purpose: Full-stack portfolio project showcasing production-ready multi-agent AI systems
  • 🛠️ Skills Demonstrated:
    • AI/ML: Multi-agent architecture, autonomous tool calling, LLM orchestration
    • Backend: FastAPI, WebSocket real-time updates, SQLAlchemy ORM, async Python
    • Frontend: React 18, Vite, TailwindCSS, WebSocket client, real-time state management
    • APIs: OpenAI GPT-4o, Anthropic Claude 3.5, function calling for both providers
    • Databases: ChromaDB vector database, SQLite persistence
    • DevOps: Docker containerization, docker-compose orchestration
    • Architecture: Clean code, separation of concerns, RESTful APIs
    • Cost Optimization: Model selection, prompt caching, efficient token usage
    • Real-world Problem Solving: Production incident response workflows

🙏 Acknowledgments


Built with ❤️ by Said

Turning AI agents into a DevOps dream team
