CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

AIOpsLab is a holistic framework for evaluating autonomous AIOps agents in interactive cloud environments. It orchestrates microservice deployments, fault injection, workload generation, and telemetry collection to benchmark AI agents on detection, localization, analysis, and mitigation tasks.

Development Commands

Environment Setup

# Install dependencies using Poetry (recommended, requires Python >= 3.11)
poetry env use python3.11
poetry install
eval $(poetry env activate)

# Alternative: pip install
pip install -e .

Running Agents Locally

# Interactive CLI with human agent
python3 cli.py
# In the REPL:
(aiopslab) $ start <problem_id>  # e.g., misconfig_app_hotel_res-detection-1
(aiopslab) $ submit("Yes")

# Run baseline agents (GPT, Qwen, DeepSeek, etc.)
python3 clients/gpt.py
python3 clients/qwen.py
python3 clients/deepseek.py
python3 clients/vllm.py

# Configure API keys in .env file
cp .env.example .env
# Edit .env to add OPENAI_API_KEY, QWEN_API_KEY, etc.

Running AIOpsLab as a Service

# Start FastAPI service on remote machine
SERVICE_HOST=0.0.0.0 SERVICE_PORT=1818 SERVICE_WORKERS=1 python service.py

# Test endpoints
curl http://<host>:<port>/health
curl http://<host>:<port>/problems
curl http://<host>:<port>/agents

# Run simulation via API
curl -X POST http://<host>:<port>/simulate \
  -H "Content-Type: application/json" \
  -d '{"problem_id": "misconfig_app_hotel_res-mitigation-1", "agent_name": "vllm", "max_steps": 10}'

Testing

# Run unit tests
python -m unittest discover tests/

# Run specific test files
python -m unittest tests.parser.test_parser
python -m unittest tests.registry.test_get_actions

Cluster Setup

Option 1: Local simulated cluster (kind)

kind create cluster --config kind/kind-config-x86.yaml   # For x86
kind create cluster --config kind/kind-config-arm.yaml   # For ARM

# Configure AIOpsLab
cd aiopslab
cp config.yml.example config.yml
# Edit config.yml: set k8s_host to localhost or kind

Option 2: Azure VMs with Terraform + Ansible

See section: Azure Deployment with Terraform + Ansible

Cluster Management

# Use k9s for monitoring (recommended)
k9s

# Or use kubectl directly
kubectl get pods -n <namespace>
kubectl logs <pod-name> -n <namespace>

Core Architecture

Orchestrator Architecture

Central Coordination Flow:

Orchestrator
  ├── init_problem() → deploys app, injects fault, starts workload
  ├── start_problem() → agent-environment interaction loop
  │     ├── ask_agent() → get next action from agent
  │     ├── ask_env() → execute action via session.problem
  │     └── returns observation to agent
  └── Evaluates results and saves session

Key Components:

Orchestrator (aiopslab/orchestrator/orchestrator.py): Central coordinator managing lifecycle from initialization to evaluation
Session (aiopslab/session.py): Tracks agent-environment interaction history, timing, results; supports JSON export and W&B logging
ResponseParser (aiopslab/orchestrator/parser.py): Extracts API calls from agent responses using AST parsing for complex arguments

Problem Structure

Problems inherit from Task subclasses (Detection, Localization, Analysis, Mitigation) and define:

inject_fault(): Injects problem using fault injector
recover_fault(): Cleans up injected fault
start_workload(): Initiates load generation (can be sync or async)
eval(soln, trace, duration): Evaluates agent solution
get_task_description(): Human/LLM-readable description
get_instructions(): API usage instructions
get_available_actions(): Dict of available APIs with docstrings

Problem ID Format: <problem_type>-<task_type>-<variant>

Example: pod_failure_hotel_res-detection-1
Example: k8s_target_port-misconfig-mitigation-2

Problem Registry (aiopslab/orchestrator/problems/registry.py):

Centralized mapping of problem IDs to instances
60+ problems across multiple categories
Access via orch.probs.get_problem_ids() or list at /problems endpoint

Agent Integration

Required Agent Interface:

class YourAgent:
    def init_context(self, problem_desc: str, instructions: str, apis: dict):
        """Initialize agent with problem context"""
        pass

    async def get_action(self, observation: str) -> str:
        """Return next action as markdown code block"""
        return "Action:\n```\napi_name(args)\n```"

Agent Lifecycle:

Create agent instance with model parameters
Register with orchestrator: orch.register_agent(agent, name="agent-name")
Initialize problem: problem_desc, instructs, apis = orch.init_problem(problem_id)
Set agent context: agent.init_context(problem_desc, instructs, apis)
Start problem: await orch.start_problem(max_steps=30)

Agent Registry (clients/registry.py):

Maps agent names to implementations
Supported: GPT, Qwen, DeepSeek, vLLM, OpenRouter, Groq
Access via AgentRegistry().get_agent_ids()

Action System

Action Categories by Task:

Detection: get_logs, get_metrics, get_traces, exec_shell, submit
Localization: Same as detection + targeted analysis
Mitigation: All above + scaling, config patching, restarts
Analysis: Comprehensive debugging capabilities

Action Decorators:

@action  # Standard action
@read    # Read-only (no state change)
@write   # State-modifying action

Actions are dynamically discovered via these decorators and exposed to agents with auto-generated documentation from docstrings.

Action Execution:

Agent response parsed by ResponseParser
Routed via Task.perform_action(api_name, *args, **kwargs)
Returns string observation or error
Invalid actions raise InvalidActionError

Service Layer

Application Base (aiopslab/service/apps/base.py):

Abstract interface for all applications
Loads metadata from JSON (aiopslab/service/metadata/)
Key attributes: namespace, helm_configs, k8s_deploy_path, docker_deploy_path
Methods: load_app_json(), get_app_summary(), create_namespace(), cleanup()

Available Applications:

HotelReservation: Microservice hotel booking
SocialNetwork: Social media app
AstronomyShop: OpenTelemetry e-commerce demo
FlightTicket, TrainTicket, TiDBCluster, Flower (federated learning)

Helm Integration (aiopslab/service/helm.py):

Static methods for chart install/uninstall
Automatic dependency resolution
Namespace creation and extra args support

Kubectl Wrapper (aiopslab/service/kubectl.py):

Abstracts K8s API calls
Pod management, namespace ops, config maps, logs
Container runtime detection (Docker/Containerd)

Fault Injection

Fault Injector Hierarchy:

FaultInjector (base)
  ├── SymptomFaultInjector (Chaos Mesh-based) ← Most common
  ├── ApplicationFaultInjector (App-level)
  ├── OSFaultInjector (OS-level)
  ├── HardwareFaultInjector
  ├── OperatorFaultInjector (K8s operator)
  ├── OTelFaultInjector (OpenTelemetry)
  ├── VirtualFaultInjector (Mock faults)
  └── NoOpFaultInjector (Testing)

SymptomFaultInjector (aiopslab/generators/fault/inject_symp.py):

Uses Chaos Mesh v2.6.2 for fault injection
Installed via Helm in chaos-mesh namespace
Supported fault types:
- inject_pod_failure(): Kill pods
- inject_network_delay(): Add latency
- inject_network_loss(): Drop packets
- Container kills, resource exhaustion, etc.
Pattern: inject_<fault_type>() and recover_<fault_type>()
Creates temporary YAML for each experiment

Workload Generation (aiopslab/generators/workload/wrk.py):

Wraps wrk2 load generator
Configurable: rate, connections, threads, duration, distribution (normal/exponential/uniform)
Supports Lua script payloads via ConfigMap
Launches as K8s Job for distributed load

Observability

Telemetry Collection (aiopslab/observer/observe.py):

Multi-threaded collection of traces, logs, metrics
Time-window based (configurable start/end)
Saves to: telemetry_data_YYYYMMDD_HHMMSS/

Metric Collection (aiopslab/observer/metric_api.py):

PrometheusAPI queries Prometheus
Collects: CPU, memory, network I/O per pod/container
Container limits, Istio metrics (latency, throughput)
Exports as CSV/DataFrame

Log Collection (aiopslab/observer/log_api.py):

LogAPI connects to Elasticsearch
Filters by namespace, pod, timestamp range
Exports as CSV

Trace Collection (aiopslab/observer/trace_api.py):

TraceAPI queries Jaeger for distributed traces
Service-level trace analysis

Evaluation

Evaluation Flow:

Agent submits solution via submit(answer)
Problem's eval(soln, trace, duration) is called
Task base class provides default metrics:
- Detection: Accuracy (Yes/No), TTD (Time to Detection)
- Localization: Accuracy (service name match), TTL (Time to Localization)
- Analysis: Root cause accuracy, TTA (Time to Analysis)
- Mitigation: Success rate, TTM (Time to Mitigation)
Custom metrics added via self.add_result(metric_name, value)
Results saved to data/results/ as JSON
Optional W&B logging (set USE_WANDB=true in .env)

LLM-as-Judge (optional):

Set qualitative_eval: true in config.yml
Evaluates reasoning quality using LLM
Prompts in aiopslab/orchestrator/evaluators/prompts.py

Code Patterns and Conventions

Async/Await Support

start_workload() can be sync or async
Orchestrator detects via inspect.iscoroutinefunction()
All agent get_action() methods are async

Template Method Pattern

Task base class defines contract, subclasses override:

get_task_description()
get_instructions()
perform_action()
eval()

Factory Pattern

ProblemRegistry creates problems on demand via lambdas
AgentRegistry instantiates agents by name
Enables parameterized problem variants

Context Manager Pattern

CriticalSection for thread-safe fault recovery
atexit handlers ensure cleanup on unexpected exit
Fault recovery registered during inject_fault()

Configuration Management

aiopslab/config.yml (copy from config.yml.example):

k8s_host: Control plane hostname (localhost/kind/)
k8s_user: Username on control plane
ssh_key_path: Path to SSH key
data_dir: Where telemetry/results are stored
qualitative_eval: Enable LLM-as-Judge evaluation
print_session: Print session trace after completion

Environment Variables (.env):

API keys: OPENAI_API_KEY, DEEPSEEK_API_KEY, DASHSCOPE_API_KEY, GROQ_API_KEY
W&B: USE_WANDB=true

Adding New Components

Adding a New Problem

Define Problem Class (in aiopslab/orchestrator/problems/<your_problem>/):

from aiopslab.orchestrator.tasks.localization import LocalizationTask
from aiopslab.service.apps.myapp import MyApp

class MyProblem(LocalizationTask):
    def __init__(self):
        self.app = MyApp()

    def start_workload(self):
        # Workload generation logic
        pass

    def inject_fault(self):
        # Fault injection logic
        pass

    def eval(self, soln, trace, duration):
        super().eval(soln, trace, duration)  # Default metrics
        # Add custom metrics
        self.add_result("custom_metric", value)
        return self.results

Register Problem (in aiopslab/orchestrator/problems/registry.py):

from aiopslab.orchestrator.problems.my_problem import MyProblem

class ProblemRegistry:
    def __init__(self):
        self.PROBLEM_REGISTRY = {
            # ... existing problems ...
            "my_problem-localization-1": MyProblem,
        }

Adding a New Application

Create Metadata JSON (aiopslab/service/metadata/myapp.json):

{
  "name": "MyApp",
  "description": "Description of the app",
  "namespace": "test-myapp",
  "Helm Config": {
    "release_name": "myapp-release",
    "chart_path": "path/to/helm/chart",
    "namespace": "test-myapp"
  }
}

Create Application Class (aiopslab/service/apps/myapp.py):

from aiopslab.service.apps.base import Application

class MyApp(Application):
    def __init__(self):
        super().__init__("path/to/metadata/myapp.json")

Adding a New Agent

Implement Agent (clients/myagent.py):

class MyAgent:
    def init_context(self, problem_desc: str, instructions: str, apis: dict):
        self.problem_desc = problem_desc
        self.instructions = instructions
        self.apis = apis

    async def get_action(self, observation: str) -> str:
        # Your agent logic
        return f"Action:\n```\n{api_call}\n```"

Register Agent (clients/registry.py):

from clients.myagent import MyAgent

class AgentRegistry:
    def __init__(self):
        self.AGENT_REGISTRY = {
            # ... existing agents ...
            "myagent": MyAgent,
        }

Important Implementation Notes

Shell Command Restrictions

No interactive commands: kubectl edit, docker logs -f
Use specific APIs instead: get_logs(), get_metrics(), get_traces()

Problem Execution Lifecycle

Initialization: Deploy app, create storage (OpenEBS for K8s)
Fault Injection: Deploy Chaos Mesh experiment, register recovery
Workload Start: Launch wrk2 job at specified rate
Agent Loop: Up to max_steps iterations of agent-environment interaction
Evaluation: Measure correctness and efficiency
Cleanup: Recover fault, delete namespace, remove storage

Fault Recovery

Always registered with atexit during inject_fault()
Thread-safe via CriticalSection context manager
Cleanup happens even on unexpected termination

Testing Strategy

Unit tests in tests/ organized by functionality
Parser tests: API extraction, argument parsing, context extraction
Registry tests: Action discovery via decorators
Shell tests: Command execution validation
Use mocks where possible to avoid K8s dependencies

Key Files Reference

cli.py: Interactive CLI for human agents
service.py: FastAPI service for remote execution
assessment.py: Batch evaluation script
aiopslab/orchestrator/orchestrator.py: Main orchestration engine
aiopslab/session.py: Session tracking and persistence
aiopslab/orchestrator/parser.py: Action parsing from agent responses
aiopslab/orchestrator/problems/registry.py: Problem definitions
clients/registry.py: Agent implementations
aiopslab/service/apps/: Application interfaces
aiopslab/generators/fault/: Fault injection implementations
aiopslab/generators/workload/wrk.py: Workload generation
aiopslab/observer/: Telemetry collection (logs, metrics, traces)

Common Workflows

Evaluating an Agent on a Problem

from aiopslab.orchestrator import Orchestrator
from clients.gpt import GPTAgent

# Create and register agent
agent = GPTAgent()
orch = Orchestrator()
orch.register_agent(agent, name="gpt-agent")

# Initialize and start problem
problem_desc, instructs, apis = orch.init_problem("pod_failure_hotel_res-detection-1")
agent.init_context(problem_desc, instructs, apis)
await orch.start_problem(max_steps=30)

# Results saved to data/results/<session_id>.json

Batch Evaluation

# Use assessment.py for batch evaluation
python assessment.py
# Configure problems and agents in the script

Debugging Agent Responses

Enable session printing: print_session: true in config.yml
Check parsed actions in session trace

Use ResponseParser directly for testing:

from aiopslab.orchestrator.parser import ResponseParser
parser = ResponseParser()
result = parser.parse(agent_response)

Azure Deployment with Terraform + Ansible

Deployment Modes

Mode	AIOpsLab Runs On	K8s Cluster	Use Case
Mode A	Controller VM (inside cluster)	Same machine	Production, full fault injection support
Mode B	Your laptop (remote kubectl)	Azure VMs	Development, debugging

Note: VirtualizationFaultInjector requires Docker on the machine running AIOpsLab. Use official Poetry installer, not apt install python3-poetry.

Tested on: WSL2 (Ubuntu 22.04) on Windows 11 with Azure VMs (Ubuntu 22.04 LTS, amd64). The deploy.py auto-install targets Linux/amd64; macOS and native Windows are not currently supported.

Quick Start (Mode B - Laptop)

# Single command: provisions VMs, runs Ansible, installs tools, configures AIOpsLab
python3 scripts/terraform/deploy.py --apply --resource-group <your-rg> --workers 2 --mode B

# After deploy completes, start AIOpsLab:
eval $(poetry env activate)
python3 cli.py

Quick Start (Mode A - Controller VM)

# Clone mode: git clones the repo on the controller
python3 scripts/terraform/deploy.py --apply --resource-group <your-rg> --workers 2 --mode A

# Dev mode: rsync local code to the controller
python3 scripts/terraform/deploy.py --apply --resource-group <your-rg> --workers 2 --mode A --dev

# After deploy, SSH to controller:
ssh -i ~/.ssh/id_rsa azureuser@<controller-ip>
cd ~/AIOpsLab && eval $(poetry env activate)
python3 cli.py

Other deploy.py commands

# Dry-run:
python3 scripts/terraform/deploy.py --plan --resource-group <your-rg> --workers 2

# Re-run setup without reprovisioning VMs (e.g., after code changes):
python3 scripts/terraform/deploy.py --setup-only --mode A --dev

# Restrict NSG access (SSH + K8s API) to a service tag or CIDR:
python3 scripts/terraform/deploy.py --apply --resource-group <your-rg> --allowed-ips CorpNetPublic

# Destroy infrastructure:
python3 scripts/terraform/deploy.py --destroy --resource-group <your-rg>

Key Files

File	Purpose
`scripts/terraform/deploy.py`	Single-command deployment (Terraform + Ansible + AIOpsLab setup)
`scripts/terraform/main.tf`	Azure VM provisioning (controller + workers)
`scripts/terraform/variables.tf`	Configurable parameters (VM size, count, etc.)
`scripts/terraform/generate_inventory.py`	Creates Ansible inventory from Terraform output
`scripts/ansible/setup_common.yml`	Installs Docker, K8s packages on all nodes
`scripts/ansible/remote_setup_controller_worker.yml`	Initializes K8s cluster, joins workers
`scripts/ansible/setup_aiopslab.yml`	Mode A: installs Python 3.11, Poetry, Helm, clones/rsyncs repo, runs poetry install
`scripts/ansible/templates/config.yml.j2`	Mode A: Jinja2 template for aiopslab/config.yml
`scripts/ansible/inventory.yml`	Generated inventory (don't edit manually)

Important Configuration

Ansible Inventory Variables (inventory.yml):

k8s_user: SSH username (e.g., azureuser)
user_home_base: Home directory base path
- /home for cloud VMs (Azure, AWS, GCP)
- /users for Emulab testbed
private_ip: Internal IP for K8s cluster communication
ansible_host: Public IP for SSH access

AIOpsLab config.yml (for Mode B):

k8s_host: <controller-public-ip>  # e.g., 20.150.145.167
k8s_user: azureuser
ssh_key_path: ~/.ssh/id_rsa

Common Issues and Fixes

Issue	Cause	Fix
`conntrack not found`	Missing package	Added to `setup_common.yml` prerequisites
Certificate error with kubectl	Cert missing public IP	Playbook adds `--apiserver-cert-extra-sans`
Kubeconfig uses private IP	Can't reach from laptop	Playbook auto-updates to public IP
Helm chart not found	Submodules not cloned	Run `git submodule update --init --recursive`
Submodule init fails in WSL	Worktree `.git` file has Windows paths	Run from Git Bash, not WSL
`poetry shell` not found	Removed in Poetry 2.0	Use `eval $(poetry env activate)` instead
Poetry "not supported" Python	System python too old	`poetry env use python3.11 && poetry install`
Path `/users/` not found	Wrong home base for cloud	Set `user_home_base: /home` in inventory

NSG (Network Security Group) Rules

The Terraform config creates NSG rules for:

SSH (22): Open to all (*) by default. Restrict via --allowed-ips flag or nsg_allowed_source variable.
K8s API (6443): Open to all (*) by default. Restrict via --allowed-ips flag or nsg_allowed_source variable.

To allow access from other IPs, modify main.tf or add rules via Azure CLI.

Destroying Infrastructure

cd scripts/terraform
terraform destroy -var="resource_group_name=<your-rg>"

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                     Your Laptop (WSL)                       │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ │
│  │  Terraform  │→ │   Ansible   │→ │     AIOpsLab        │ │
│  │  (Azure VMs)│  │ (K8s setup) │  │ (kubectl + cli.py)  │ │
│  └─────────────┘  └─────────────┘  └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                    Azure (VNet 10.0.0.0/16)                 │
│  ┌─────────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │   Controller    │  │  Worker 1   │  │  Worker N   │     │
│  │  (K8s control)  │  │             │  │             │     │
│  │  Public + Priv  │  │ Private IP  │  │ Private IP  │     │
│  └─────────────────┘  └─────────────┘  └─────────────┘     │
└─────────────────────────────────────────────────────────────┘

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLAUDE.md

Overview

Development Commands

Environment Setup

Running Agents Locally

Running AIOpsLab as a Service

Testing

Cluster Setup

Option 1: Local simulated cluster (kind)

Option 2: Azure VMs with Terraform + Ansible

Cluster Management

Core Architecture

Orchestrator Architecture

Problem Structure

Agent Integration

Action System

Service Layer

Fault Injection

Observability

Evaluation

Code Patterns and Conventions

Async/Await Support

Template Method Pattern

Factory Pattern

Context Manager Pattern

Configuration Management

Adding New Components

Adding a New Problem

Adding a New Application

Adding a New Agent

Important Implementation Notes

Shell Command Restrictions

Problem Execution Lifecycle

Fault Recovery

Testing Strategy

Key Files Reference

Common Workflows

Evaluating an Agent on a Problem

Batch Evaluation

Debugging Agent Responses

Azure Deployment with Terraform + Ansible

Deployment Modes

Quick Start (Mode B - Laptop)

Quick Start (Mode A - Controller VM)

Other deploy.py commands

Key Files

Important Configuration

Common Issues and Fixes

NSG (Network Security Group) Rules

Destroying Infrastructure

Architecture Diagram