noahspahn/AI-Incident-Commander

AI Incident Commander

AI Incident Commander is a simple on‑call assistant that turns raw system alerts into a clear, human‑readable incident story. It opens incidents, summarizes what’s happening, suggests what to do next, and captures a post‑incident summary.

What it does (plain English)

  • Watches for problems (errors, latency spikes, crashes).
  • Creates a new incident automatically when something breaks.
  • Summarizes the symptoms and highlights likely causes.
  • Suggests next steps using runbooks.
  • Drafts an incident summary for postmortems.
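
The first two bullets amount to mapping a raw alert onto an incident with a severity. A minimal sketch of that mapping, assuming a hypothetical alert shape and made-up thresholds (the real rules live in the backend Lambdas):

```typescript
// Hypothetical alert shape; field names are assumptions, not the repo's schema.
interface Alert {
  service: string;
  errorRate: number;    // fraction of failed requests, 0..1
  p99LatencyMs: number; // 99th-percentile latency
}

type Severity = "SEV1" | "SEV2" | "SEV3";

// Illustrative thresholds only; the actual classification logic may differ.
function classify(alert: Alert): Severity {
  if (alert.errorRate > 0.5) return "SEV1";
  if (alert.errorRate > 0.1 || alert.p99LatencyMs > 2000) return "SEV2";
  return "SEV3";
}
```

An alert with a 60% error rate would classify as `SEV1` under these example thresholds.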

Why it’s valuable

  • Saves time during incidents by reducing noise.
  • Gives non-experts a clear, readable incident overview.
  • Provides a repeatable demo of “AI + ops” that’s easy to show in a portfolio.

How it works (simple flow)

  1. A demo service fails (simulated fault).
  2. The backend records the incident and timeline events.
  3. An analysis pipeline creates a summary and stores it in S3.
  4. The React dashboard shows the incident list and timeline.
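
Steps 2–3 can be sketched with the kinds of records the backend would persist. Field names and the S3 key layout below are assumptions for illustration; the documented schema is in docs/DATA_MODEL.md:

```typescript
// Hypothetical timeline-event shape for the DynamoDB incident table.
interface TimelineEvent {
  incidentId: string; // assumed partition key
  timestamp: string;  // assumed sort key, ISO-8601
  type: "ALERT" | "ANALYSIS" | "COMMENT";
  message: string;
}

// Pure helper: build the item the backend would write for a timeline update.
function timelineEvent(
  incidentId: string,
  type: TimelineEvent["type"],
  message: string,
  now: Date = new Date()
): TimelineEvent {
  return { incidentId, timestamp: now.toISOString(), type, message };
}

// Illustrative S3 key under which the analysis pipeline might store a summary.
function summaryKey(incidentId: string): string {
  return `summaries/${incidentId}.json`;
}
```

Keeping `incidentId` as the partition key and `timestamp` as the sort key would let the dashboard fetch a full timeline with a single DynamoDB query.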

What you can see in the demo

  • A live incidents list with severity, service, and status.
  • A timeline of events and analysis updates.
  • A “Request Analysis” button that triggers the workflow.
  • Runbook recommendations filtered by service/tag.
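
The "Request Analysis" button boils down to one API call from the dashboard. The endpoint path below is an assumption for illustration, not the documented route (see docs/API.md for the actual API):

```typescript
// Build the analysis-request URL; the /incidents/{id}/analysis path is assumed.
function analysisUrl(apiBaseUrl: string, incidentId: string): string {
  return `${apiBaseUrl}/incidents/${encodeURIComponent(incidentId)}/analysis`;
}

// Sketch of the click handler: POST to the API Gateway endpoint (Node 18+ fetch).
async function requestAnalysis(apiBaseUrl: string, incidentId: string): Promise<void> {
  const res = await fetch(analysisUrl(apiBaseUrl, incidentId), { method: "POST" });
  if (!res.ok) throw new Error(`Analysis request failed: ${res.status}`);
}
```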

Architecture (high level)

  • Frontend: React dashboard hosted on S3 + CloudFront.
  • Mobile: Flutter app for alerts and incident summaries.
  • Backend: Node.js Lambdas behind API Gateway for incidents/runbooks/comments.
  • AI/Analysis: Python Lambda (stub) invoked via Step Functions, stores summaries in S3.
  • Workflow: EventBridge for incident events, Step Functions for analysis pipeline.
  • Data: DynamoDB for incident state, S3 for logs/summaries, optional OpenSearch for search.
  • Infra: CDK stacks, EKS demo services with OpenTelemetry and chaos testing.
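
The EventBridge half of the workflow can be pictured as a small event envelope that the Step Functions pipeline consumes. The `source` and `detail-type` values below are hypothetical placeholders, not names taken from the repo:

```typescript
// Assumed detail payload for an incident event on the bus.
interface IncidentEventDetail {
  incidentId: string;
  service: string;
  severity: string;
}

// Minimal EventBridge-style envelope (only the fields this sketch needs).
interface IncidentEvent {
  source: string;
  "detail-type": string;
  detail: IncidentEventDetail;
}

function incidentOpenedEvent(detail: IncidentEventDetail): IncidentEvent {
  return {
    source: "ai-incident-commander.api", // assumed source name
    "detail-type": "IncidentOpened",     // assumed event type
    detail,
  };
}
```

An EventBridge rule matching on `source` and `detail-type` would then start the Step Functions analysis pipeline for each opened incident.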

Repo layout

  • apps/web: React dashboard.
  • apps/mobile: Flutter alerts app.
  • services/api: Node.js API Lambdas.
  • services/analysis: Python analysis Lambdas.
  • infra: AWS CDK infrastructure.
  • docs: Architecture, data model, API, and demo flow.

Documentation

  • docs/ARCHITECTURE.md
  • docs/DATA_MODEL.md
  • docs/API.md
  • docs/DEMO.md

Deploy script

From the repo root:

./scripts/deploy.ps1

Optional:

./scripts/deploy.ps1 -ApiBaseUrl https://your-api-id.execute-api.region.amazonaws.com
./scripts/deploy.ps1 -SkipWebBuild
./scripts/deploy.ps1 -SkipWebDeploy

The script deploys the CDK infrastructure, builds the web app, and syncs apps/web/dist to the frontend S3 bucket, then issues a CloudFront invalidation so the new build is served immediately.

Status

Core API + dashboard + analysis stub are working. Next steps: replace analysis stub with Bedrock, expand runbook management, and add real observability signal ingestion.
