TARSy (Thoughtful Alert Response System) is an intelligent SRE system that automatically processes alerts through parallel agent chains, using MCP (Model Context Protocol) servers and optional runbooks for comprehensive multi-stage incident analysis.
This is the Go-based hybrid rewrite of TARSy, replacing the original Python implementation (now deprecated). The new architecture splits responsibilities between a Go orchestrator and a stateless Python LLM service for better performance, type safety, and scalability.
Demo video: tarsy-gh-demo.webm
- README.md -- This file: project overview and quick start
- docs/architecture-overview.md -- High-level architecture, components, and processing flow
- docs/functional-areas-design.md -- Detailed design of each functional area with file paths and interfaces
- docs/slack-integration.md -- Slack notification setup, configuration, and threading
- deploy/README.md -- Deployment and configuration guide
- deploy/config/README.md -- Configuration reference
- Go 1.25+ -- Backend orchestrator
- Python 3.13+ -- LLM service runtime
- Node.js 24+ -- Dashboard development and build tools
- uv -- Modern Python package manager
  - Install: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- PostgreSQL 17+ -- Or Podman/Docker for local development
- protoc -- Protocol Buffers compiler (for gRPC code generation)
- Podman (or Docker) -- Container runtime
- podman-compose -- Multi-container application management
  - Install: `pip install podman-compose`
Quick Check: Run `make check` to verify your development environment.
```bash
# 1. Install all dependencies (Go + Python + Dashboard)
make setup

# 2. Configure environment (REQUIRED)
cp deploy/config/.env.example deploy/config/.env
# Edit deploy/config/.env and set:
# - At least one LLM API key (e.g. GOOGLE_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY)

# 3. Start everything (database, backend, LLM service, dashboard)
make dev
```

Services will be available at:
- TARSy Dashboard: http://localhost:5173
- Backend API: http://localhost:8080
- LLM Service: gRPC on port 50051
Stop all services: `make dev-stop`
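After copying the example file in step 2, the only required change is setting one provider key, for example (placeholder value):

```bash
# deploy/config/.env -- at least one LLM API key must be set
GOOGLE_API_KEY=your-api-key-here
```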
For containerized and OpenShift deployment with OAuth authentication, see deploy/README.md.
- Configuration-Based Agents: Deploy new agents and chain definitions via YAML without code changes
- Parallel Agent Execution: Run multiple agents concurrently with automatic synthesis. Supports multi-agent, replica, and comparison parallelism for A/B testing providers or strategies
- Dynamic Orchestration with Sub-Agents: Orchestrator agents use LLM reasoning to dispatch specialized sub-agents at runtime, react to partial results, and synthesize findings -- replacing static parallel chains with adaptive, multi-phase investigation flows
- MCP Server Integration: Agents dynamically connect to MCP servers for domain-specific tools (kubectl, database clients, monitoring APIs)
- Multi-LLM Provider Support: OpenAI, Google Gemini, Anthropic, xAI, Vertex AI -- configure and switch providers via YAML, with native thinking-mode support
- Force Conclusion: Automatic conclusion at iteration limits with hierarchical configuration (system, chain, stage, or agent level)
- Flexible Alert Processing: Accept arbitrary text payloads from any monitoring system
- Optional Runbook Integration: Fetch supplemental guidance from GitHub repositories to steer agent behavior
- Data Masking: Hybrid masking combining structural analysis (Kubernetes Secrets) with regex patterns to protect sensitive data
- Tool Result Summarization: Enabled by default -- LLM-powered summarization of verbose MCP outputs (>5K tokens) to reduce token usage and improve reasoning
- SRE Dashboard: Real-time monitoring with live LLM streaming and interactive chain timeline visualization
- Follow-up Chat: Continue investigating after sessions complete with full context and tool access
- Slack Notifications: Automatic notifications with thread-based message grouping via fingerprint matching
- Comprehensive Audit Trail: Full visibility into chain processing with stage-level timeline and trace views
TARSy uses a hybrid Go + Python architecture where the Go orchestrator handles all business logic, session management, and real-time streaming, while a stateless Python service manages LLM interactions over gRPC.
```
                           ┌───────────────┐
                           │  MCP Servers  │
                           │   (kubectl,   │
                           │  monitoring)  │
                           └───────┬───────┘
                                   │
┌──────────┐  WebSocket  ┌─────────┴──────────┐  gRPC   ┌──────────────┐
│ Browser  │◄───────────►│  Go Orchestrator   │◄───────►│  Python LLM  │
│ (React)  │    HTTP     │    (Echo + Ent)    │ Stream  │   Service    │
└──────────┘             └─────────┬──────────┘         └──────┬───────┘
                                   │                           │
                              PostgreSQL                Gemini / OpenAI /
                              (Ent ORM)                 Anthropic / xAI /
                                                           Vertex AI
```
- Alert arrives from monitoring systems with flexible text payload
- Chain selected based on alert type -- static parallel chains or dynamic orchestrator
- Runbook injected (optional) -- if configured, fetches supplemental guidance from GitHub to steer agent behavior
- Agents investigate -- static chains launch parallel agents per stage; orchestrator agents dynamically dispatch sub-agents based on LLM reasoning, react to partial results, and launch follow-up sub-agents as needed
- Results synthesized -- static chains use a dedicated SynthesisAgent; orchestrators synthesize within the same execution as results arrive
- Forced conclusion at iteration limits -- one final LLM call produces the best analysis with available data (no pause/resume)
- Comprehensive analysis provided to engineers with actionable recommendations
- Follow-up chat available after investigation completes
- Full audit trail captured with stage-level detail and sub-agent trace trees
| Component | Location | Tech |
|---|---|---|
| Go Orchestrator | `cmd/tarsy/`, `pkg/` | Go 1.25, Echo v5, Ent ORM, gRPC |
| Python LLM Service | `llm-service/` | Python 3.13, gRPC, Gemini, LangChain |
| Dashboard | `web/dashboard/` | React 19, TypeScript, Vite 7, MUI 7 |
| Database | `ent/` | PostgreSQL 17, Ent ORM with migrations |
| Proto Definitions | `proto/` | Protocol Buffers (gRPC service contracts) |
| Deployment | `deploy/` | Podman Compose, OAuth2-proxy, Nginx |
| E2E Tests | `test/e2e/` | Testcontainers, real PostgreSQL, WebSocket |
- `POST /api/v1/alerts` -- Submit an alert for processing (queue-based, returns `session_id`)
- `GET /api/v1/alert-types` -- Supported alert types
- `GET /api/v1/ws` -- WebSocket for real-time progress updates with channel subscriptions
- `GET /health` -- Health check with service status and queue metrics
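For example, submitting an alert from the command line might look like this (a sketch assuming the local quick-start port; the JSON field names are illustrative, not the actual schema):

```bash
# Submit an alert; the response includes a session_id for tracking
# (payload field names below are hypothetical -- check the API schema)
curl -s -X POST http://localhost:8080/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '{"alert_type": "kubernetes", "data": "Pod crash-looping in namespace prod"}'
```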
- `GET /api/v1/sessions` -- List sessions with filtering and pagination
- `GET /api/v1/sessions/active` -- Currently active sessions
- `GET /api/v1/sessions/filter-options` -- Available filter values
- `GET /api/v1/sessions/:id` -- Session detail with chronological timeline
- `GET /api/v1/sessions/:id/summary` -- Final analysis and executive summary
- `GET /api/v1/sessions/:id/status` -- Lightweight polling status (`id`, `status`, `final_analysis`, `executive_summary`, `error_message`)
- `POST /api/v1/sessions/:id/cancel` -- Cancel an active or paused session
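The status endpoint is intended for lightweight polling, e.g. (the session ID below is a placeholder):

```bash
# Check investigation progress using the session_id returned on submission
SESSION_ID="replace-with-real-session-id"
curl -s "http://localhost:8080/api/v1/sessions/${SESSION_ID}/status"
```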
- `POST /api/v1/sessions/:id/chat/messages` -- Send message (AI response streams via WebSocket)
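A follow-up message could be sent as below; note that the AI response arrives over the WebSocket rather than in the HTTP response, and the `message` field name is an assumption, not the documented schema:

```bash
# Send a follow-up chat message for a session (field name hypothetical)
curl -s -X POST "http://localhost:8080/api/v1/sessions/${SESSION_ID}/chat/messages" \
  -H "Content-Type: application/json" \
  -d '{"message": "Which pod restarted most recently?"}'
```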
- `GET /api/v1/sessions/:id/timeline` -- Session timeline events
- `GET /api/v1/sessions/:id/trace` -- List LLM and MCP interactions
- `GET /api/v1/sessions/:id/trace/llm/:interaction_id` -- LLM interaction detail with conversation reconstruction
- `GET /api/v1/sessions/:id/trace/mcp/:interaction_id` -- MCP interaction detail
- `GET /api/v1/runbooks` -- List available runbooks from configured GitHub repo
- `GET /api/v1/system/warnings` -- Active system warnings
- `GET /api/v1/system/mcp-servers` -- Available MCP servers and tools
- `GET /api/v1/system/default-tools` -- Default tool configuration
The containerized deployment provides a production-like environment:
```
Browser → OAuth2-Proxy (8080) → Go Backend (8080) → LLM Service (gRPC)
                                     ↓
                                PostgreSQL
```
- OAuth2 Authentication: GitHub OAuth integration via oauth2-proxy
- PostgreSQL Database: Persistent storage with auto-migration
- Production Builds: Optimized multi-stage container images
- Security: All API endpoints protected behind authentication
- Alert Types: Define in `deploy/config/tarsy.yaml` -- no code changes required
- MCP Servers: Add to `tarsy.yaml` with stdio, HTTP, or SSE transport
- Agents: Create Go agent classes extending BaseAgent, or define configuration-based agents in YAML
- Chains: Define multi-stage workflows in YAML with parallel execution support
- LLM Providers: Built-in providers work out-of-the-box. Add custom providers via `deploy/config/llm-providers.yaml`
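As an illustrative sketch only -- the YAML keys below are hypothetical; see `deploy/config/README.md` for the actual schema -- adding an MCP server and verifying it could look like:

```bash
# Append a server entry (keys are hypothetical; consult deploy/config/README.md)
cat >> deploy/config/tarsy.yaml <<'EOF'
mcp_servers:
  my-kubectl:
    transport: stdio
    command: kubectl-mcp-server
EOF

# Restart services (make dev), then confirm the server and its tools appear
curl -s http://localhost:8080/api/v1/system/mcp-servers
```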
```bash
make test            # Run all tests (Go + Python + Dashboard)
make test-go         # Go tests only
make test-unit       # Go unit tests
make test-e2e        # Go end-to-end tests (requires Docker/Podman)
make test-llm        # Python LLM service tests
make test-dashboard  # Dashboard tests
make help            # Show all available commands
```
```bash
make fmt             # Format code (Go + Python)
make lint            # Run linters (Go)
make ent-generate    # Regenerate Ent ORM code
make proto-generate  # Regenerate protobuf/gRPC code
make db-psql         # Connect to PostgreSQL shell
make db-reset        # Reset database
```

Database troubleshooting:

- Verify PostgreSQL is running: `make db-status`
- Check PostgreSQL logs: `make db-logs`
- Connect manually: `make db-psql`
- Reset if corrupted: `make db-reset`
