Welcome to the SRE Agent project! This open-source AI agent is here to assist your debugging, keep your systems healthy, and make your DevOps life a whole lot easier. Plug in your Kubernetes cluster, GitHub repo, and Slack, and let the agent do the heavy lifting: diagnosing, reporting, and keeping your team in the loop.
SRE Agent is your AI-powered teammate for monitoring application and infrastructure logs, diagnosing issues, and reporting diagnostics after errors. It connects directly to your stack, so you can focus on building, not firefighting.
We wanted to learn best practices, costs, security, and performance tips for AI agents in production. Our journey is open source; check out our Production Journey Page and Agent Architecture Page for the full story.
We've been writing blogs and sharing our learnings along the way. Check out our blog for insights and updates.
Contributions welcome! Join us and help shape the future of AI-powered SRE.
- Root Cause Debugging – Finds the real reason behind app and system errors
- Kubernetes Logs – Queries your cluster for logs and info
- GitHub Search – Digs through your codebase for bugs
- Slack Integration – Notifies and updates your team
- Diagnose from Anywhere – Trigger diagnostics with a simple endpoint
Powered by the Model Context Protocol (MCP) for seamless LLM-to-tool connectivity.
- Docker
- A `.env` file in your project root (see below)
- An app deployed on AWS EKS (Elastic Kubernetes Service)
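If you want a quick sanity check before continuing, the following commands (assuming you're using Docker Desktop or the Docker Engine CLI) confirm that Docker and the Compose plugin are available:

```bash
# Confirm Docker and the Compose plugin are installed and on your PATH
docker --version
docker compose version
```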
Ready to see the agent in action? Let's get you set up.
Currently, we support EKS clusters.
- Choose Option 2 and copy the credentials into `~/.aws/credentials`:

```ini
[default]
aws_access_key_id=ABCDEFG12345
aws_secret_access_key=abcdefg123456789
aws_session_token=abcdefg123456789....=
```
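To sanity-check the pasted credentials (optional, and assuming you have the AWS CLI installed), you can ask AWS who you are and confirm your cluster is reachable; `<cluster-name>` and `<region>` are placeholders for your own EKS cluster:

```bash
# Verify the credentials in ~/.aws/credentials are valid
aws sts get-caller-identity

# Verify the EKS cluster is visible with those credentials (should print "ACTIVE")
aws eks describe-cluster --name <cluster-name> --region <region> --query cluster.status
```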
You'll need some environment variables. Use our template `.env.example` and the helper script:

```bash
python credential_setup.py
```

More details: see the Credentials doc in the docs folder.
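As a rough illustration only (the authoritative variable names live in `.env.example`, so defer to that file), the generated `.env` ends up holding the tokens the services need, along these lines:

```bash
# Hypothetical example values; the real keys come from .env.example
ANTHROPIC_API_KEY=sk-ant-...          # assumed: LLM provider key
GITHUB_PERSONAL_ACCESS_TOKEN=ghp_...  # assumed: used by the GitHub MCP server
SLACK_BOT_TOKEN=xoxb-...              # assumed: used by the Slack MCP server
DEV_BEARER_TOKEN=abc123               # assumed: token used to call /diagnose
```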
Spin up all the services with Docker Compose:
```bash
docker compose up --build
```
Note: AWS credentials must be in your `~/.aws/credentials` file.
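Once the build finishes, you can optionally confirm every container is up before moving on:

```bash
# List the Compose services and their current state
docker compose ps
```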
You'll see logs like this when everything's running:
```
orchestrator-1 | FastAPI Starting production server 🚀
orchestrator-1 |
orchestrator-1 | Searching for package file structure from directories with
orchestrator-1 | __init__.py files
kubernetes-1 | ✅ Kubeconfig updated successfully.
kubernetes-1 | 🚀 Starting Node.js application...
orchestrator-1 | Importing from /
orchestrator-1 |
orchestrator-1 | module 📁 app
orchestrator-1 | ├── 🐍 __init__.py
orchestrator-1 | └── 🐍 client.py
orchestrator-1 |
orchestrator-1 | code Importing the FastAPI app object from the module with the following
orchestrator-1 | code:
orchestrator-1 |
orchestrator-1 | from app.client import app
orchestrator-1 |
orchestrator-1 | app Using import string: app.client:app
orchestrator-1 |
orchestrator-1 | server Server started at http://0.0.0.0:80
orchestrator-1 | server Documentation at http://0.0.0.0:80/docs
orchestrator-1 |
orchestrator-1 | Logs:
orchestrator-1 |
orchestrator-1 | INFO Started server process [1]
orchestrator-1 | INFO Waiting for application startup.
orchestrator-1 | INFO Application startup complete.
orchestrator-1 | INFO Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)
kubernetes-1 | 2025-04-24 12:53:00 [info]: Initialising Kubernetes manager {
kubernetes-1 | "service": "kubernetes-server"
kubernetes-1 | }
kubernetes-1 | 2025-04-24 12:53:00 [info]: Kubernetes manager initialised successfully {
kubernetes-1 | "service": "kubernetes-server"
kubernetes-1 | }
kubernetes-1 | 2025-04-24 12:53:00 [info]: Starting SSE server {
kubernetes-1 | "service": "kubernetes-server"
kubernetes-1 | }
kubernetes-1 | 2025-04-24 12:53:00 [info]: mcp-kubernetes-server is listening on port 3001
kubernetes-1 | Use the following url to connect to the server:
kubernetes-1 | http://localhost:3001/sse {
kubernetes-1 | "service": "kubernetes-server"
kubernetes-1 | }
```
This means all the services (the Slack, GitHub, and Kubernetes MCP servers, the orchestrator, and the prompt server) have started successfully and are ready to handle requests.
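If you'd like to poke at one of the MCP servers directly, you can open its SSE endpoint with curl; the Kubernetes server advertises `http://localhost:3001/sse` in the logs above. The `-N` flag disables buffering so the event stream prints as it arrives:

```bash
# Stream server-sent events from the Kubernetes MCP server (Ctrl+C to stop)
curl -N http://localhost:3001/sse
```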
Trigger a diagnosis with a simple curl command:
```bash
curl -X POST http://localhost:8003/diagnose \
  -H "accept: application/json" \
  -H "Authorization: Bearer <token>" \
  -d "text=<service>"
```
- Replace `<token>` with your dev bearer token (from `.env`)
- Replace `<service>` with the name of your target Kubernetes service
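For example, with a hypothetical dev token `abc123` and a target service called `cartservice` (both placeholders, not values from this repo), the call becomes:

```bash
curl -X POST http://localhost:8003/diagnose \
  -H "accept: application/json" \
  -H "Authorization: Bearer abc123" \
  -d "text=cartservice"
```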
The agent will do its thing and report back in your configured Slack channel.
🩺 Checking Service Health
A `/health` endpoint is available on the orchestrator service:

```bash
curl -X GET http://localhost:8003/health
```
- `200 OK` = all systems go!
- `503 Service Unavailable` = something's up; check the response for details.
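If you want to block a script or CI step until the agent is ready, a small polling loop around the health check works; `-f` makes curl treat a `503` as a failure and `-s` keeps the output quiet:

```bash
# Poll the orchestrator until it reports healthy
until curl -fs http://localhost:8003/health > /dev/null; do
  echo "Waiting for the SRE Agent orchestrator..."
  sleep 5
done
echo "Orchestrator is healthy"
```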
Want to run this in the cloud? Check out our deployment examples:
Find all the docs you need in the docs folder:
- Creating an IAM Role
- ECR Setup Steps
- Agent Architecture
- Production Journey
- Credentials
- Security Testing
Big thanks to:
- Suyog Sonwalkar for the Kubernetes MCP server
- Anthropic's Model Context Protocol team for the Slack and GitHub MCP servers
Check out our blog posts for insights and updates: