This repository demonstrates a modern Self-Healing Infrastructure workflow using the kagent AI agent. It showcases how an AI agent can bridge the gap between runtime failures in a Kubernetes cluster and the GitOps source of truth.
The demo environment simulates the daily operations of a Platform Engineer or SRE, emphasizing automated problem detection and remediation via Pull Requests.
The setup involves a complete local GitOps stack:
- Local Cluster: Managed by
k3d. - GitOps Engine:
ArgoCDmanaging the application lifecycle. - Application: A Guestbook App written in Rust.
- AI Agent:
kagentpowered by Google Gemini but you could change it to use an other provider.
- Docker
- k3d
- kubectl
The local environment is designed to be "One-Click" reproducible. Simply execute the bootstrap script to provision the entire lab:
./scripts/setup.shAfter running the setup script, you need to add an entry to your /etc/hosts file to resolve guestbook.local to your local machine. This is necessary because of the Traefik Ingress configuration used in this setup.
Add the following line to your /etc/hosts file:
127.0.0.1 guestbook.localThen you could browse the app here: http://guestbook.local:8080 and to browse ArgoCD UI: http://localhost:8080/
The previous steps established a traditional GitOps foundation. Now, we enter the AI-Ops territory by deploying kagent.
First, we need the kagent CLI to bootstrap the agent within our cluster. While Helm is an alternative, the CLI provides the most streamlined experience for this demonstration.
# Fetch and install the latest kagent binary
curl https://raw.githubusercontent.com/kagent-dev/kagent/refs/heads/main/scripts/get-kagent | bashWe will use the demo profile for a quick-start installation.
Note
The OpenAI Quirk: Currently, the kagent install command expects an OPENAI_API_KEY by default. Since our architecture explicitly uses Google Gemini, we bypass this requirement with a placeholder value to proceed with the bootstrap.
export OPENAI_API_KEY="dontcare" && kagent install --profile demoAccessing the kagent dashboard (UI)
kagent dashboard
kubectl port-forward -n kagent service/kagent-ui 8082:8080You could browse kagent UI: http://localhost:8082
To perform its duties, the agent needs two core capabilities: a reasoning engine (Gemini) and the ability to act on code (GitHub). Kubernetes capabilities are already setup by default. We store these credentials securely as Kubernetes Secrets in the kagent namespace.
kubectl create secret generic kagent-gemini \
-n kagent \
--from-literal GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
kubectl create secret generic github-pat \
-n kagent \
--from-literal GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_PATInstead of a generic assistant, we configure a Specialized SRE Agent. This is achieved by applying three specific layers of configuration:
- ModelConfig, see
kagent/kagent-gemini-config.yaml: Directs the agent to use Gemini as the primary inference engine. - MCP Server, see
kagent/kagent-github-mcp.yaml: Enables the Model Context Protocol (MCP) for GitHub, granting the agent the "tools" to read/write in the repository. - Agent Definition, see
kagent/kagent-devops-expert.yaml: The final "persona" that combines the model, the tools, and the SRE troubleshooting logic.
# Apply the Gemini configuration and GitHub connectivity (MCP)
kubectl -n kagent apply -f kagent/kagent-gemini-config.yaml
kubectl -n kagent apply -f kagent/kagent-github-mcp.yaml
# Deploy the final "DevOps Expert" agent definition
kubectl -n kagent apply -f kagent/kagent-sre-expert.yamlOur SRE agent is now ready and we will use it in the next part.
This scenario demonstrates a complete SRE recovery loop: from detecting a silent failure to automated remediation via a GitOps Pull Request.
- Action: Check the ArgoCD UI. Everything is Healthy and Synced.
- Context: The Rust Guestbook application is running. Users can access it, and Kubernetes reports no issues.
- Action: Manually edit the
gitops/guestbook/guestbook-rust.yamlto update the Liveness Probe on the wrong port (9999 instead of 3000). Commit and Push this change. - Observation: ArgoCD detects the change and syncs. The Pods start failing their health checks and enter a
CrashLoopBackOff. The ArgoCD status turns Red (Degraded).
- Action: Browse to http://localhost:8082/agents/kagent/sre-expert-agent/chat and enter the following prompt:
have a look in the guestbook-rust pod and related event, it seems there is an issue. Can you identify the root cause and propose a fix plan ?- Narration: "Instead of a manual fix, we use kagent. Powered by Gemini, it correlates live cluster events with our GitHub source code."
- Expected AI Reasoning/Steps:
- Identifies the
LivenessProbeerror in the cluster events. - Searches the Git repository for the corresponding Deployment file.
- Detects the mismatch between
containerPort: 3000andport: 9999. - Proposes a remediation action plan.
- If the remediation plan fits well, you can accept it by prompting.
- Creates the PR with the fix.
- Identifies the
Capture of remediation plan created by kagent.
Capture of kagent actions done when the remediation plan was accepted.
- Action: Navigate to the "Pull Requests" tab of your GitHub repository.
- Narration: "kagent follows strict GitOps principles. It does not patch the cluster directly; instead, it proposes a fix in the Git repository via a PR."
Capture of Pull Request created by kagent.
- Action: Merge the Pull Request on GitHub.
- Observation: ArgoCD automatically detects the merge, syncs the corrected manifest, and the Pods return to a Healthy state.
- Conclusion: "The incident is resolved with a perfect audit trail in Git, without the SRE ever having to manually edit a YAML file."
Feel free to enjoy and play with this lab/demo. Try differents scenarios by breaking others resources or test others agents (some preconfigured agents in kagent).
To clean the environment
./scripts/cleanup.sh