Better Prometheus alerts for Kubernetes - smart grouping, AI enrichment, and automatic remediation
Updated Apr 6, 2026 - Python
Runbook automation platform with deep observability integrations for SRE & On-Call Teams
🚀 SRE incident response playbooks for AWS & Kubernetes. Step-by-step troubleshooting guides to help on-call engineers resolve infrastructure issues faster.
A production-style SRE learning project demonstrating Kubernetes reliability patterns, failure handling, and observability using FastAPI, PostgreSQL, Prometheus, and Grafana. Built to understand monitoring, alerting, and recovery in cloud-native systems through intentional chaos experiments.
Portfolio-ready cybersecurity program starter kit with incident response runbooks, governance templates, and a lightweight risk register.
Telegram bot for incident management (automatic escalation, runbooks, MTTR) that reduces response time to critical incidents by up to 60%
Runbook utilities for Azure automation tasks
Production Engineering incident-response lab: SLOs, burn-rate alerts, runbooks, capacity planning, postmortems, change safety
Operational decision engine with risk scoring, policy management, and runbook automation
Sanitized support-operations portfolio showing triage, escalation design, QA coverage, and a lightweight AI triage evaluation workflow
An exchange-grade wallet monitoring tool for BTC and EVM chains. Flags large outflows and interactions with sanctioned/mixer contracts via structured JSON alerts. Includes professional multi-sig and incident response runbooks.
A platform for running runbooks or SOPs in a Kubernetes cluster, written in Python and implemented as a Kubernetes operator (kopf).
Automated system and database troubleshooting with AI runbooks, Selenium agent, React dashboard, and Flask log analyzer
Production SRE runbooks, SLO calculators, and incident automation — Stripe/Coinbase/Zoom ready
Meta-style Release-to-Production validation CLI with fleet aggregation, enforce-mode gating exit codes, and explainable checks.
Python runbooks for an Azure Automation account
Self-improving starter skill and operator toolkit for running Codex or Claude Code as the orchestrator over Symphony workers with Linear-managed execution.
A sample function to run behind OCI API Gateway to run with OCI Generative AI Chat.
A durable execution layer for AI-assisted operational runbooks with approvals, retries, and audit history.
Enrich Alertmanager webhooks by correlating them with OpenSearch logs and delivering root cause analysis through Telegram channels.