A production-inspired demo showcasing governed autonomy in system operations — where safe issues heal themselves, and risky actions require human approval.
This project demonstrates how machine learning + policy + human judgment can work together to build trustworthy self-healing systems.
```mermaid
flowchart TD
    classDef safe fill:#e6fffa,stroke:#0f766e,stroke-width:1px;
    classDef risky fill:#fff1f2,stroke:#be123c,stroke-width:1px;
    classDef core fill:#eef2ff,stroke:#3730a3,stroke-width:1px;
    A["System Metrics (Synthetic / Live)"] --> B["Feature Builder"]
    B --> C["ML Severity Classifier"]
    C --> D["Decision Engine"]
    D -->|"AUTO_HEAL"| E["Auto Remediation"]
    D -->|"HITL_REQUIRED"| F["Human Approval UI"]
    E --> G["Action Taken"]
    F --> H["Approve / Reject"]
    G --> I["Audit Log (Governance)"]
    H --> I
    class E safe;
    class F risky;
    class D core;
```
This system continuously evaluates system health signals such as:
- Disk usage
- Memory usage
- CPU utilization
- DNS latency
- Firewall change requests
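For the synthetic mode, these signals can be produced by a small generator. A minimal sketch follows; the function name, value ranges, and seeding behavior are assumptions for illustration, not the project's actual API:

```python
# Hypothetical synthetic-metrics generator covering the signals above.
# Ranges are illustrative; a fixed seed makes runs reproducible.
import random

def sample_metrics(seed=None):
    rng = random.Random(seed)
    return {
        "disk_usage": round(rng.uniform(10, 99), 1),        # %
        "memory_usage": round(rng.uniform(10, 99), 1),      # %
        "cpu_usage": round(rng.uniform(1, 100), 1),         # %
        "dns_latency": round(rng.uniform(5, 300), 1),       # ms
        "firewall_change_requested": rng.choice([0, 1]),    # 0/1 flag
    }
```

In live mode the same dictionary shape would instead be filled from real host metrics, so downstream code is agnostic to the source.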
Using a trained ML model, it classifies the operational risk level and decides one of three outcomes:
| Decision | Meaning |
|---|---|
| ✅ NO_ACTION | System is healthy |
| ⚡ AUTO_HEAL | Safe remediation executed automatically |
| ✋ HITL_REQUIRED | Risky action requires human approval |
The result is a clear, explainable, and auditable remediation workflow.
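The decision step can be sketched as a deterministic mapping from the classifier's severity class to one of the three outcomes, plus a policy guardrail. The guardrail shown here (security-sensitive requests always route to a human) is an assumption used for illustration:

```python
# Hypothetical decision engine: maps model output to a governed outcome.
# The firewall guardrail is an illustrative policy, not the project's code.
from enum import Enum

class Decision(Enum):
    NO_ACTION = 0
    AUTO_HEAL = 1
    HITL_REQUIRED = 2

def decide(severity_class, firewall_change_requested):
    # Policy overrides model: security-sensitive changes always need approval.
    if firewall_change_requested:
        return Decision.HITL_REQUIRED
    # Otherwise the mapping is deterministic: same class, same decision.
    return Decision(severity_class)
```

Because the mapping contains no randomness, the same inputs always produce the same decision, which is what makes the workflow auditable.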
Many “self-healing” systems fail because they:
- Automate everything (high risk), or
- Rely entirely on humans (slow and unscalable)
This project demonstrates a balanced approach:
Autonomy where safe.
Human control where risky.
This mirrors real-world operational decision-making in:
- Infrastructure platforms
- SRE / DevOps systems
- Security operations
- AI governance systems
- ML for risk classification, not blind automation
- Deterministic decisions (no randomness in outcomes)
- Human-in-the-Loop for high-blast-radius actions
- Clear separation of concerns
- Auditability by design
The model's task is to classify operational severity, not to forecast complex time series.
- disk_usage (%)
- memory_usage (%)
- cpu_usage (%)
- dns_latency (ms)
- firewall_change_requested (0/1)
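A minimal feature builder would assemble these signals into a fixed-order vector before classification. The function name and ordering below are assumptions for illustration:

```python
# Hypothetical feature builder: enforces a deterministic feature order
# so the classifier always sees inputs in the same positions.
FEATURE_ORDER = [
    "disk_usage",
    "memory_usage",
    "cpu_usage",
    "dns_latency",
    "firewall_change_requested",
]

def build_features(metrics):
    """Return features in FEATURE_ORDER; a missing key fails loudly."""
    return [float(metrics[name]) for name in FEATURE_ORDER]

sample = {
    "disk_usage": 91.0,
    "memory_usage": 55.2,
    "cpu_usage": 34.7,
    "dns_latency": 18.0,
    "firewall_change_requested": 0,
}
```

Failing loudly on a missing metric is deliberate: silently imputing a default could mask a broken collector and corrupt the severity classification.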
| Class | Meaning |
|---|---|
| 0 | NO_ACTION |
| 1 | AUTO_HEAL |
| 2 | HITL_REQUIRED |
The model is trained locally using MLflow and exported as an approved artifact.
Actions that are:
- Reversible
- Low blast radius
- Operationally safe
Examples:
- Disk cleanup
- Memory cleanup
- Restarting a high-CPU service
Actions that are:
- Security-sensitive
- Network-impacting
- Potentially disruptive
Examples:
- DNS configuration changes
- Firewall port modifications
This distinction is intentional and realistic.
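The safe/risky split above can be encoded as a simple action registry. The action names and the fail-closed default below are assumptions for illustration:

```python
# Hypothetical action registry encoding the safe vs. risky split.
# Action names are illustrative, not the project's actual identifiers.
SAFE_ACTIONS = {"disk_cleanup", "memory_cleanup", "restart_high_cpu_service"}
RISKY_ACTIONS = {"dns_config_change", "firewall_port_change"}

def requires_approval(action):
    """Safe actions run automatically; everything else goes to a human."""
    if action in SAFE_ACTIONS:
        return False
    # Fail closed: unknown actions are treated as risky by default.
    return True
```

The fail-closed default matters: an action that was never explicitly classified should require approval rather than run unattended.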
When HITL is required, the UI shows:
- Current system metrics
- Proposed actions
- Approval / rejection controls
- Notes for context
Every human decision is logged for auditability.
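A JSON-lines audit logger is one straightforward way to record those decisions. The schema, function name, and file path below are assumptions, not the project's actual format:

```python
# Hypothetical append-only audit logger for human approval decisions.
# Each decision becomes one JSON line; the schema is illustrative.
import json
from datetime import datetime, timezone

def log_decision(action, approved, notes, path="audit_log.jsonl"):
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "approved": approved,
        "notes": notes,
    }
    # Append-only writes preserve the full decision history for review.
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

An append-only file keeps the trail tamper-evident in spirit: past entries are never rewritten, only added to.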
- 🔄 Refresh button to pull live metrics
- 📊 Live system diagnostics
- 🧭 Clear decision banner
- ⚡ Auto-healing status
- ✋ HITL approval workflow
- 🧾 Audit trail (JSON-based)
Designed to feel like a real operations console, not a toy demo.
```bash
# Train the severity model and export the approved artifact
python -m model.train_severity_model

# Launch the operations console
streamlit run app.py
```