sarkarbikram90/symmetrical-engine

# 🛠️ Autonomous System Diagnostics with Auto-Healing & Human-in-the-Loop (HITL)

A production-inspired demo showcasing governed autonomy in system operations — where safe issues heal themselves, and risky actions require human approval.

This project demonstrates how machine learning + policy + human judgment can work together to build trustworthy self-healing systems.


## 🧩 High-Level Workflow

```mermaid
flowchart TD
    classDef safe fill:#e6fffa,stroke:#0f766e,stroke-width:1px;
    classDef risky fill:#fff1f2,stroke:#be123c,stroke-width:1px;
    classDef core fill:#eef2ff,stroke:#3730a3,stroke-width:1px;

    A["System Metrics (Synthetic / Live)"] --> B["Feature Builder"]
    B --> C["ML Severity Classifier"]
    C --> D["Decision Engine"]

    D -->|"AUTO_HEAL"| E["Auto Remediation"]
    D -->|"HITL_REQUIRED"| F["Human Approval UI"]

    E --> G["Action Taken"]
    F --> H["Approve / Reject"]

    G --> I["Audit Log (Governance)"]
    H --> I

    class E safe;
    class F risky;
    class D core;
```


## 🚀 What This Project Does

The system continuously evaluates health signals such as:

- Disk usage
- Memory usage
- CPU utilization
- DNS latency
- Firewall change requests

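The flowchart's "System Metrics (Synthetic / Live)" source can run in synthetic mode. A minimal sketch of a synthetic snapshot generator, using stdlib only (function and field names here are illustrative, not the repo's actual identifiers):

```python
import random

def sample_synthetic_metrics(seed=None):
    """Generate one synthetic metrics snapshot (field names are illustrative)."""
    rng = random.Random(seed)
    return {
        "disk_usage": round(rng.uniform(10, 99), 1),      # percent
        "memory_usage": round(rng.uniform(10, 99), 1),    # percent
        "cpu_usage": round(rng.uniform(1, 99), 1),        # percent
        "dns_latency": round(rng.uniform(5, 500), 1),     # milliseconds
        "firewall_change_requested": rng.choice([0, 1]),  # binary flag
    }

snapshot = sample_synthetic_metrics(seed=42)
```

Seeding keeps demo runs reproducible, which matches the project's deterministic-decisions principle.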
Using a trained ML model, it classifies the operational risk level and decides one of three outcomes:

| Decision | Meaning |
| --- | --- |
| `NO_ACTION` | System is healthy |
| `AUTO_HEAL` | Safe remediation executed automatically |
| `HITL_REQUIRED` | Risky action requires human approval |

The result is a clear, explainable, and auditable remediation workflow.
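A minimal sketch of how the decision engine might map the classifier's output class to one of the three outcomes (the mapping values come from the table above; the function name and fail-safe behavior are assumptions):

```python
# Deterministic mapping from severity class to decision label.
DECISIONS = {0: "NO_ACTION", 1: "AUTO_HEAL", 2: "HITL_REQUIRED"}

def decide(predicted_class: int) -> str:
    """Translate a severity class into an operational decision."""
    try:
        return DECISIONS[predicted_class]
    except KeyError:
        # Unknown classes fail safe: escalate to a human rather than act.
        return "HITL_REQUIRED"
```

Failing closed on unknown classes keeps unexpected model outputs from triggering unreviewed actions.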


## 🎯 Why This Problem?

Many “self-healing” systems fail because they:

- Automate everything (high risk)
- Or rely entirely on humans (slow, unscalable)

This project demonstrates a balanced approach:

Autonomy where safe.
Human control where risky.

This mirrors real-world operational decision-making in:

- Infrastructure platforms
- SRE / DevOps systems
- Security operations
- AI governance systems

## 🧠 Core Design Principles

- ML for risk classification, not blind automation
- Deterministic decisions (no randomness in outcomes)
- Human-in-the-Loop for high-blast-radius actions
- Clear separation of concerns
- Auditability by design

## 🤖 The ML Model (Simple, Explainable, Purposeful)

### Model Objective

Classify operational severity, rather than forecast a complex time series.

### Input Features

- `disk_usage` (%)
- `memory_usage` (%)
- `cpu_usage` (%)
- `dns_latency` (ms)
- `firewall_change_requested` (0/1)

### Output Classes

| Class | Meaning |
| --- | --- |
| 0 | `NO_ACTION` |
| 1 | `AUTO_HEAL` |
| 2 | `HITL_REQUIRED` |

The model is trained locally using MLflow and exported as an approved artifact.
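The repo's actual training script isn't shown here; as a sketch of the idea, a scikit-learn decision tree fit on synthetic data with an illustrative labeling rule (the labeling thresholds are assumptions, and the MLflow tracking step is omitted):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def label(row):
    """Illustrative labeling rule, not the project's real training data:
    firewall changes and very slow DNS need a human; high resource use is auto-healable."""
    disk, mem, cpu, dns, fw = row
    if fw == 1 or dns > 300:
        return 2  # HITL_REQUIRED
    if disk > 85 or mem > 85 or cpu > 90:
        return 1  # AUTO_HEAL
    return 0      # NO_ACTION

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 100, 500),  # disk_usage (%)
    rng.uniform(0, 100, 500),  # memory_usage (%)
    rng.uniform(0, 100, 500),  # cpu_usage (%)
    rng.uniform(0, 500, 500),  # dns_latency (ms)
    rng.integers(0, 2, 500),   # firewall_change_requested (0/1)
])
y = np.array([label(r) for r in X])

# A decision tree keeps the severity model inspectable and explainable.
model = DecisionTreeClassifier(random_state=0).fit(X, y)
```

A tree model suits the "explainable" goal: each prediction can be traced to a handful of threshold comparisons on named features.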


## 🔁 Auto-Healing vs HITL (By Design)

### ✅ Auto-Healing (Safe)

Actions that are:

- Reversible
- Low blast radius
- Operationally safe

Examples:

- Disk cleanup
- Memory cleanup
- Restarting a high-CPU service

### ✋ Human-in-the-Loop (Risky)

Actions that are:

- Security-sensitive
- Network-impacting
- Potentially disruptive

Examples:

- DNS configuration changes
- Firewall port modifications

This distinction is intentional and realistic.
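One way to encode that distinction is an explicit allow-list policy that fails closed. A minimal sketch (the action names are assumptions, not the repo's identifiers):

```python
# Illustrative action policy: safe actions are auto-healable, everything else escalates.
SAFE_ACTIONS = {"disk_cleanup", "memory_cleanup", "restart_service"}

def requires_human_approval(action: str) -> bool:
    """Return True when an action is outside the safe allow-list."""
    if action in SAFE_ACTIONS:
        return False
    # Anything unrecognized is treated as risky by default (fail closed).
    return True
```

An allow-list (rather than a deny-list) means a newly added remediation is risky until someone explicitly reviews it, which matches the governance framing above.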


## 🧑‍⚖️ Human-in-the-Loop Experience

When HITL is required, the UI shows:

- Current system metrics
- Proposed actions
- Approval / rejection controls
- Notes for context

Every human decision is logged for auditability.
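The audit trail is JSON-based; a minimal sketch of an append-only JSON-lines writer (the function name and record schema are assumptions for illustration):

```python
import json
import time

def append_audit_record(path, decision, actor, notes=""):
    """Append one JSON line per decision (schema is illustrative)."""
    record = {
        "timestamp": time.time(),
        "decision": decision,  # e.g. "AUTO_HEAL", "HITL_REQUIRED"
        "actor": actor,        # "system" for auto-heal, a username for HITL
        "notes": notes,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Append-only JSON lines make every automated and human decision replayable after the fact, which is the point of "auditability by design."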


## 📊 Streamlit UI Features

- 🔄 Refresh button to pull live metrics
- 📊 Live system diagnostics
- 🧭 Clear decision banner
- ⚡ Auto-healing status
- ✋ HITL approval workflow
- 🧾 Audit trail (JSON-based)

Designed to feel like a real operations console, not a toy demo.

## 🧪 How to Run Locally

### Train the model

```shell
python -m model.train_severity_model
```

### Run the UI

```shell
streamlit run app.py
```
