Predictive condition monitoring for Kubernetes-based pump stations on ABB Edgenius
Detect. Correlate. Explain. Forecast — before the failure reaches the field.
EdgeMind watches an entire industrial pipeline from the outside — using only standard infrastructure metrics (CPU, memory, network, disk, PVC) scraped from Prometheus. Five specialized AI agents detect anomalies, a deterministic correlation engine bundles related signals, and an LLM orchestrator traces them back to a single root cause with a plain-English explanation a field engineer can act on. A Dynamic Mode Decomposition (DMD) agent forecasts resource failures before they happen.
Zero workload modification. Zero custom instrumentation. Zero sidecars.
- The Problem
- The Solution
- Key Features
- System Architecture
- Layer 0 — The Pump-Station Pipeline
- Layer 1 — Detection Agents
- Layer 2 — AI Correlation & Orchestration
- DMD Early Warning
- The Dashboard
- End-to-End Workflow
- Finding Schema
- Tech Stack
- Repository Structure
- Getting Started
- Fault Injection & Demo Scenarios
- API Reference
- Testing
- Roadmap
In an industrial edge deployment, a single pump-station failure rarely stays contained. A bearing fault on one sensor cascades: the data collector spikes, the historian gets overloaded, the feature pipeline starves for CPU, and downstream export services contend for the same disk. By the time an operator sees a dozen red alerts, the real question — "which one started it?" — is buried under noise.
Traditional monitoring tells you what is on fire. It does not tell you why, in what order, or what's about to break next.
EdgeMind adds three layers of intelligence on top of a running Kubernetes workload:
| Layer | What it does | |
|---|---|---|
| 🟦 | Detection | 5 domain agents read Prometheus metrics and emit structured findings with evidence, baselines, and ETAs |
| 🟨 | Correlation | A deterministic 45-second window engine bundles related findings across agents — no LLM, fully reproducible |
| 🟪 | Reasoning | An OpenAI orchestrator uses live tools (PromQL, pod logs, K8s events) to trace the causal chain to a single root cause |
The result is delivered to a real-time React command center: one incident, one root cause, one recommendation — plus a forecast of what's coming.
- Multi-agent anomaly detection — CPU, Memory, Storage, Network/Log, and DMD agents, each an independent specialist
- Deterministic correlation — reproducible 45s windowing groups multi-agent signals before any AI is involved
- LLM root-cause orchestration — tool-calling agent that queries Prometheus, reads pod logs, and inspects K8s events to reason from evidence (not assumptions)
- DMD predictive early warning — Dynamic Mode Decomposition forecasts resource breaches minutes before they occur
- 6-page operations dashboard — command center, resource radar, dependency graph, anomaly timeline, AI investigation, and a live demo lab
- Industrial Copilot — chat with live telemetry and incident context
- Built-in fault injection — 7 physical fault modes + 4 scripted cascade scenarios for repeatable demos
- Infra-only & non-invasive — no SDK changes, no sidecars, watches any workload from the outside
- Fully tested — agents and pipeline services run their unit suites without a cluster
EdgeMind is a three-layer system running across two Kubernetes namespaces.
flowchart TB
subgraph PS[" pump-station namespace — the monitored workload (Layer 0)"]
direction LR
S1["sensor-sim-1<br/>(OPC-UA pump)"]
S2["sensor-sim-2<br/>(OPC-UA pump)"]
S3["sensor-sim-3<br/>(OPC-UA pump)"]
COL["opc-ua-collector"]
HIST["data-historian<br/>(InfluxDB)"]
FE["feature-extractor"]
HS["health-scorer"]
AM["alert-manager"]
BS["batch-sync"]
MU["mock-upload"]
S1 & S2 & S3 --> COL --> HIST --> FE --> HS --> AM --> BS --> MU
end
subgraph MON[" monitoring namespace — EdgeMind (Layers 1 & 2)"]
direction TB
PROM["Prometheus<br/>(node-exporter + kube-state-metrics)"]
subgraph AG["edgemind-agents (Layer 1)"]
CPU["CPU agent"]
MEM["Memory agent"]
STO["Storage agent"]
NET["Network + Log agent"]
DMD["DMD agent "]
end
REDIS[("Redis<br/>edgemind:findings")]
subgraph SRV["edgemind-server (Layer 2)"]
CF["Correlation Filter<br/>(45s window, deterministic)"]
ORCH["AI Orchestrator<br/>(OpenAI + tools)"]
end
DASH["React Dashboard<br/>(WebSocket)"]
end
PS -. scraped metrics .-> PROM
PROM --> AG
AG --> REDIS
REDIS --> CF --> ORCH
ORCH -- tool calls --> PROM
ORCH --> DASH
The golden rule: EdgeMind agents talk to Prometheus only. They never connect to InfluxDB, the OPC-UA bus, or any workload API. This makes EdgeMind portable to any Kubernetes workload — the pump station is just the demonstration target.
A realistic 10-service industrial data pipeline: three OPC-UA pump simulators feed a collect → store → analyze → score → alert → export chain. Every service exposes a fault-injection or behavior API so cascades can be triggered on demand.
| Component | Role | Tech |
|---|---|---|
| sensor-sim-1/2/3 | OPC-UA pump simulators with physical fault-injection API | asyncua, FastAPI |
| opc-ua-collector | Subscribes to OPC-UA tags, writes to the historian | asyncua, influxdb-client |
| data-historian | Time-series store | InfluxDB 2.x |
| feature-extractor | Computes bearing health, vibration & temperature trends | scipy, numpy |
| health-scorer | Classifies pump state → HEALTHY / WARNING / CRITICAL | pure Python |
| alert-manager | Enriches, deduplicates, writes alert JSONL to shared PVC | FastAPI |
| batch-sync | Bulk Parquet export to PVC, simulates cloud upload | pandas, pyarrow |
| mock-upload | Simulated cloud upload endpoint | FastAPI |
Five agents poll Prometheus every 15 seconds, maintain rolling statistical baselines, and publish structured findings to Redis. Each is a focused specialist.
cpu_spike · cpu_throttle · cpu_contention
Absolute + relative gating over a rolling window (z-score used only for severity grading, not the trigger — idle pods produce huge z-scores on trivial bumps). Requires sustained breach across multiple cycles to suppress single-scrape noise. Flags node-level contention when multiple pods spike together.
memory_leak · pre_oom · oomkill_detected · node_memory_pressure · memory_step_change
Linear regression over the RSS window computes a leak slope in MB/min. pre_oom projects an ETA to OOM at the current growth rate. OOMKills are confirmed against pre-restart working-set/limit ratios.
io_saturation · write_burst · pvc_fill · pvc_contention · restart_linked_to_io
Z-score on write-rate for burst detection; linear regression on PVC used_bytes for time-to-full forecasting. PVC→pod mapping is built live from the Kubernetes API. Contention fires when ≥2 high-I/O pods share one PVC.
network_flood · packet_drop · dependency_confirmed · log_error_surge · timeout_pattern · pump_health_critical · crash_loop · k8s_oomkill
P75 baseline flood detection; cross-pod dependency confirmation (TX spike on A correlated with RX spike on B within a lag window). Parses health-scorer structured logs for pump-level fault signatures and watches Kubernetes events for CrashLoopBackOff / OOMKilling / BackOff.
dmd_cpu_forecast · dmd_mem_forecast · dmd_io_forecast · dmd_net_forecast · dmd_instability
Forecasts resource breaches before they occur using Dynamic Mode Decomposition. See below ↓
Before any AI runs, a pure-Python engine (correlation_filter.py) reads findings from Redis and groups them into 45-second windows, deduplicating by (anomaly_type, pod). It triggers the orchestrator on one of three conditions:
| Trigger | Condition |
|---|---|
multi_agent |
2+ findings from different agents in the window |
single_critical |
one CRITICAL finding of a high-confidence type (OOMKill, memory leak, PVC fill, I/O saturation) |
single_agent_significant |
a self-sufficient signal (e.g. pump_health_critical) — deferred until window close so all evidence accumulates |
A 10-minute cooldown prevents alert storms. Relative-threshold detectors (cpu_spike, network_flood) deliberately require corroboration to fire — eliminating the near-idle false-positive class.
When triggered, the orchestrator (orchestrator.py) runs an agentic tool-calling loop against an OpenAI-compatible model. It is given the pipeline topology and the correlated bundle, then investigates with live tools:
sequenceDiagram
participant CF as Correlation Filter
participant O as AI Orchestrator
participant P as Prometheus
participant K as Kubernetes
participant D as Dashboard
CF->>O: CorrelatedSignalBundle (findings + topology)
loop Agentic investigation (≤60s)
O->>P: query_prometheus(promql)
O->>K: get_pod_logs(pod) / get_kubernetes_events()
Note over O: Trace causal chain upstream<br/>to the earliest abnormal pod
end
O->>D: Root cause + causal chain + confidence + insight + recommendation
The orchestrator is constrained to reason from observed evidence only — it must cite real metrics/log lines or explicitly state "no errors found," and lower its confidence when evidence is inconclusive. Output is a structured verdict:
{
"root_cause_pod": "sensor-sim-2",
"causal_chain": ["sensor-sim-2", "opc-ua-collector", "data-historian"],
"alert_type": "cascade",
"confidence": 0.92,
"insight": "The Pump 2 sensor switched to flood mode, overwhelming the data collection service...",
"recommendation": "Throttle sensor-sim-2 emission rate or scale the collector."
}Confidence scoring — ≥0.9 multi-agent agreement + topological ordering · 0.7–0.9 two agents agree · 0.5–0.7 single/ambiguous · <0.5 flagged for manual investigation.
Most monitoring is reactive: it fires when a threshold is already crossed. EdgeMind's DMD agent is predictive.
Dynamic Mode Decomposition (Schmid 2010) is a technique from fluid dynamics that decomposes a multivariate time series into spatial-temporal modes, each with a growth/decay rate. EdgeMind applies Exact DMD (dmd_core.py) to live resource metrics:
- Build a snapshot matrix of recent metric history
- Thin/truncated SVD → reduced linear operator
à - Eigendecomposition → eigenvalues
λand DMD modesΦ - Convert each eigenvalue to a continuous growth rate
σ = log|λ| / dt - A positive dominant growth rate means a metric is on an exponential climb toward its limit → emit a forecast finding before the breach
This means EdgeMind can warn "feature-extractor memory is trending to OOM in ~6 minutes" while the value is still nominal — turning a 2am page into a planned intervention.
A real-time React 18 single-page app, fed by a FastAPI WebSocket. Six purpose-built views:
| Route | Page | Purpose |
|---|---|---|
/ |
Command Center | Live ops overview — cluster vitals, active incidents, pump digital twins, top risky pods, forecast strip |
/radar |
Resource Radar | Per-pod deep dive — CPU/mem/IO charts, app-specific panels, a 3D pump scene, fault injection controls |
/graph |
Correlation Map | Interactive dependency graph with live node health and animated causal-path overlay |
/timeline |
Anomaly Timeline | Scrollable event timeline with correlation brackets linking findings to incidents |
/investigate |
AI Investigation | Full orchestrator output — causal chain steps, evidence matrix, "why-not" reasoning, confidence tiers |
/demo |
Demo Lab | One-click scenario launcher + manual fault controls + live cascade monitor |
Plus a floating EdgeMind Copilot chat that answers questions against live telemetry, findings, and alerts.
![]() Correlation Map — live dependency graph |
![]() AI Investigation — root cause & evidence |
![]() Anomaly Timeline — correlated events |
![]() Demo Lab — scenario launcher |
flowchart LR
A["Fault injected<br/>(e.g. flood on Pump 2)"] --> B["Metrics shift<br/>in Prometheus"]
B --> C["Agents detect<br/>findings → Redis"]
C --> D["Correlation filter<br/>bundles 45s window"]
D --> E["Orchestrator<br/>investigates with tools"]
E --> F["Root cause +<br/>recommendation"]
F --> G["Dashboard<br/>live update via WS"]
C -.-> H["DMD forecasts<br/>breach early"]
H -.-> G
A typical cascade, start to finish:
- Operator injects
floodon sensor-sim-2 (10× emission rate) - Network/Log agent detects
network_flood; CPU agent detectscpu_spikeon the collector - Two different agents in one 45s window →
multi_agenttrigger - Orchestrator queries Prometheus, reads collector logs, traces upstream to the sensor
- Verdict: root cause = sensor-sim-2, cascade, 92% confidence — pushed to every dashboard in real time
Every finding published to Redis is a self-describing evidence package:
{
"finding_id": "uuid",
"timestamp": "2026-06-13T15:54:44+00:00",
"agent": "cpu | memory | storage | network_log | dmd",
"anomaly_type": "cpu_spike",
"severity": "warning | critical | info",
"pod": "opc-ua-collector",
"namespace": "pump-station",
"current_value": 0.82,
"baseline_value": 0.05,
"deviation": "Z-score 4.2σ above 75-cycle baseline",
"evidence": ["cpu rose 0.011 → 0.0225 cores over 3 cycles", "..."],
"affected_pods": [],
"pvc_name": null,
"eta_minutes": null
}| Domain | Technologies |
|---|---|
| Orchestration | Kubernetes (k3s / k3d), Helm |
| Backend | Python, FastAPI, Uvicorn, asyncio |
| AI / ML | OpenAI (tool-calling orchestrator), NumPy + SciPy (DMD, regression, z-score) |
| Messaging / State | Redis (async), WebSockets |
| Observability | Prometheus, node-exporter, kube-state-metrics |
| Data pipeline | OPC-UA (asyncua), InfluxDB 2.x, pandas, pyarrow |
| Frontend | React 18, Vite, React Router, Recharts, D3, Three.js / React-Three-Fiber |
| Testing | pytest |
k8s-Pod-Resource-AI-Driven-Correlation/
├── k8s/ ← Kubernetes manifests
│ ├── pump-station/ ← 10 service manifests + PVC / secrets / quota
│ └── monitoring/ ← Prometheus, Redis, edgemind-agents, server, RBAC
├── edgemind_agents/ ← Layer 1: detection agents
│ ├── agents/ ← cpu, memory, storage, network_log, dmd, base
│ ├── dmd_core.py ← Exact DMD math engine (SVD + eig)
│ ├── anomaly_types.py ← all anomaly types, thresholds, config
│ └── collector.py ← Prometheus scrape loop
├── edgemind_server/ ← Layer 2: correlation + AI + API
│ ├── correlation_filter.py ← deterministic 45s windowing
│ ├── orchestrator.py ← OpenAI tool-calling root-cause engine
│ ├── dependency_graph.py ← live topology for prompts & UI
│ ├── tools.py ← query_prometheus / get_pod_logs / get_kubernetes_events
│ └── main.py ← FastAPI app + WebSocket
├── edgemind-dashboard/ ← React 18 + Vite frontend (6 pages)
├── sensor_sim/ ← OPC-UA pump simulators (Layer 0)
├── opc_ua_collector/ ← OPC-UA → InfluxDB
├── feature_extractor/ ← InfluxDB → pump features
├── health_scorer/ ← features → pump health state
├── alert_manager/ ← health state → enriched alerts
├── batch_sync/ · mock_upload/ ← export & simulated cloud upload
├── common/ ← shared contract (schema, constants)
├── deploy.sh ← one-command cluster deploy
├── docker-compose.yml ← local (non-k8s) bring-up
└── SETUP.md ← deployment & operations guide
docker · k3d · kubectl · helm
# Build all images, create the cluster, and deploy both namespaces
bash deploy.sh
# Already built images? Skip the build step
bash deploy.sh --skip-buildThe script provisions the edgemind k3d cluster, installs Prometheus + Redis, and rolls out the pump-station pipeline, detection agents, server, and dashboard. See SETUP.md for detailed operations, port-forwarding, and troubleshooting.
bash start-port-forwards.sh # forwards dashboard, server, and sensor-sim APIs
# then open the dashboard URL printed by the scriptdocker-compose up # brings up the pipeline + agents + Redis without KubernetesNote: the AI orchestrator needs an
LLM_API_KEY(OpenAI-compatible). Set it as a Kubernetes secret (llm-credentials) or an env var —LLM_BASE_URLandLLM_MODELare configurable. Detection, correlation, and DMD all run without a key; only the root-cause narrative requires the LLM.
Each sensor-sim exposes an HTTP injection API with 7 physical fault modes:
| Mode | Physical meaning |
|---|---|
bearing_fault |
Bearing degradation — axial vibration + bearing-health drop |
imbalance |
Rotational imbalance — radial + tangential vibration |
seal_leak |
Seal degradation — gradual temperature rise |
cavitation |
Fluid cavitation — broadband vibration spike (sustained) |
overheat |
Thermal fault — temperature ramp past safe threshold |
flood |
Emission rate 1 Hz → 10 Hz, triggering a network/CPU cascade |
sensor_noise |
Random noise across all readings |
# Inject a bearing fault on Pump 2 for 300 seconds
kubectl exec -n pump-station \
$(kubectl get pod -n pump-station -l app=sensor-sim-2 -o jsonpath='{.items[0].metadata.name}') \
-- curl -s -X POST http://localhost:8081/inject \
-H 'Content-Type: application/json' \
-d '{"mode":"bearing_fault","duration_s":300}'| # | Scenario | Trigger | What EdgeMind shows |
|---|---|---|---|
| 1 | Network Flood Cascade | flood on sensor-sim-2 | network_flood + cpu_spike → multi-agent cascade verdict |
| 2 | Memory Leak (Feature Extractor) | LEAK_MODE | memory_leak → pre_oom ETA → DMD memory forecast |
| 3 | PVC Fill (Export Data) | continuous alert writes | pvc_fill + write_burst → time-to-full forecast |
| 4 | Bearing Fault → Health Drop | bearing_fault on sensor-sim-2 | pump_health_critical propagated to health-scorer |
The server (edgemind_server/main.py) exposes:
| Method | Endpoint | Returns |
|---|---|---|
GET |
/health |
liveness |
GET |
/api/alerts |
correlated AI incidents |
GET |
/api/findings |
raw agent findings |
GET |
/api/dependency-graph · /api/graph |
live pipeline topology |
GET |
/api/metrics |
per-pod telemetry snapshot |
GET |
/api/agent-status |
agent heartbeats |
GET |
/api/dmd-forecasts |
DMD early-warning findings |
DELETE |
/api/alerts |
clear incidents (resets correlation cooldown) |
POST |
/api/chat |
Copilot query against live context |
WS |
/ws |
real-time push of findings, alerts, metrics |
# Detection agents — all mocked, no cluster needed
pytest edgemind_agents/tests/ -v # CPU, Memory, Storage, Network/Log, Schema
# Pipeline services
pytest sensor_sim/tests/ -v # fault engine, OPC-UA server, inject API
pytest feature_extractor/tests/ -v # feature math
pytest health_scorer/tests/ -v # scoring logic
pytest alert_manager/tests/ -v # enrichment + dedup
pytest batch_sync/tests/ -v # export state, parquet round-tripEdgeMind — built for the ABB Accelerator hackathon. Watching the whole stack, from the outside in.





