---
title: alert-rule-classification-mapping
authors:
- "@sradco"
reviewers:
- "@jan--f"
- "@jgbernalp"
approvers:
- "@jan--f"
- "@jgbernalp"
api-approvers:
- TBD
creation-date: 2026-01-07
last-updated: 2026-01-19
tracking-link:
- ""
---
# Alert Rule Classification Mapping and Layer Defaults

## Summary
This enhancement defines how OpenShift monitoring assigns each alerting rule (and the alerts it emits) a stable, user-visible classification: a `component` and an impact `layer`. It documents the defaulting and fallback behavior, how users can override classification through a persisted ConfigMap record, and how the Alerts API exposes the effective values in a Prometheus-compatible response for UI filtering and display.

Related enhancement: [Alerts UI Management](alerts-ui-management.md)

## Motivation
- Enable computation of component and cluster-level health to help the UI prioritize clusters needing attention, highlight failing components, and speed troubleshooting in single and multi-cluster views.
- Provide a single source of truth for classification behavior across backend and UI.
- Ensure consistent and stable `component` and `layer` values for alerts and rules.
- Eliminate historical ambiguity where `layer` could be empty, by defining defaulting rules.
- Document how user overrides are validated and persisted to enable GitOps-friendly workflows.

## Proposal

### User Stories
- As a cluster admin, I want consistent component and layer on alerts/rules so I can prioritize clusters needing attention and identify failing components quickly.
- As a platform SRE, I want platform alerts to default to the cluster layer so global issues are clearly surfaced.
- As an application owner, I want workload alerts to default to the namespace layer so I can filter and troubleshoot within my project scope.

### Goals
1. Define allowed `layer` values and defaulting rules.
2. Document classifier-based mapping and fallback behavior.
3. Standardize persistence and override schema in a per-`PrometheusRule` ConfigMap.
4. Specify API enrichment fields for alerts and rules, and expected UI filters/columns.
5. Maintain compatibility with Prometheus/Thanos schemas (additive enrichment only).

### Non-Goals
- Changing upstream Prometheus/Thanos APIs or schemas.
- Redefining platform vs user source detection beyond what is documented here.
- Enforcing a specific UI; this enhancement defines the model that UIs should follow.

### Workflow Description
1) Classify rules using CHA-derived matchers to compute `(layer, component)`.
2) If the classifier cannot determine a component, fall back to `component=<rule namespace>` and derive `layer` from source (platform → cluster, user → namespace).
3) Persist the computed classification per `PrometheusRule` and allow user overrides.
4) Enrich the Alerts API by correlating alerts to relabeled rules and applying the same fallback behavior.
5) The UI displays and filters by the effective classification.

### API Extensions
- No new Kubernetes API extensions (no new CRDs/webhooks/aggregated API servers).
- The console backend exposes/extends alerting APIs used by the UI:
  - GET `/api/v1/alerting/alerts`: Prometheus-compatible response with additive fields `alertRuleId`, `openshift_io_alert_component`, `openshift_io_alert_layer`.
  - PATCH `/api/v1/alerting/classification/{ruleId}`: Update the per-`alertRuleId` classification record stored in the per-`PrometheusRule` ConfigMap.
    - Must have (MVP): `openshift_io_alert_rule_component` and `openshift_io_alert_rule_layer`.
    - Should have (non-MVP): update `openshift_io_alert_rule_component_from`, `openshift_io_alert_rule_layer_from`, and `overridesByMatch`.
  - PATCH `/api/v1/alerting/classification`: Bulk update the same per-`alertRuleId` classification records.
    - Must have (MVP): `openshift_io_alert_rule_component` and `openshift_io_alert_rule_layer`.
    - Should have (non-MVP): update dynamic mapping and per-alert-instance overrides.
  - GET `/api/v1/alerting/rules`: read path that surfaces effective classification for rules (post-relabel), if provided by the backend.

Example payloads:

```json
{ "openshift_io_alert_rule_component": "kube-apiserver", "openshift_io_alert_rule_layer": "cluster" }
```

```json
{
  "items": [
    { "ruleId": "rid-1", "openshift_io_alert_rule_component": "kube-apiserver", "openshift_io_alert_rule_layer": "cluster" },
    { "ruleId": "rid-2", "openshift_io_alert_rule_component": "ns-a", "openshift_io_alert_rule_layer": "namespace" }
  ]
}
```

RBAC (high level):
- Platform stack: only users who can update alerting definitions in the platform stack should be allowed to update classification for platform `alertRuleId`s.
- User workload monitoring: only users with edit permissions in the workload namespace containing the PrometheusRule should be allowed to update classification for those `alertRuleId`s.
- The backend API is responsible for enforcing this policy before persisting updates to the ConfigMap.

## Terminology
- component: Logical owner of the alert or rule (e.g., `kube-apiserver`, `etcd`, a namespace, a team).
- layer: Impact scope. Allowed values: `cluster`, `namespace`.
- source: Origin of the rule/alert. Either `platform` (cluster monitoring stack) or `user` (User workload monitoring).
- platform stack: The `openshift-monitoring` stack managed by Red Hat–supported operators.
- user stack (User workload monitoring): User monitoring stack for application namespaces.

## Mapping Logic
### Primary Mapping (Classifier)
- The backend uses a classifier (CHA-derived matchers) to compute a `(layer, component)` tuple from rule/alert labels.
- Typical mappings:
  - Core control-plane components → `layer=cluster`, `component=<cp-subsystem>`.
  - Node/compute-related → `layer=cluster`, `component=compute`.
  - Workload/namespace-level alerts → `layer=namespace`.

Rule scoped default classification labels:
- If an alerting rule includes `openshift_io_alert_rule_component` and/or `openshift_io_alert_rule_layer`, the backend uses those values as the default `component` and `layer` for alerts emitted by that rule, unless a user override replaces them.

Dynamic mapping configuration stored in the per‑PrometheusRule ConfigMap:
- If `openshift_io_alert_rule_component_from` is present in the per‑PrometheusRule ConfigMap entry for the matching `alertRuleId`, the backend derives the alert `component` from the specified alert label key at request time (opt-in dynamic mapping). Initial allowed values: `name` and `component`.
- If `openshift_io_alert_rule_layer_from` is present in the per‑PrometheusRule ConfigMap entry for the matching `alertRuleId`, the backend derives the alert `layer` from the specified alert label key at request time (opt-in dynamic mapping). Initial allowed values: `layer`. Invalid values are ignored and normal mapping continues.
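
A minimal sketch of the opt-in dynamic mapping described above (illustrative Python, not the backend implementation; the function name and allowlist constants are assumptions):

```python
# Initial allowlists per this proposal; keys outside them are ignored
# so that normal mapping continues.
COMPONENT_FROM_ALLOWED = {"name", "component"}
LAYER_FROM_ALLOWED = {"layer"}

def dynamic_value(from_key, allowed, alert_labels):
    """Derive a classification value from a runtime alert label, but only
    when the configured label key is on the allowlist."""
    if from_key in allowed:
        return alert_labels.get(from_key)
    return None
```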

### Precedence and matching semantics
The backend should compute the effective classification using this order, highest priority first:
1) Per alert instance overrides.
2) Explicit user override fields in the ConfigMap entry: `openshift_io_alert_rule_component`, `openshift_io_alert_rule_layer`.
3) Dynamic mapping (should have, non-MVP): `openshift_io_alert_rule_component_from`, `openshift_io_alert_rule_layer_from` (only for fields not explicitly set in step 2).
4) Known-family dynamic mapping implemented in code (for example CVO `name`), where applicable.
5) Rule-scoped default labels on the alerting rule: `openshift_io_alert_rule_component`, `openshift_io_alert_rule_layer`.
6) Classifier-based mapping (CHA-derived matchers).
7) Fallback mapping from source and namespace when component is unknown.
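
The precedence order can be sketched as a first-set walk over candidate sources. This is an illustrative Python sketch under the rules above, not the backend implementation; steps 1 (per-alert-instance overrides) and 4 (known-family mapping) are omitted for brevity, and all names are assumptions:

```python
def first_set(*candidates):
    """Return the first truthy candidate, else None."""
    for value in candidates:
        if value:
            return value
    return None

def effective_classification(cm_entry, rule_labels, alert_labels,
                             classifier, source, rule_namespace):
    """Walk the precedence chain, highest priority first."""
    component = first_set(
        cm_entry.get("openshift_io_alert_rule_component"),                             # 2) explicit override
        alert_labels.get(cm_entry.get("openshift_io_alert_rule_component_from", "")),  # 3) dynamic mapping
        rule_labels.get("openshift_io_alert_rule_component"),                          # 5) rule-scoped default
        classifier.get("component"),                                                   # 6) classifier
    )
    layer = first_set(
        cm_entry.get("openshift_io_alert_rule_layer"),
        alert_labels.get(cm_entry.get("openshift_io_alert_rule_layer_from", "")),
        rule_labels.get("openshift_io_alert_rule_layer"),
        classifier.get("layer"),
    )
    # 7) fallback when component is unknown: component = rule namespace,
    #    layer derived from source (platform -> cluster, user -> namespace)
    if not component or component == "Others":
        component = rule_namespace
    if not layer:
        layer = "cluster" if source == "platform" else "namespace"
    return component, layer
```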

Matcher semantics for `overridesByMatch` (should have, non-MVP):
- **Exact match only**: each `match` is a map of label key to value.
- **AND across keys**: a matcher matches when all key/value pairs are present on the alert.
- **First match wins**: evaluate the list in order. Authors should place more specific matchers earlier.

Example (optional schema extension within a single `alertRuleId` entry):
```yaml
<alertRuleIdA>:
  openshift_io_alert_rule_component_from: name
  openshift_io_alert_rule_layer: cluster
  overridesByMatch:
    - match:
        name: kube-apiserver
      openshift_io_alert_rule_component: kube-apiserver
    - match:
        name: etcd
      openshift_io_alert_rule_component: etcd
```
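
Under these semantics, matcher evaluation reduces to a short loop (illustrative sketch; the function name is an assumption):

```python
def apply_overrides_by_match(alert_labels, overrides):
    """Exact match only, AND across keys, first match wins: return the
    override fields of the first entry whose `match` pairs all equal the
    alert's labels, or an empty dict when nothing matches."""
    for entry in overrides:
        match = entry.get("match", {})
        if match and all(alert_labels.get(k) == v for k, v in match.items()):
            return {k: v for k, v in entry.items() if k not in ("id", "match")}
    return {}
```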

### Alert-specific dynamic classification (examples: CVO alerts)
Some alert families cannot be classified purely from the alerting rule definition because the component is derived from runtime alert labels that vary per alert instance. A common example is Cluster Version Operator alerts, where the component is derived from the alert label `name`, which identifies the ClusterOperator.

For these alerts, the backend should compute component and layer per alert instance using alert labels, even if the underlying rule has a static classification.

Example logic:
- If `alertname` is `ClusterOperatorDown` or `ClusterOperatorDegraded`
- `layer = cluster`
- `component = <labels.name>`, and if `name` is missing then use `component = version`

This matches the cluster-health-analyzer approach and enables “dynamic” per alert component mapping without requiring users to split rules.
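
The example logic above can be expressed directly (an illustrative sketch under the stated rules, not the cluster-health-analyzer code):

```python
def classify_cvo_alert(alert_labels):
    """Per-alert-instance classification for ClusterOperator alerts:
    layer is always cluster; component comes from the `name` label,
    falling back to `version` when it is missing. Returns None for
    alerts outside this family."""
    if alert_labels.get("alertname") in ("ClusterOperatorDown", "ClusterOperatorDegraded"):
        return {"layer": "cluster",
                "component": alert_labels.get("name") or "version"}
    return None
```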

### Fallback Mapping (When component is unknown)
If the classifier returns an empty component or `Others`:
- `component = <PrometheusRule namespace>`
- `layer` is derived from `source`:
- `platform` → `cluster`
- `user` → `namespace`

Notes:
- The backend no longer generates an empty `layer`. Generated values are always one of `cluster|namespace`.


### Source Determination
- For rules: a rule is considered `platform` if it belongs to the cluster monitoring namespace (`openshift-monitoring`). Otherwise it is `user`.
- For alerts: considered `platform` when either:
- `openshift_io_alert_source == platform`, or
- `prometheus` label is prefixed with `openshift-monitoring/`.
Otherwise `user`.
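
These rules can be sketched as two small predicates (illustrative; function names are assumptions):

```python
def rule_source(prometheusrule_namespace):
    """Rules in the cluster monitoring namespace are platform; others are user."""
    return "platform" if prometheusrule_namespace == "openshift-monitoring" else "user"

def alert_source(labels):
    """An alert is platform when explicitly labeled so, or when its
    `prometheus` label is prefixed with `openshift-monitoring/`."""
    if labels.get("openshift_io_alert_source") == "platform":
        return "platform"
    if labels.get("prometheus", "").startswith("openshift-monitoring/"):
        return "platform"
    return "user"
```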

## Persistence and Overrides
### Per‑PrometheusRule ConfigMap
- Name: `alertrule-classification-<prometheusrule-name>`
- Namespace: same namespace as the `PrometheusRule`
- OwnerReference: points to the `PrometheusRule`
- Annotation: a stable signature used for traceability
- Data key: `alert-rule-classification.yaml`
- Value: YAML map from `alertRuleId` → object:
  - `openshift_io_alert_rule_component: <string>`
  - `openshift_io_alert_rule_layer: <string>`
  - `errors: [ ... ]` (optional: set when validation fails)
- Should have (non-MVP):
  - `openshift_io_alert_rule_component_from: <string>` to derive `component` from a runtime alert label (opt-in dynamic mapping)
  - `openshift_io_alert_rule_layer_from: <string>` to derive `layer` from a runtime alert label (opt-in dynamic mapping)
  - `overridesByMatch: [ ... ]` for per-alert-instance overrides based on alert labels

Schema notes:
- `openshift_io_alert_rule_component` and `openshift_io_alert_rule_layer` are the externally stored, rule-scoped defaults.
- `openshift_io_alert_rule_component_from` and `openshift_io_alert_rule_layer_from` are optional dynamic mapping, and should be restricted to an allowlist of label keys.
- `overridesByMatch` is an optional list; each entry should have:
  - `id: <string>` (recommended for stable updates)
  - `match: { <labelKey>: <value>, ... }`
  - `openshift_io_alert_rule_component: <string>` and/or `openshift_io_alert_rule_layer: <string>`

Example:
```yaml
<alertRuleIdA>:
  openshift_io_alert_rule_component: kube-apiserver
  openshift_io_alert_rule_layer: cluster
  # should have (non-MVP): derive component dynamically from an alert label key
  openshift_io_alert_rule_component_from: name
  # should have (non-MVP): per-alert-instance overrides similar to silences
  overridesByMatch:
    - id: "ovr-1"
      match:
        name: kube-apiserver
      openshift_io_alert_rule_component: kube-apiserver
    - id: "ovr-2"
      match:
        name: etcd
      openshift_io_alert_rule_component: etcd
<alertRuleIdB>:
  openshift_io_alert_rule_component: ns-a
  openshift_io_alert_rule_layer: namespace
```

### User Overrides
- Users may override `openshift_io_alert_rule_component` and/or `openshift_io_alert_rule_layer` by editing the same ConfigMap.
- Validation:
- `openshift_io_alert_rule_component`: non-empty, 1–253 chars, `[A-Za-z0-9._-]`, must start/end alphanumeric.
- `openshift_io_alert_rule_layer`: one of `cluster|namespace`.
- Invalid overrides are preserved but annotated with `errors` and ignored for effective values.
- Unknown `alertRuleId` entries are ignored.
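
A sketch of the validation rules above, assuming a straightforward regex encoding of the component constraints (names and error strings are illustrative):

```python
import re

LAYER_VALUES = {"cluster", "namespace"}
# Non-empty, 1-253 chars of [A-Za-z0-9._-], alphanumeric at both ends.
COMPONENT_RE = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9._-]{0,251}[A-Za-z0-9])?$")

def validate_override(entry):
    """Return a list of validation errors; an empty list means valid.
    Invalid entries would be preserved but annotated with `errors` and
    ignored when computing effective values."""
    errors = []
    component = entry.get("openshift_io_alert_rule_component")
    layer = entry.get("openshift_io_alert_rule_layer")
    if component is not None and not COMPONENT_RE.match(component):
        errors.append("invalid component: 1-253 chars [A-Za-z0-9._-], alphanumeric at both ends")
    if layer is not None and layer not in LAYER_VALUES:
        errors.append("invalid layer: must be one of cluster|namespace")
    return errors
```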

## Alerts API Enrichment
- Endpoint aligns with Prometheus `/api/v1/alerts` and adds fields (additive):
  - `alertRuleId`
  - `openshift_io_alert_component`
  - `openshift_io_alert_layer`
- Classification for alerts is computed by correlating alerts to relabeled rules and using the effective rule classification as a default. For alert families that require dynamic classification (for example CVO alerts), the backend computes `component` and `layer` per alert instance from alert labels and uses that result. When correlation fails, the fallback mapping above applies and derives `layer` from `source`.

Mechanisms to achieve dynamic classification for specific alerts:
- Backend runtime mapping: compute component and layer from alert labels at request time, for example CVO alerts using `name`.
- Dynamic classification is implemented in the backend mapping logic, not via relabeling.

Notes:
- When opt-in dynamic mapping is configured on a rule, the backend can derive effective values per alert instance and populate `openshift_io_alert_component` and `openshift_io_alert_layer` for that alert.
- The backend should not add unprefixed `component` or `layer` labels to alerts, to avoid clashing with existing user labels. Use the prefixed `openshift_io_alert_component` and `openshift_io_alert_layer` fields.

## UI Alignment
- Columns for both Alerts and Alerting Rules should include `Layer` and `Component`.
- Filters should include `Layer (cluster|namespace)` and `Source (platform|user)`.
- Creation/edit flows should allow choosing `layer` from the allowed set; `component` is free-form (validated).
- An admin-facing “Manage layers” section can describe the meaning of layers:
  - Cluster: control plane, cluster-wide components (API server, etcd, network, …)
  - Namespace: workloads and components scoped to a project/namespace

### Implementation Details/Notes/Constraints
- Classification is computed server-side using CHA-derived matchers and persisted per `PrometheusRule` in a namespaced ConfigMap. Users may override `openshift_io_alert_rule_component` and `openshift_io_alert_rule_layer` in the same ConfigMap with validation.
- Alerts are enriched additively (Prometheus-compatible), correlating to relabeled rules where possible and applying source-based defaults on fallback.
- No new CRDs or aggregated API servers are introduced; standard RBAC applies.

### Topology Considerations
#### Hypershift / Hosted Control Planes


#### Standalone Clusters

#### Single-node Deployments or MicroShift

#### OpenShift Kubernetes Engine

## Upgrade / Downgrade Strategy
- User overrides remain intact; only invalid values are annotated with `errors`.

## Test Plan (High Level)
- Unit tests:
  - Unknown component fallback for user rules → `layer=namespace`, `component=<rule ns>`.
  - Unknown component fallback for platform rules → `layer=cluster`, `component=<rule ns>`.
  - Valid overrides are merged. Invalid overrides are recorded in `errors` and ignored.
  - Signature annotation stored and updated deterministically.
- Integration/e2e (as available):
  - ConfigMap creation/update on rule changes.
  - Alerts API includes additive fields and respects relabel configs.

### Risks and Mitigations
- Misclassification by classifier: mitigated by clear overrides and validation paths.
- Drift between docs and implementation: mitigated by this enhancement and regular verification in tests.
- Client assumptions about additional `layer` values: documented allowed set and guidance to pass through unknown values without interpretation.

### Drawbacks
- Additional reconciliation and ConfigMap writes on rule changes.
- Classifier rules require maintenance as platform components evolve.

## Alternatives (Not Implemented)
- Setting the labels with an alertRelabelConfig CR for all alerts; this approach cannot cover operator alerts in user workload monitoring.
- Introduce a dedicated classification CRD (adds operational overhead with limited benefit).
- Compute classification only in the UI (duplicates logic, hard to validate).

## Graduation Criteria

### Dev Preview -> Tech Preview
- End-to-end classification (compute classification, persist, enrich) with unit tests and docs.
- UI consumes `component`/`layer` for display and filtering.

### Tech Preview -> GA
- Full test coverage (upgrade/downgrade/scale).
- Stable defaulting across supported topologies (standalone, Hypershift, SNO/MicroShift).

### Removing a deprecated feature
- If the classifier or persistence format changes, document migration and keep backward compatibility for one minor release.

## Version Skew Strategy
- Server-side enrichment ensures older/newer UIs receive consistent fields. Unknown `layer` values must be passed through and displayed as-is.

## Operational Aspects of API Extensions
- No new API extensions are introduced. OwnerReferences ensure GC of ConfigMaps. Failures surface in controller logs.

## Support Procedures
- Verify `alertrule-classification-<prometheusrule>` ConfigMaps and their `errors` fields.
- Check controller logs for validation failures.
- Confirm alert `prometheus` or `openshift_io_alert_source` labels for source detection.

## Open Questions