---
title: alert-rule-classification-mapping
authors:
- "@sradco"
reviewers:
- "@jan--f"
- "@jgbernalp"
approvers:
- "@jan--f"
- "@jgbernalp"
api-approvers:
- TBD
creation-date: 2026-01-07
last-updated: 2026-01-19
tracking-link:
- ""
---
# Alert Rule Classification Mapping and Layer Defaults

## Summary
This enhancement defines how OpenShift monitoring assigns each alerting rule (and the alerts it emits) a stable, user-visible classification: a `component` and an impact `layer`. It documents the defaulting and fallback behavior, how users can override classification through a persisted ConfigMap record, and how the Alerts API exposes the effective values in a Prometheus-compatible response for UI filtering and display.

Related enhancement: [Alerts UI Management](alerts-ui-management.md)

## Motivation
- Enable computation of component and cluster-level health to help the UI prioritize clusters needing attention, highlight failing components, and speed troubleshooting in single and multi-cluster views.
- Provide a single source of truth for classification behavior across backend and UI.
- Ensure consistent and stable `component` and `layer` values for alerts and rules.
- Eliminate historical ambiguity where `layer` could be empty, by defining defaulting rules.
- Document how user overrides are validated and persisted to enable GitOps-friendly workflows.

## Proposal

### User Stories
- As a cluster admin, I want consistent component and layer on alerts/rules so I can prioritize clusters needing attention and identify failing components quickly.
- As a platform SRE, I want platform alerts to default to the cluster layer so global issues are clearly surfaced.
- As an application owner, I want workload alerts to default to the namespace layer so I can filter and troubleshoot within my project scope.

### Goals
1. Define allowed `layer` values and defaulting rules.
2. Document classifier-based mapping and fallback behavior.
3. Standardize persistence and override schema in a per-`PrometheusRule` ConfigMap.
4. Specify API enrichment fields for alerts and rules, and expected UI filters/columns.
5. Maintain compatibility with Prometheus/Thanos schemas (additive enrichment only).

### Non-Goals
- Changing upstream Prometheus/Thanos APIs or schemas.
- Redefining platform vs user source detection beyond what is documented here.
- Enforcing a specific UI; this enhancement defines the model that UIs should follow.

### Workflow Description
1) Classify rules using CHA-derived matchers to compute `(layer, component)`.
2) If the classifier cannot determine a component, fall back to `component=<rule namespace>` and derive `layer` from source (platform → cluster, user → namespace).
3) Persist the computed classification per `PrometheusRule` and allow user overrides.
4) Enrich the Alerts API by correlating alerts to relabeled rules and applying the same fallback behavior.
5) The UI displays and filters by the effective classification.

### API Extensions
- No new Kubernetes API extensions (no new CRDs/webhooks/aggregated API servers).
- The console backend exposes/extends alerting APIs used by the UI:
  - GET `/api/v1/alerting/alerts`: Prometheus-compatible response with additive fields `alertRuleId`, `openshift_io_alert_component`, `openshift_io_alert_layer`.
  - PATCH `/api/v1/alerting/classification/{ruleId}`: Update the per-`alertRuleId` classification record stored in the per-`PrometheusRule` ConfigMap.
    - Must have (MVP): `openshift_io_alert_rule_component` and `openshift_io_alert_rule_layer`.
    - Should have (non-MVP): update `openshift_io_alert_rule_component_from`, `openshift_io_alert_rule_layer_from`, and `overridesByMatch`.
  - PATCH `/api/v1/alerting/classification`: Bulk update the same per-`alertRuleId` classification records.
    - Must have (MVP): `openshift_io_alert_rule_component` and `openshift_io_alert_rule_layer`.
    - Should have (non-MVP): update dynamic mapping and per-alert-instance overrides.
  - GET `/api/v1/alerting/rules`: read path that surfaces effective classification for rules (post-relabel), if provided by the backend.

Example payloads:

```json
{ "openshift_io_alert_rule_component": "kube-apiserver", "openshift_io_alert_rule_layer": "cluster" }
```

```json
{
  "items": [
    { "ruleId": "rid-1", "openshift_io_alert_rule_component": "kube-apiserver", "openshift_io_alert_rule_layer": "cluster" },
    { "ruleId": "rid-2", "openshift_io_alert_rule_component": "ns-a", "openshift_io_alert_rule_layer": "namespace" }
  ]
}
```

RBAC (high level):
- Platform stack: only users who can update alerting definitions in the platform stack should be allowed to update classification for platform `alertRuleId`s.
- User workload monitoring: only users with edit permissions in the workload namespace containing the PrometheusRule should be allowed to update classification for those `alertRuleId`s.
- The backend API is responsible for enforcing this policy before persisting updates to the ConfigMap.

## Terminology
- component: Logical owner of the alert or rule (e.g., `kube-apiserver`, `etcd`, a namespace, a team).
- layer: Impact scope. Allowed values: `cluster`, `namespace`.
- source: Origin of the rule/alert. Either `platform` (cluster monitoring stack) or `user` (User workload monitoring).
- platform stack: The `openshift-monitoring` stack managed by Red Hat–supported operators.
- user stack (User workload monitoring): User monitoring stack for application namespaces.

## Mapping Logic
### Primary Mapping (Classifier)
- The backend uses a classifier (CHA-derived matchers) to compute a `(layer, component)` tuple from rule/alert labels.
- Typical mappings:
  - Core control-plane components → `layer=cluster`, `component=<cp-subsystem>`.
  - Node/compute-related → `layer=cluster`, `component=compute`.
  - Workload/namespace-level alerts → `layer=namespace`.

Rule scoped default classification labels:
- If an alerting rule includes `openshift_io_alert_rule_component` and/or `openshift_io_alert_rule_layer`, the backend uses those values as the default `component` and `layer` for alerts emitted by that rule, unless a user override replaces them.

Dynamic mapping configuration stored in the per‑PrometheusRule ConfigMap:
- If `openshift_io_alert_rule_component_from` is present in the per‑PrometheusRule ConfigMap entry for the matching `alertRuleId`, the backend derives the alert `component` from the specified alert label key at request time (opt-in dynamic mapping). Initial allowed values: `name` and `component`.
- If `openshift_io_alert_rule_layer_from` is present in the per‑PrometheusRule ConfigMap entry for the matching `alertRuleId`, the backend derives the alert `layer` from the specified alert label key at request time (opt-in dynamic mapping). Initial allowed values: `layer`. Invalid values are ignored and normal mapping continues.
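
A minimal sketch of the opt-in dynamic mapping described above (illustrative Python, not the backend implementation; the function name and allowlist constants are assumptions):

```python
# Initial allowlists per this proposal; keys outside them are ignored
# so that normal mapping continues.
COMPONENT_FROM_ALLOWED = {"name", "component"}
LAYER_FROM_ALLOWED = {"layer"}

def dynamic_value(from_key, allowed, alert_labels):
    """Derive a classification value from a runtime alert label, but only
    when the configured label key is on the allowlist."""
    if from_key in allowed:
        return alert_labels.get(from_key)
    return None
```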

### Precedence and matching semantics
The backend should compute the effective classification using this order, highest priority first:
1) Per alert instance overrides.
2) Explicit user override fields in the ConfigMap entry: `openshift_io_alert_rule_component`, `openshift_io_alert_rule_layer`.
3) Dynamic mapping (should have, non-MVP): `openshift_io_alert_rule_component_from`, `openshift_io_alert_rule_layer_from` (only for fields not explicitly set in step 2).
4) Known-family dynamic mapping implemented in code (for example CVO `name`), where applicable.
5) Rule-scoped default labels on the alerting rule: `openshift_io_alert_rule_component`, `openshift_io_alert_rule_layer`.
6) Classifier-based mapping (CHA-derived matchers).
7) Fallback mapping from source and namespace when component is unknown.
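
The precedence order can be sketched as a first-set walk over candidate sources. This is an illustrative Python sketch under the rules above, not the backend implementation; steps 1 (per-alert-instance overrides) and 4 (known-family mapping) are omitted for brevity, and all names are assumptions:

```python
def first_set(*candidates):
    """Return the first truthy candidate, else None."""
    for value in candidates:
        if value:
            return value
    return None

def effective_classification(cm_entry, rule_labels, alert_labels,
                             classifier, source, rule_namespace):
    """Walk the precedence chain, highest priority first."""
    component = first_set(
        cm_entry.get("openshift_io_alert_rule_component"),                             # 2) explicit override
        alert_labels.get(cm_entry.get("openshift_io_alert_rule_component_from", "")),  # 3) dynamic mapping
        rule_labels.get("openshift_io_alert_rule_component"),                          # 5) rule-scoped default
        classifier.get("component"),                                                   # 6) classifier
    )
    layer = first_set(
        cm_entry.get("openshift_io_alert_rule_layer"),
        alert_labels.get(cm_entry.get("openshift_io_alert_rule_layer_from", "")),
        rule_labels.get("openshift_io_alert_rule_layer"),
        classifier.get("layer"),
    )
    # 7) fallback when component is unknown: component = rule namespace,
    #    layer derived from source (platform -> cluster, user -> namespace)
    if not component or component == "Others":
        component = rule_namespace
    if not layer:
        layer = "cluster" if source == "platform" else "namespace"
    return component, layer
```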

Matcher semantics for `overridesByMatch` (should have, non-MVP):
- **Exact match only**: each `match` is a map of label key to value.
- **AND across keys**: a matcher matches when all key/value pairs are present on the alert.
- **First match wins**: evaluate the list in order. Authors should place more specific matchers earlier.

Example (optional schema extension within a single `alertRuleId` entry):
```yaml
<alertRuleIdA>:
  openshift_io_alert_rule_component_from: name
  openshift_io_alert_rule_layer: cluster
  overridesByMatch:
    - match:
        name: kube-apiserver
      openshift_io_alert_rule_component: kube-apiserver
    - match:
        name: etcd
      openshift_io_alert_rule_component: etcd
```
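
Under these semantics, matcher evaluation reduces to a short loop (illustrative sketch; the function name is an assumption):

```python
def apply_overrides_by_match(alert_labels, overrides):
    """Exact match only, AND across keys, first match wins: return the
    override fields of the first entry whose `match` pairs all equal the
    alert's labels, or an empty dict when nothing matches."""
    for entry in overrides:
        match = entry.get("match", {})
        if match and all(alert_labels.get(k) == v for k, v in match.items()):
            return {k: v for k, v in entry.items() if k not in ("id", "match")}
    return {}
```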

### Alert-specific dynamic classification (examples: CVO alerts)
Some alert families cannot be classified purely from the alerting rule definition because the component is derived from runtime alert labels that vary per alert instance. A common example is Cluster Version Operator alerts, where the component is derived from the alert label `name`, which identifies the ClusterOperator.

For these alerts, the backend should compute component and layer per alert instance using alert labels, even if the underlying rule has a static classification.

Example logic:
- If `alertname` is `ClusterOperatorDown` or `ClusterOperatorDegraded`
- `layer = cluster`
- `component = <labels.name>`, and if `name` is missing then use `component = version`

This matches the cluster-health-analyzer approach and enables “dynamic” per alert component mapping without requiring users to split rules.
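
The example logic above can be expressed directly (an illustrative sketch under the stated rules, not the cluster-health-analyzer code):

```python
def classify_cvo_alert(alert_labels):
    """Per-alert-instance classification for ClusterOperator alerts:
    layer is always cluster; component comes from the `name` label,
    falling back to `version` when it is missing. Returns None for
    alerts outside this family."""
    if alert_labels.get("alertname") in ("ClusterOperatorDown", "ClusterOperatorDegraded"):
        return {"layer": "cluster",
                "component": alert_labels.get("name") or "version"}
    return None
```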

### Fallback Mapping (When component is unknown)
If the classifier returns an empty component or `Others`:
- `component = <PrometheusRule namespace>`
- `layer` is derived from `source`:
- `platform` → `cluster`
- `user` → `namespace`

Notes:
- The backend no longer generates an empty `layer`. Generated values are always one of `cluster|namespace`.


### Source Determination
- For rules: a rule is considered `platform` if it belongs to the cluster monitoring namespace (`openshift-monitoring`). Otherwise it is `user`.
- For alerts: considered `platform` when either:
- `openshift_io_alert_source == platform`, or
- `prometheus` label is prefixed with `openshift-monitoring/`.
Otherwise `user`.
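
These rules can be sketched as two small predicates (illustrative; function names are assumptions):

```python
def rule_source(prometheusrule_namespace):
    """Rules in the cluster monitoring namespace are platform; others are user."""
    return "platform" if prometheusrule_namespace == "openshift-monitoring" else "user"

def alert_source(labels):
    """An alert is platform when explicitly labeled so, or when its
    `prometheus` label is prefixed with `openshift-monitoring/`."""
    if labels.get("openshift_io_alert_source") == "platform":
        return "platform"
    if labels.get("prometheus", "").startswith("openshift-monitoring/"):
        return "platform"
    return "user"
```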

## Persistence and Overrides
### Per‑PrometheusRule ConfigMap
- Name: `alertrule-classification-<prometheusrule-name>`
- Namespace: same namespace as the `PrometheusRule`
- OwnerReference: points to the `PrometheusRule`
- Annotation: a stable signature used for traceability
- Data key: `alert-rule-classification.yaml`
- Value: YAML map from `alertRuleId` → object:
  - `openshift_io_alert_rule_component: <string>`
  - `openshift_io_alert_rule_layer: <string>`
  - `errors: [ ... ]` (optional: set when validation fails)
- Should have (non-MVP):
  - `openshift_io_alert_rule_component_from: <string>` to derive `component` from a runtime alert label (opt-in dynamic mapping)
  - `openshift_io_alert_rule_layer_from: <string>` to derive `layer` from a runtime alert label (opt-in dynamic mapping)
  - `overridesByMatch: [ ... ]` for per-alert-instance overrides based on alert labels

Schema notes:
- `openshift_io_alert_rule_component` and `openshift_io_alert_rule_layer` are the externally stored, rule-scoped defaults.
- `openshift_io_alert_rule_component_from` and `openshift_io_alert_rule_layer_from` are optional dynamic mapping, and should be restricted to an allowlist of label keys.
- `overridesByMatch` is an optional list; each entry should have:
  - `id: <string>` (recommended for stable updates)
  - `match: { <labelKey>: <value>, ... }`
  - `openshift_io_alert_rule_component: <string>` and/or `openshift_io_alert_rule_layer: <string>`

Example:
```yaml
<alertRuleIdA>:
  openshift_io_alert_rule_component: kube-apiserver
  openshift_io_alert_rule_layer: cluster
  # should have (non-MVP): derive component dynamically from an alert label key
  openshift_io_alert_rule_component_from: name
  # should have (non-MVP): per-alert-instance overrides similar to silences
  overridesByMatch:
    - id: "ovr-1"
      match:
        name: kube-apiserver
      openshift_io_alert_rule_component: kube-apiserver
    - id: "ovr-2"
      match:
        name: etcd
      openshift_io_alert_rule_component: etcd
<alertRuleIdB>:
  openshift_io_alert_rule_component: ns-a
  openshift_io_alert_rule_layer: namespace
```

### User Overrides
- Users may override `openshift_io_alert_rule_component` and/or `openshift_io_alert_rule_layer` by editing the same ConfigMap.
- Validation:
- `openshift_io_alert_rule_component`: non-empty, 1–253 chars, `[A-Za-z0-9._-]`, must start/end alphanumeric.
- `openshift_io_alert_rule_layer`: one of `cluster|namespace`.
- Invalid overrides are preserved but annotated with `errors` and ignored for effective values.
- Unknown `alertRuleId` entries are ignored.
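
A sketch of the validation rules above, assuming a straightforward regex encoding of the component constraints (names and error strings are illustrative):

```python
import re

LAYER_VALUES = {"cluster", "namespace"}
# Non-empty, 1-253 chars of [A-Za-z0-9._-], alphanumeric at both ends.
COMPONENT_RE = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9._-]{0,251}[A-Za-z0-9])?$")

def validate_override(entry):
    """Return a list of validation errors; an empty list means valid.
    Invalid entries would be preserved but annotated with `errors` and
    ignored when computing effective values."""
    errors = []
    component = entry.get("openshift_io_alert_rule_component")
    layer = entry.get("openshift_io_alert_rule_layer")
    if component is not None and not COMPONENT_RE.match(component):
        errors.append("invalid component: 1-253 chars [A-Za-z0-9._-], alphanumeric at both ends")
    if layer is not None and layer not in LAYER_VALUES:
        errors.append("invalid layer: must be one of cluster|namespace")
    return errors
```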

## Alerts API Enrichment
- Endpoint aligns with Prometheus `/api/v1/alerts` and adds fields (additive):
  - `alertRuleId`
  - `openshift_io_alert_component`
  - `openshift_io_alert_layer`
- Classification for alerts is computed by correlating alerts to relabeled rules and using the effective rule classification as a default. For alert families that require dynamic classification (for example CVO alerts), the backend computes `component` and `layer` per alert instance from alert labels and uses that result. When correlation fails, the fallback mapping above applies and derives `layer` from `source`.

Mechanisms to achieve dynamic classification for specific alerts:
- Backend runtime mapping: compute component and layer from alert labels at request time, for example CVO alerts using `name`.
- Dynamic classification is implemented in the backend mapping logic, not via relabeling.

Notes:
- When opt-in dynamic mapping is configured on a rule, the backend can derive effective values per alert instance and populate `openshift_io_alert_component` and `openshift_io_alert_layer` for that alert.
- The backend should not add unprefixed `component` or `layer` labels to alerts, to avoid clashing with existing user labels. Use the prefixed `openshift_io_alert_component` and `openshift_io_alert_layer` fields.

## UI Alignment
- Columns for both Alerts and Alerting Rules should include `Layer` and `Component`.
- Filters should include `Layer (cluster|namespace)` and `Source (platform|user)`.
- Creation/edit flows should allow choosing `layer` from the allowed set; `component` is free-form (validated).
- An admin-facing “Manage layers” section can describe the meaning of layers:
  - Cluster: control plane, cluster-wide components (API server, etcd, network, …)
  - Namespace: workloads and components scoped to a project/namespace

### Implementation Details/Notes/Constraints
- Classification is computed server-side using CHA-derived matchers and persisted per `PrometheusRule` in a namespaced ConfigMap. Users may override `openshift_io_alert_rule_component` and `openshift_io_alert_rule_layer` in the same ConfigMap with validation.
- Alerts are enriched additively (Prometheus-compatible), correlating to relabeled rules where possible and applying source-based defaults on fallback.
- No new CRDs or aggregated API servers are introduced; standard RBAC applies.

### Topology Considerations
#### Hypershift / Hosted Control Planes


#### Standalone Clusters

#### Single-node Deployments or MicroShift

#### OpenShift Kubernetes Engine

## Upgrade / Downgrade Strategy
- User overrides remain intact; only invalid values are annotated with `errors`.

## Test Plan (High Level)
- Unit tests:
  - Unknown component fallback for user rules → `layer=namespace`, `component=<rule ns>`.
  - Unknown component fallback for platform rules → `layer=cluster`, `component=<rule ns>`.
  - Valid overrides are merged. Invalid overrides are recorded in `errors` and ignored.
  - Signature annotation stored and updated deterministically.
- Integration/e2e (as available):
  - ConfigMap creation/update on rule changes.
  - Alerts API includes additive fields and respects relabel configs.

### Risks and Mitigations
- Misclassification by classifier: mitigated by clear overrides and validation paths.
- Drift between docs and implementation: mitigated by this enhancement and regular verification in tests.
- Client assumptions about additional `layer` values: documented allowed set and guidance to pass through unknown values without interpretation.

### Drawbacks
- Additional reconciliation and ConfigMap writes on rule changes.
- Classifier rules require maintenance as platform components evolve.

## Alternatives (Not Implemented)
- Setting the labels with an alertRelabelConfig CR for all alerts; this approach cannot cover operator alerts in user workload monitoring.
- Introduce a dedicated classification CRD (adds operational overhead with limited benefit).
- Compute classification only in the UI (duplicates logic, hard to validate).

## Graduation Criteria

### Dev Preview -> Tech Preview
- End-to-end classification (compute classification, persist, enrich) with unit tests and docs.
- UI consumes `component`/`layer` for display and filtering.

### Tech Preview -> GA
- Full test coverage (upgrade/downgrade/scale).
- Stable defaulting across supported topologies (standalone, Hypershift, SNO/MicroShift).

### Removing a deprecated feature
- If the classifier or persistence format changes, document migration and keep backward compatibility for one minor release.

## Version Skew Strategy
- Server-side enrichment ensures older/newer UIs receive consistent fields. Unknown `layer` values must be passed through and displayed as-is.

## Operational Aspects of API Extensions
- No new API extensions are introduced. OwnerReferences ensure GC of ConfigMaps. Failures surface in controller logs.

## Support Procedures
- Verify `alertrule-classification-<prometheusrule>` ConfigMaps and their `errors` fields.
- Check controller logs for validation failures.
- Confirm alert `prometheus` or `openshift_io_alert_source` labels for source detection.

## Open Questions