
Commit 35d8b25

pavolloffay authored and max-cx committed
Add span RED alerting docs
Signed-off-by: Pavol Loffay <p.loffay@gmail.com>
1 parent 1b331a1 commit 35d8b25

1 file changed, +52 -0 lines changed

modules/distr-tracing-tempo-config-spanmetrics.adoc

Lines changed: 52 additions & 0 deletions
@@ -92,3 +92,55 @@ spec:
----
<1> Enables the monitoring tab in the Jaeger console.
<2> The service name for Thanos Querier from user-workload monitoring.

== Span RED metrics and alerting rules

You can use the metrics that the `spanmetrics` connector generates in alerting rules. For example, to alert on a slow service or to define service level objectives (SLOs), use the `duration_bucket` histogram and the `calls` counter metric that the connector creates. These metrics have labels that identify the service, API name, operation type, and other attributes.

.Labels of the metrics created by the `spanmetrics` connector
[options="header"]
[cols="l, a, a"]
|===
|Label |Description |Values

|service_name
|Service name set by the `OTEL_SERVICE_NAME` environment variable.
|`frontend`

|span_name
|Name of the operation.
|
* `/`
* `/customer`

|span_kind
|Identifies the server, client, messaging, or internal operation.
|
* `SPAN_KIND_SERVER`
* `SPAN_KIND_CLIENT`
* `SPAN_KIND_PRODUCER`
* `SPAN_KIND_CONSUMER`
* `SPAN_KIND_INTERNAL`

|===

.Example PrometheusRule CR that defines an SLO alerting rule, which fires when the front-end service does not serve 95% of requests within 2000 ms
[source,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: span-red
spec:
  groups:
  - name: server-side-latency
    rules:
    - alert: SpanREDFrontendAPIRequestLatency
      expr: histogram_quantile(0.95, sum(rate(duration_bucket{service_name="frontend", span_kind="SPAN_KIND_SERVER"}[5m])) by (le, service_name, span_name)) > 2000 # <1>
      labels:
        severity: Warning
      annotations:
        summary: "High request latency on {{$labels.service_name}} and {{$labels.span_name}}"
        # The expression aggregates by service_name and span_name, so the instance label is not available here.
        description: "{{$labels.service_name}} has 95th percentile request latency above 2s (current value: {{$value}}ms)"
----
<1> The expression for checking if 95% of the front-end server response time values are below 2000 ms. The time range (`[5m]`) must be at least four times the scrape interval and long enough to accommodate a change in the metric.
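
The `calls` counter can back an error-rate alert in a similar way. The following is a minimal sketch under stated assumptions: it assumes that the connector also attaches a `status_code` label with the value `STATUS_CODE_ERROR` to failed spans, which is not listed in the table above, and the resource name, group name, alert name, 5% threshold, and `for` duration are hypothetical values chosen for illustration.

.Example PrometheusRule CR sketch that alerts when more than 5% of front-end server requests fail
[source,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: span-red-errors # hypothetical name
spec:
  groups:
  - name: server-side-errors # hypothetical group name
    rules:
    - alert: SpanREDFrontendAPIErrorRate # hypothetical alert name
      # Ratio of failed calls to all calls per operation over the last 5 minutes.
      # Assumes the status_code label and its STATUS_CODE_ERROR value exist on the calls metric.
      expr: |
        sum(rate(calls{service_name="frontend", span_kind="SPAN_KIND_SERVER", status_code="STATUS_CODE_ERROR"}[5m])) by (service_name, span_name)
        /
        sum(rate(calls{service_name="frontend", span_kind="SPAN_KIND_SERVER"}[5m])) by (service_name, span_name)
        > 0.05
      # Require the condition to hold for 5 minutes before firing, to reduce noise from short spikes.
      for: 5m
      labels:
        severity: Warning
      annotations:
        summary: "High error rate on {{$labels.service_name}} and {{$labels.span_name}}"
        description: "{{$labels.service_name}} fails more than 5% of requests (current value: {{$value | humanizePercentage}})"
----

Both sides of the division aggregate by the same labels (`service_name`, `span_name`), so the ratio is computed per operation. An operation with no failed calls in the window produces no result and therefore does not fire the alert.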
