Commit 05978ea

SLO and Error Budget monitoring (#1)
1 parent 68e84f4 commit 05978ea

File tree

9 files changed: +216 -101 lines changed


README.md

Lines changed: 154 additions & 45 deletions
@@ -1,15 +1,15 @@
 # Prometheus For Developers

-This is an introductory tutorial/workshop I created for telling the software
-developers in my [company](https://descomplica.com.br) the basics about
+This is an introductory tutorial I created to teach the software developers
+in my [company](https://descomplica.com.br) the basics of
 [Prometheus](https://prometheus.io).

 If you have any suggestions to improve this content, don't hesitate to contact
 me. Pull Requests are welcome!

 ## Table of Contents

-- [Workshop Project](#workshop-project)
+- [The Project](#the-project)
 - [Pre-Requisites](#pre-requisites)
 - [Running the Code](#running-the-code)
 - [Cleaning Up](#cleaning-up)
@@ -24,13 +24,14 @@ me. Pull Requests are welcome!
 - [Quantile Estimation Errors](#quantile-estimation-errors)
 - [Measuring Throughput](#measuring-throughput)
 - [Measuring Memory/CPU Usage](#measuring-memorycpu-usage)
+- [Measuring SLOs and Error Budgets](#measuring-slos-and-error-budgets)
 - [Monitoring Applications Without a Metrics Endpoint](#monitoring-applications-without-a-metrics-endpoint)
 - [Final Gotchas](#final-gotchas)
 - [References](#references)

-## Workshop Project
+## The Project

-This workshop follows a more practical approach (with hopefully just the
+This tutorial follows a more practical approach (with hopefully just the
 right amount of theory!), so we provide a simple Docker Compose configuration
 for simplifying the project bootstrap.

@@ -311,29 +312,27 @@ to the correct receiver integration (i.e. email, Slack, PagerDuty,
 OpsGenie). It also takes care of silencing and inhibition of alerts.

 Configuring Alertmanager to send notifications to PagerDuty, or Slack, or whatever,
-is out of the scope of this workshop, but we can still play around with alerts.
+is out of the scope of this tutorial, but we can still play around with alerts.

-Let's define our first alerting rule in
+We already have the following alerting rule defined in
 `config/prometheus/prometheus.rules.yml`:

 ```yaml
-# Uptime alerting rule
 groups:
-- name: uptime
-  rules:
-  - alert: ServerDown
-    expr: up == 0
-    for: 1m
-    labels:
-      severity: page
-    annotations:
-      summary: One or more targets are down
-      description: Instance {{ $labels.instance }} of {{ $labels.job }} is down
+  - name: uptime
+    rules:
+      # Uptime alerting rule
+      # Ref: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
+      - alert: ServerDown
+        expr: up == 0
+        for: 1m
+        labels:
+          severity: page
+        annotations:
+          summary: One or more targets are down
+          description: Instance {{ $labels.instance }} of {{ $labels.job }} is down
 ```
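+
+You can check the rule's current state directly from the expression browser
+(a quick sketch; `ALERTS` is a synthetic series Prometheus maintains for
+pending and firing alerts):
+
+```sh
+# Instances of the ServerDown rule that are currently firing
+ALERTS{alertname="ServerDown", alertstate="firing"}
+```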

-Restart Prometheus with `docker-compose restart prometheus` and open the Alerts
-page at <http://localhost:9090/alerts>.
-
 ![Prometheus alerts](./img/prometheus-alerts-1.png)

 Each alerting rule in Prometheus is also a time series, so in this case you can
@@ -363,7 +362,7 @@ should go back to a green state after a few seconds.

 ## Instrumenting Your Applications

-Let's examine a sample Node.js application we created for this workshop.
+Let's examine a sample Node.js application we created for this tutorial.

 Open the `./sample-app/index.js` file in your favorite text editor. The
 code is fully commented, so you should not have a hard time understanding
@@ -373,15 +372,21 @@ it.

 We can measure request durations with
 [percentiles](https://en.wikipedia.org/wiki/Quantile) or
-[averages](https://en.wikipedia.org/wiki/Arithmetic_mean). However,
-it's not recommended relying on averages to track request durations because
-averages can be very misleading (see the [References](#references) for a few
-posts on the pitfalls of averages and how percentiles can help).
+[averages](https://en.wikipedia.org/wiki/Arithmetic_mean). It's not
+recommended to rely on averages to track request durations because averages
+can be very misleading (see the [References](#references) for a few posts on
+the pitfalls of averages and how percentiles can help). A better way to
+measure durations is with percentiles, as they track the user experience
+more closely:
+
+![Percentiles as a way to measure user satisfaction](./img/percentiles.jpg)
+Source: [Twitter](https://twitter.com/rakyll/status/1045075510538035200)

 In Prometheus, we can generate percentiles with summaries or histograms.

 To show the differences between these two, our sample application exposes
-two custom metrics for measuring request durations with:
+two custom metrics for measuring request durations: one via a summary
+and the other via a histogram:

 ```js
 // Summary metric for measuring request durations
@@ -394,7 +399,7 @@ const requestDurationSummary = new prometheusClient.Summary({

   // Extra dimensions, or labels
   // HTTP method (GET, POST, etc), and status code (200, 500, etc)
-  labelNames: ['method', 'statuscode'],
+  labelNames: ['method', 'status'],

   // 50th (median), 75th, 90th, 95th, and 99th percentiles
   percentiles: [0.5, 0.75, 0.9, 0.95, 0.99]
@@ -410,19 +415,19 @@ const requestDurationHistogram = new prometheusClient.Histogram({

   // Extra dimensions, or labels
   // HTTP method (GET, POST, etc), and status code (200, 500, etc)
-  labelNames: ['method', 'statuscode'],
+  labelNames: ['method', 'status'],

   // Duration buckets, in seconds
   // 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s
   buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
 });
 ```

-As you can see, in a summary we specify the percentiles in which we
-want the Prometheus client to calculate and report latencies, while in a
-histogram we specify the buckets where the observed durations will be
-classified (i.e. a 300ms duration will be stored in the 250ms-500ms
-bucket).
+As you can see, in a summary we specify the percentiles for which we want the
+Prometheus client to calculate and report latencies, while in a histogram
+we specify the duration buckets in which the observed durations will be stored
+as a counter (i.e. a 300ms observation will be stored by incrementing the
+counter corresponding to the 250ms-500ms bucket).
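+
+You can inspect these cumulative bucket counters directly in the expression
+browser (a quick sketch; every observation increments all buckets whose `le`
+bound is greater than or equal to the observed duration):
+
+```sh
+# All observations that took 500ms or less (cumulative counter)
+sample_app_histogram_request_duration_seconds_bucket{le="0.5"}
+
+# Observations that landed between 250ms and 500ms
+sample_app_histogram_request_duration_seconds_bucket{le="0.5"} - ignoring(le) sample_app_histogram_request_duration_seconds_bucket{le="0.25"}
+```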

 Our sample application introduces a one-second delay in approximately 5%
 of requests, just so we can compare the average response time with
@@ -470,7 +475,7 @@ rate(sample_app_summary_request_duration_seconds_sum[15s]) / rate(sample_app_sum
 sample_app_summary_request_duration_seconds{quantile="0.99"}

 # 99th percentile (via histogram)
-histogram_quantile(0.99, sum(rate(sample_app_histogram_request_duration_seconds_bucket[15s])) by (le, method, statuscode))
+histogram_quantile(0.99, sum(rate(sample_app_histogram_request_duration_seconds_bucket[15s])) by (le, method, status))
 ```

 The result of these queries may seem surprising.
@@ -501,7 +506,7 @@ Quoting the [documentation]():
 > `histogram_quantile()` function.

 In other words, for the quantile estimation from the buckets of a
-histogram to be accurate, we need to be careful when picking the bucket
+histogram to be accurate, we need to be careful when choosing the bucket
 layout; if it doesn't match the range and distribution of the actual
 observed durations, you will get inaccurate quantiles as a result.

@@ -518,7 +523,7 @@ const requestDurationHistogram = new prometheusClient.Histogram({

   // Extra dimensions, or labels
   // HTTP method (GET, POST, etc), and status code (200, 500, etc)
-  labelNames: ['method', 'statuscode'],
+  labelNames: ['method', 'status'],

   // Duration buckets, in seconds
   // 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s
@@ -527,14 +532,14 @@ const requestDurationHistogram = new prometheusClient.Histogram({
 ```

 Here we are using an _exponential_ bucket configuration in which the buckets
-get larger as latency gets bigger. This is a widely used pattern; since we
+roughly double in size at every step. This is a widely used pattern; since we
 always expect our services to respond quickly (i.e. with response time
 between 0 and 300ms), we specify more buckets for that range, and fewer
 buckets for request durations we think are less likely to occur.

 According to the previous plot, all slow requests from our application
-is falling into the 1s-2.5s bucket, causing us to lose precision when
-calculating the 99th percentile.
+are falling into the 1s-2.5s bucket, resulting in this loss of precision
+when calculating the 99th percentile.

 Since we know our application will take at most ~1s to respond, we can
 choose a more appropriate bucket layout:
@@ -573,10 +578,10 @@ The reason is efficiency. Remember:

 **more buckets == more time series == more space == slower queries**

-Let's say you have an SLA to serve 99% of requests within 300ms. If all
-you want to know is whether you are honoring your SLA or not, it doesn't
-really matter if the quantile estimation is not accurate for requests
-slower than 300ms.
+Let's say you have an SLO (more details on SLOs later) to serve 99% of
+requests within 300ms. If all you want to know is whether you are
+honoring your SLO or not, it doesn't really matter if the quantile
+estimation is not accurate for requests slower than 300ms.
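+
+For example, the fraction of requests served within the SLO threshold can be
+read straight from the closest configured bucket (a sketch; the layout above
+has no 300ms bucket, so the 250ms one is used as an approximation):
+
+```sh
+# Fraction of requests served in ~300ms or less over the last minute
+sum(rate(sample_app_histogram_request_duration_seconds_bucket{le="0.25"}[1m])) by (job) / sum(rate(sample_app_histogram_request_duration_seconds_count[1m])) by (job)
+```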

 You might also be wondering: if summaries are more precise, why not use
 summaries instead of histograms?
@@ -593,7 +598,8 @@ with multiple replicas, you can safely use the `histogram_quantile()`
 function to calculate the 99th percentile across all requests to all
 replicas. You cannot do this with summaries. I mean, you can `avg()` the
 99th percentiles of all replicas, or take the `max()`, but the value you
-get will be statistically incorrect.
+get will be statistically incorrect and cannot be used as a proxy for the
+99th percentile.
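+
+For reference, the histogram-based aggregation across replicas looks something
+like this (a sketch; the `le` label must be kept in the `sum` for
+`histogram_quantile()` to work):
+
+```sh
+# 99th percentile over all replicas of the sample app
+histogram_quantile(0.99, sum(rate(sample_app_histogram_request_duration_seconds_bucket[1m])) by (le))
+```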

 ---

@@ -652,6 +658,108 @@ Grafana server at <http://localhost:3000>.

 ---

+### Measuring SLOs and Error Budgets
+
+> Managing service reliability is largely about managing risk, and managing risk
+> can be costly.
+>
+> 100% is probably never the right reliability target: not only is it impossible
+> to achieve, it's typically more reliability than a service's users want or
+> notice.
+
+SLOs, or _Service Level Objectives_, are one of the main tools employed by
+[Site Reliability Engineers (SREs)](https://landing.google.com/sre/books/) for
+making data-driven decisions about reliability.
+
+SLOs are based on SLIs, or _Service Level Indicators_, which are the key metrics
+that define how well (or how poorly) a given service is operating. Common SLIs
+would be the number of failed requests, the number of requests slower than some
+threshold, etc. Although different types of SLOs can be useful for different
+types of systems, most HTTP-based services will have SLOs that can be
+classified into two categories: **availability** and **latency**.
+
+For instance, let's say these are the SLOs for our sample application:
+
+| Category | SLI | SLO |
+|-|-|-|
+| Availability | The proportion of successful requests; any HTTP status other than 500-599 is considered successful | 95% of requests successful |
+| Latency | The proportion of requests with duration less than or equal to 100ms | 95% of requests under 100ms |
+
+The difference between 100% and the SLO is what we call the _Error Budget_.
+In this example, the error budget for both SLOs is 5%; if the application
+receives 1,000 requests during the SLO window (let's say one minute for the
+purposes of this tutorial), it means that 50 requests can fail and we'll
+still meet our SLO.
+
+But do we need additional metrics for keeping track of these SLOs? Probably
+not. If you are tracking request durations with a histogram (as we are here),
+chances are you don't need to do anything else. You already have all the
+metrics you need!
+
+Let's send a few requests to the server so we can play around with the metrics:
+
+```sh
+$ while true; do curl -s http://localhost:4000 > /dev/null ; done
+```
+
+```sh
+# Number of requests served in the SLO window
+sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job)
+
+# Number of requests that violated the latency SLO (all requests that took more than 100ms to be served)
+sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le="0.1"}[1m])) by (job)
+
+# Number of requests in the error budget: (100% - [SLO threshold]) * [number of requests served]
+(1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job)
+
+# Remaining requests in the error budget: [number of requests in the error budget] - [number of requests that violated the latency SLO]
+(1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - (sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le="0.1"}[1m])) by (job))
+
+# Remaining requests in the error budget as a ratio: ([number of requests in the error budget] - [number of requests that violated the SLO]) / [number of requests in the error budget]
+((1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - (sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le="0.1"}[1m])) by (job))) / ((1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job))
+```
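+
+The availability SLO can be tracked the same way (a sketch; it assumes failed
+requests show up with a 5xx value in the `status` label of the same histogram):
+
+```sh
+# Availability SLI: ratio of successful requests in the SLO window
+sum(increase(sample_app_histogram_request_duration_seconds_count{status!~"5.."}[1m])) by (job) / sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job)
+```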
+
+Due to the simulated scenario in which ~5% of requests take 1s to complete,
+if you try the last query you should see that the available budget is
+around 0%; that is, we have no more budget to spend and will inevitably
+break the latency SLO if more requests start to take longer to be served.
+This is not a good place to be.
+
+![Error Budget Burn Rate of 1x](./img/slo-1.png)
+
+But what if we had a stricter SLO, say, 99% instead of 95%? What would be
+the impact of these slow requests on the error budget?
+
+Just replace all `0.95` with `0.99` in that query to see what would happen:
+
+![Error Budget Burn Rate of 3x](./img/slo-2.png)
+
+In the previous scenario with the 95% SLO, the SLO _burn rate_ was ~1x, which
+means the whole error budget was being consumed during the SLO window, that is,
+in 60 seconds. Now, with the 99% SLO, the burn rate was ~3x, which means that
+instead of taking one minute to exhaust the error budget, it now takes
+only ~20 seconds!
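+
+If you want to see the burn rate itself rather than the remaining budget, it
+can be computed from the same metrics (a sketch; `0.95` is the SLO threshold,
+and a value of 1 means the budget is consumed exactly within the window):
+
+```sh
+# Latency SLO burn rate over the SLO window
+(sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le="0.1"}[1m])) by (job)) / ((1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job))
+```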
+
+Now change the `curl` to point to the `/metrics` endpoint, which does not have
+the simulated long latency for 5% of the requests, and you should see the error
+budget go back to 100% again:
+
+```sh
+$ while true; do curl -s http://localhost:4000/metrics > /dev/null ; done
+```
+
+![Error Budget Replenished](./img/slo-3.png)
+
+---
+
+**Want to know more?** The
+[Site Reliability Workbook](https://landing.google.com/sre/books/) is a great
+resource on this topic and includes more advanced concepts such as how to alert
+based on SLO burn rate as a way to improve alert precision/recall and
+detection/reset times.
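+
+As a taste of that idea, a burn-rate alert condition can be written against the
+same metrics (a rough sketch only; the 1h window and the `> 10` threshold are
+illustrative, not tuned, and the Workbook covers how to choose them):
+
+```sh
+# Page when the latency error budget is burning much faster than 1x
+(sum(increase(sample_app_histogram_request_duration_seconds_count[1h])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le="0.1"}[1h])) by (job)) / ((1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1h])) by (job)) > 10
+```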
+
+---
+
 ### Monitoring Applications Without a Metrics Endpoint

 We learned that Prometheus needs all applications to expose a `/metrics`
@@ -688,3 +796,4 @@ hard time when creating queries later.
 - [Blog Post: Understanding Machine CPU usage](https://www.robustperception.io/understanding-machine-cpu-usage/)
 - [Blog Post: #LatencyTipOfTheDay: You can't average percentiles. Period.](http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-you-cant-average.html)
 - [Blog Post: Why Averages Suck and Percentiles are Great](https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/)
+- [Site Reliability Engineering books](https://landing.google.com/sre/books/)
