```js
const requestDurationHistogram = new prometheusClient.Histogram({
  // ... (exponential bucket configuration)
});
```
Here we are using an _exponential_ bucket configuration in which the buckets
double in size at every step. This is a widely used pattern: since we
always expect our services to respond quickly (i.e. with response times
between 0 and 300ms), we specify more buckets for that range, and fewer
buckets for request durations we think are less likely to occur.

According to the previous plot, all slow requests from our application
are falling into the 1s-2.5s bucket, resulting in this loss of precision
when calculating the 99th percentile.

Since we know our application will take at most ~1s to respond, we can
choose a more appropriate bucket layout:

The reason is efficiency. Remember:

**more buckets == more time series == more space == slower queries**

Let's say you have an SLO (more details on SLOs later) to serve 99% of
requests within 300ms. If all you want to know is whether you are
honoring your SLO or not, it doesn't really matter if the quantile
estimation is not accurate for requests slower than 300ms.

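To make this concrete, here is a rough sketch of how such a check could look
in PromQL, using the request-duration histogram from this tutorial and
assuming its bucket layout includes a 300ms boundary (`le="0.3"`):

```sh
# Fraction of requests served within 300ms over the last minute;
# values at or above 0.99 mean the latency target is being met
sum(rate(sample_app_histogram_request_duration_seconds_bucket{le="0.3"}[1m]))
  /
sum(rate(sample_app_histogram_request_duration_seconds_count[1m]))
```
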
You might also be wondering: if summaries are more precise, why not use
summaries instead of histograms?

When running your application with multiple replicas, you can safely use
the `histogram_quantile()` function to calculate the 99th percentile across
all requests to all replicas. You cannot do this with summaries. I mean,
you can `avg()` the 99th percentiles of all replicas, or take the `max()`,
but the value you get will be statistically incorrect and cannot be used
as a proxy for the 99th percentile.

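As an illustration, here is a sketch of that aggregation for the
request-duration histogram used in this tutorial: the buckets are summed
across replicas first (preserving the `le` label), and only then is the
quantile estimated:

```sh
# 99th percentile latency across all replicas
histogram_quantile(
  0.99,
  sum by (le) (rate(sample_app_histogram_request_duration_seconds_bucket[1m]))
)
```
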
---
### Measuring SLOs and Error Budgets

> Managing service reliability is largely about managing risk, and managing risk
> can be costly.
>
> 100% is probably never the right reliability target: not only is it impossible
> to achieve, it's typically more reliability than a service's users want or
> notice.

SLOs, or _Service Level Objectives_, are one of the main tools employed by
[Site Reliability Engineers (SREs)](https://landing.google.com/sre/books/) for
making data-driven decisions about reliability.

SLOs are based on SLIs, or _Service Level Indicators_, which are the key metrics
that define how well (or how poorly) a given service is operating. Common SLIs
would be the number of failed requests, the number of requests slower than some
threshold, etc. Although different types of SLOs can be useful for different
types of systems, most HTTP-based services will have SLOs that can be
classified into two categories: **availability** and **latency**.

For instance, let's say these are the SLOs for our sample application:

| Category | SLI | SLO |
|-|-|-|
| Availability | The proportion of successful requests; any HTTP status other than 500-599 is considered successful | 95% successful requests |
| Latency | The proportion of requests with duration less than or equal to 100ms | 95% requests under 100ms |

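The rest of this section works through the latency SLO, since that is what our
histogram measures. For completeness, the availability SLI could be tracked
along these lines, assuming the application also exposed a request counter with
an HTTP status label; the metric name `sample_app_http_requests_total` and the
`code` label below are hypothetical:

```sh
# Proportion of successful (non-5xx) requests over the SLO window
# (hypothetical metric and label names)
sum(increase(sample_app_http_requests_total{code!~"5.."}[1m]))
  /
sum(increase(sample_app_http_requests_total[1m]))
```
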
The difference between 100% and the SLO is what we call the _Error Budget_.
In this example, the error budget for both SLOs is 5%; if the application
receives 1,000 requests during the SLO window (let's say one minute for the
purposes of this tutorial), it means that 50 requests can fail and we'll
still meet our SLO.

But do we need additional metrics for keeping track of these SLOs? Probably
not. If you are tracking request durations with a histogram (as we are here),
chances are you don't need to do anything else. You already have all the
metrics you need!

Let's send a few requests to the server so we can play around with the metrics:

```sh
$ while true; do curl -s http://localhost:4000 > /dev/null ; done
```

```sh
# Number of requests served in the SLO window
sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job)

# Number of requests that violated the latency SLO (all requests that took more than 100ms to be served)
sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le="0.1"}[1m])) by (job)

# Number of requests in the error budget: (100% - [slo threshold]) * [number of requests served]
(1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job)

# Remaining requests in the error budget: [number of requests in the error budget] - [number of requests that violated the latency SLO]
(1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - (sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le="0.1"}[1m])) by (job))

# Remaining requests in the error budget as a ratio: ([number of requests in the error budget] - [number of requests that violated the SLO]) / [number of requests in the error budget]
((1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - (sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le="0.1"}[1m])) by (job))) / ((1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job))
```

Due to the simulated scenario in which ~5% of requests take 1s to complete,
if you try the last query you should see that the average budget available
is around 0%; that is, we have no more budget to spend and will inevitably
break the latency SLO if more requests start to take more time to be served.
This is not a good place to be.



But what if we had a stricter SLO, say, 99% instead of 95%? What would be
the impact of these slow requests on the error budget?

Just replace all `0.95` with `0.99` in that query to see what would happen:

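For reference, the final ratio query with the stricter 99% target is identical
except for the budget factor:

```sh
# Remaining error budget ratio with a 99% latency SLO
((1 - 0.99) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - (sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le="0.1"}[1m])) by (job))) / ((1 - 0.99) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job))
```
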

736
+
737
+
In the previous scenario with the 95% SLO, the SLO _burn rate_ was ~1x, which
738
+
means the whole error budget was being consumed during the SLO window, that is,
739
+
in 60 seconds. Now, with the 99% SLO, the burn rate was ~3x, which means that
740
+
instead of taking one minute for the error budget to exhaust, it now takes
741
+
only ~20 seconds!
742
+
743
+
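If you prefer to see the burn rate itself rather than inferring it from the
remaining budget, a rough sketch of such a query divides the requests that
violated the latency SLO by the number the error budget allows (swap `0.99`
for `0.95` to get the original target):

```sh
# Latency SLO burn rate over the SLO window: ~1x means the budget is consumed
# exactly within the window, higher values mean it is exhausted faster
(sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le="0.1"}[1m])) by (job)) / ((1 - 0.99) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job))
```
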
Now change the `curl` to point to the `/metrics` endpoint, which does not have
the simulated long latency for 5% of the requests, and you should see the error
budget go back to 100% again:

```sh
$ while true; do curl -s http://localhost:4000/metrics > /dev/null ; done
```



---
**Want to know more?** The
[Site Reliability Workbook](https://landing.google.com/sre/books/) is a great
resource on this topic and includes more advanced concepts such as how to alert
based on SLO burn rate as a way to improve alert precision/recall and
detection/reset times.

---
### Monitoring Applications Without a Metrics Endpoint
We learned that Prometheus needs all applications to expose a `/metrics` endpoint…

- [Blog Post: Understanding Machine CPU usage](https://www.robustperception.io/understanding-machine-cpu-usage/)
- [Blog Post: #LatencyTipOfTheDay: You can't average percentiles. Period.](http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-you-cant-average.html)
- [Blog Post: Why Averages Suck and Percentiles are Great](https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/)