Blog: why histograms? #2658

Merged
merged 13 commits into from
May 8, 2023
Apply suggestions from code review
Co-authored-by: Phillip Carter <pcarter@fastmail.com>
2 people authored and chalin committed May 7, 2023
commit 3f808434e523b08bf7e2c5a9c3690c5d2eaed68a
34 changes: 16 additions & 18 deletions content/en/blog/2023/why-histograms/index.md
@@ -7,23 +7,21 @@ author: '[Daniel Dyla](https://github.com/dyladan)'
<i>Originally posted to
https://dyladan.me/histograms/2023/05/02/why-histograms/</i>

-A histogram is a multi-value counter which summarizes the distribution of one or
-more data points. For example, a histogram may have 3 counters which count the
+A histogram is a multi-value counter that summarizes the distribution of data points.
+For example, a histogram may have 3 counters which count the
occurrences of negative, positive, and zero values respectively. Given a series
of numbers, `3`, `-9`, `7`, `6`, `0`, and `-1`, the histogram would count `2`
negative, `1` zero, and `3` positive values. A single histogram data point is
most commonly represented as a bar chart.

![histogram point as bar chart](hist-point.png "A single histogram point plotted as a bar chart with 3 buckets titled 'Positivity of numbers'. The first bucket shows negative numbers and has a height of 2. The second bucket shows zero values and has a height of 1. The third bucket shows positive values and has a height of 3.")
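
The three-bucket example above can be reproduced in a few lines of Python. This is only an illustrative sketch: the function name `sign_histogram` and the bucket labels are hypothetical and are not part of the post or of any OpenTelemetry API.

```python
def sign_histogram(values):
    """Count how many values are negative, zero, or positive."""
    # One counter per bucket, matching the three buckets in the example.
    counts = {"negative": 0, "zero": 0, "positive": 0}
    for value in values:
        if value < 0:
            counts["negative"] += 1
        elif value == 0:
            counts["zero"] += 1
        else:
            counts["positive"] += 1
    return counts


# The series used in the text: 3, -9, 7, 6, 0, and -1.
print(sign_histogram([3, -9, 7, 6, 0, -1]))
# {'negative': 2, 'zero': 1, 'positive': 3}
```

The individual values are discarded once counted; only the three bucket counts are kept, which is what makes a histogram cheap to export.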

-This simple example has only 3 possible output values, but it is not uncommon to
-have many more in a single histogram. Similarly, this shows only 1 histogram
-data point, but in the world of observability data is usually constantly
-exported. For example, an application may export a histogram every minute which
-summarizes a metric for the previous minute. In this way, you can study how the
-distribution of your data changes over time.
+The above example has only 3 possible output values, but it is common to
+have many more in a single histogram. A real-world application typically exports
+a histogram every minute that summarizes a metric for the previous minute.
+By using histograms this way, you can study how the distribution of your data changes over time.

-# What are histograms for?
+## What are histograms for?

There are many uses for histograms, but their power comes from the ability to
efficiently answer queries about the distribution of your data. These queries
@@ -33,7 +31,7 @@ shorthand like `p50` for the 50th percentile or 0.5-quantile, also known as the
median. More generally, the φ-quantile is the observation value that ranks at
number φ\*N among the N observations.
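
The rank-based definition of the φ-quantile can be sketched directly on raw observations. This is a minimal illustration, not how a metrics backend computes quantiles from histogram buckets. The function name `quantile_by_rank` is hypothetical, and rounding φ\*N up to the next whole rank is an assumption, since the text does not say how fractional ranks are handled.

```python
import math


def quantile_by_rank(observations, phi):
    """Return the observation ranked at position phi * N among N observations."""
    if not observations:
        raise ValueError("at least one observation is required")
    if not 0 <= phi <= 1:
        raise ValueError("phi must be between 0 and 1")
    ordered = sorted(observations)
    # Assumption: fractional ranks are rounded up, and ranks start at 1.
    rank = max(1, math.ceil(phi * len(ordered)))
    return ordered[rank - 1]


# Made-up response times in milliseconds.
response_times = [12, 25, 31, 48, 95]
print(quantile_by_rank(response_times, 0.5))   # p50 -> 31
print(quantile_by_rank(response_times, 0.99))  # p99 -> 95
```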

-# Why is that useful?
+## Why are Histograms useful?

The most common use-case for histograms in the observability space is defining
service level objectives (SLOs). One example of such an SLO might be ">=99% of
@@ -50,7 +48,7 @@ under 90ms.

![p99, p90, and p50 plotted as lines](hist-lines.png "p99, p90, and p50 plotted as a line chart with title 'response times.' Time is on the x-axis and response times in milliseconds on the y-axis. p99 response times are around 80 milliseconds. p90 response times are between 60 and 80 milliseconds. p50 response times are between 20 and 30 milliseconds.")

-# Other metric types
+## Other metric types

Another solution might be to define the SLOs you're interested in and collect
them as non-histogram metrics in advance as gauges or counters. This approach
@@ -65,16 +63,16 @@ already collected. Particularly with exponential histograms, arbitrary
distribution queries can be made with very low relative error rates and minimal
resource consumption on both the client and the analysis backend.

-The inflexibility of this approach also impacts your ability to gauge impact
-when your SLO is violated. For example, imagine you are collecting a gauge which
+The inflexibility of not using histograms for SLOs also impacts your ability to gauge impact
+when your SLO is violated. For example, imagine you are collecting a gauge that
calculates the `p99` of some metric and you define an SLO based on it. When your
SLO is violated and an alert is triggered, how do you know it is really only
-affecting 1% of queries, 10%, or 50%? A histogram would allow you to answer that
-question by simply querying the percentiles you're interested in. You could
-collect additional gauges for each percentile, but then you've just force users
-to reimplement histograms on their own. Probably poorly.
+affecting 1% of queries, 10%, or 50%? A histogram allows you to answer that
+question by querying the percentiles you're interested in. You could technically
+collect additional gauges for each percentile, but that's just an ad-hoc reimplementation
+of histograms anyways, so you're better off using histograms.
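
To illustrate the kind of question a single `p99` gauge cannot answer, here is a rough sketch that derives the share of requests above a latency threshold from one histogram data point. The bucket layout (ascending upper bounds plus an overflow bucket), the sample counts, and the function name `share_above` are all assumptions made for this example. To keep the answer exact, the threshold must coincide with a bucket boundary.

```python
def share_above(bounds, counts, threshold):
    """Return the fraction of observations greater than `threshold`.

    `bounds` holds ascending, inclusive upper bounds of the buckets.
    `counts` has one extra entry at the end for values above the last bound.
    """
    if len(counts) != len(bounds) + 1:
        raise ValueError("counts needs one entry per bound plus an overflow bucket")
    if threshold not in bounds:
        raise ValueError("threshold must be a bucket boundary in this sketch")
    total = sum(counts)
    if total == 0:
        return 0.0
    # Every bucket after the one ending at `threshold` holds larger values.
    first_bucket_above = bounds.index(threshold) + 1
    return sum(counts[first_bucket_above:]) / total


# Hypothetical latency histogram: <=50ms, <=100ms, <=250ms, <=500ms, >500ms.
bounds = [50, 100, 250, 500]
counts = [700, 200, 60, 30, 10]
print(share_above(bounds, counts, 100))  # 0.1, so 10% of requests took over 100ms
```

With a gauge that reports only `p99`, the same question would require a new metric to be defined and collected before the incident.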

-# Other data sources and metric types
+## Other data sources and metric types

You may ask why you would report a separate metric rather than calculating these
metrics from your existing log and trace data? While it is true that for _some_