@@ -5,10 +5,11 @@ sort_rank: 4
# Histograms and summaries

- Histograms and summaries are more complex metric types. Not only
- creates a single histogram or summary a multitude of time series, it
- is also more difficult to use them correctly. This section helps you
- to pick and configure the appropriate metric type for your use case.
+ Histograms and summaries are more complex metric types. Not only does
+ a single histogram or summary create a multitude of time series, it is
+ also more difficult to use these metric types correctly. This section
+ helps you to pick and configure the appropriate metric type for your
+ use case.

## Library support
@@ -18,13 +19,7 @@ First of all, check the library support for
both currently only exists in the Go client library. Many libraries
support only one of the two types, or they support summaries only in a
limited fashion (lacking [quantile
- calculation](#quantiles)). [Contributions are welcome](/community/),
- of course. In general, we expect histograms to be more urgently needed
- than summaries. Histograms are also easier to implement in a client
- library, so we recommend to implement histograms first, if in
- doubt. The reason why some libraries offer summaries but not
- histograms (Ruby, the legacy Java client) is that histograms are a
- more recent feature of Prometheus.
+ calculation](#quantiles)).

## Count and sum of observations
@@ -35,20 +30,20 @@ durations or response sizes. They track the number of observations
(showing up in Prometheus as a time series with a `_count` suffix) is
inherently a counter (as described above, it only goes up). The sum of
observations (showing up as a time series with a `_sum` suffix)
- behaves like a counter, too, as long as all observations are
- positive. Obviously, request durations or response sizes are always
- positive. In principle, however, you can use summaries and histograms
- to observe negative values (e.g. temperatures in centigrade). In that
- case, the sum of observations can go down, so you cannot apply
- `rate()` to it anymore.
+ behaves like a counter, too, as long as there are no negative
+ observations. Obviously, request durations or response sizes are
+ never negative. In principle, however, you can use summaries and
+ histograms to observe negative values (e.g. temperatures in
+ centigrade). In that case, the sum of observations can go down, so you
+ cannot apply `rate()` to it anymore.

To calculate the average request duration during the last 5 minutes
- from a histogram or summary called `http_request_duration_second`, use
- the following expression:
+ from a histogram or summary called `http_request_duration_seconds`,
+ use the following expression:

- rate(http_request_duration_seconds_sum[5m])
- /
- rate(http_request_duration_seconds_count[5m])
+ rate(http_request_duration_seconds_sum[5m])
+ /
+ rate(http_request_duration_seconds_count[5m])

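Dividing the two rates yields the average because the per-second rate of the `_sum` series is seconds of observed latency per second, while the rate of the `_count` series is requests per second. A quick sanity check of that arithmetic, using made-up counter values rather than real scrape data:

```python
# Hypothetical counter values sampled 5 minutes apart (made-up numbers).
window = 300.0                                # seconds covered by the [5m] range
sum_before, sum_after = 1000.0, 1090.0        # http_request_duration_seconds_sum
count_before, count_after = 5000.0, 5300.0    # http_request_duration_seconds_count

rate_sum = (sum_after - sum_before) / window        # latency seconds per second
rate_count = (count_after - count_before) / window  # requests per second

avg_duration = rate_sum / rate_count  # average request duration over the window
print(avg_duration)  # 0.3, i.e. a 300ms average
```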
## Apdex score
@@ -64,24 +59,25 @@ requests served within 300ms and easily alert if the value drops below
served in the last 5 minutes. The request durations were collected with
a histogram called `http_request_duration_seconds`.

- sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
- /
- sum(rate(http_request_duration_seconds_count[5m])) by (job)
+ sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
+ /
+ sum(rate(http_request_duration_seconds_count[5m])) by (job)

You can calculate the well-known [Apdex
score](http://en.wikipedia.org/wiki/Apdex) in a similar way. Configure
- a bucket with the target request duration as upper bound and another
- bucket with the tolerated request duration (usually 4 times the target
- request duration) as upper bound. Example: The target request duration
- is 300ms. The tolerable request duration is 1.2s. The following
- expression yields the Apdex score over the last 5 minutes:
+ a bucket with the target request duration as the upper bound and
+ another bucket with the tolerated request duration (usually 4 times
+ the target request duration) as the upper bound. Example: The target
+ request duration is 300ms. The tolerable request duration is 1.2s. The
+ following expression yields the Apdex score for each job over the last
+ 5 minutes:

(
- rate(http_request_duration_seconds_bucket{le="0.3"}[5m])
- +
- rate(http_request_duration_seconds_bucket{le="1.2"}[5m])
- ) / 2 / rate(http_request_duration_seconds_count[5m])
+ sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
+ +
+ sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m])) by (job)
+ ) / 2 / sum(rate(http_request_duration_seconds_count[5m])) by (job)

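Dividing by 2 works because Prometheus histogram buckets are cumulative: the `le="1.2"` bucket also counts every request already in the `le="0.3"` bucket, so the expression reproduces the classic Apdex weighting of satisfied requests with 1 and tolerating requests with 0.5. A sketch of the arithmetic with made-up per-second rates:

```python
# Hypothetical per-second request rates over the last 5 minutes (made-up numbers).
satisfied = 90.0    # faster than the 300ms target
tolerating = 6.0    # between 300ms and the tolerable 1.2s
frustrated = 4.0    # slower than 1.2s
total = satisfied + tolerating + frustrated

# Cumulative buckets, as Prometheus exposes them:
bucket_le_0_3 = satisfied
bucket_le_1_2 = satisfied + tolerating

# The expression from the text: (bucket_0.3 + bucket_1.2) / 2 / total ...
apdex_from_buckets = (bucket_le_0_3 + bucket_le_1_2) / 2 / total

# ... equals the textbook Apdex: (satisfied + tolerating / 2) / total.
apdex_textbook = (satisfied + tolerating / 2) / total
print(apdex_from_buckets, apdex_textbook)  # 0.93 0.93
```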
## Quantiles
@@ -92,7 +88,7 @@ known as the median. The 0.95-quantile is the 95th percentile.

The essential difference between summaries and histograms is that summaries
calculate streaming φ-quantiles on the client side and expose them directly,
- while histograms expose bucketed observations counts and the calculation of
+ while histograms expose bucketed observation counts and the calculation of
quantiles from the buckets of a histogram happens on the server side using the
[`histogram_quantile()`
function](/docs/querying/functions/#histogram_quantile()).
@@ -115,8 +111,8 @@ want to display the percentage of requests served within 300ms, but
instead the 95th percentile, i.e. the request duration within which
you have served 95% of requests. To do that, you can either configure
a summary with a 0.95-quantile and (for example) a 5-minute decay
- time-window, or you configure a histogram with a few buckets around
- the 300ms mark, e.g. `{le="0.1"}`, `{le="0.2"}`, `{le="0.3"}`, and
+ time, or you configure a histogram with a few buckets around the 300ms
+ mark, e.g. `{le="0.1"}`, `{le="0.2"}`, `{le="0.3"}`, and
`{le="0.45"}`. If your service runs replicated with a number of
instances, you will collect request durations from every single one of
them, and then you want to aggregate everything into an overall 95th
@@ -157,11 +153,11 @@ quantile gives you the impression that you are close to breaking the
SLA, but in reality, the 95th percentile is a tiny bit above 220ms,
a quite comfortable distance to your SLA.

- Next step in our *Gedenkenexperiment*: A change in backend routing
- adds a fixed amount of 100ms to all requent durations. Now the request
+ Next step in our thought experiment: A change in backend routing
+ adds a fixed amount of 100ms to all request durations. Now the request
duration has its sharp spike at 320ms and almost all observations will
fall into the bucket from 300ms to 450ms. The 95th percentile is
- calculated to be 442.5ms, although the correct values is close to
+ calculated to be 442.5ms, although the correct value is close to
320ms. While you are only a tiny bit outside of your SLA, the
calculated 95th quantile looks much worse.
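The 442.5ms figure follows from the linear interpolation that `histogram_quantile()` performs inside a bucket: it assumes observations are spread evenly between the bucket boundaries. A simplified re-implementation of that interpolation for illustration (not the actual PromQL code), assuming for simplicity that every observation landed in the 300ms–450ms bucket:

```python
def interpolate_quantile(q, lower, upper, count_below, bucket_count, total):
    """Linear interpolation within one histogram bucket: assume the
    bucket's observations are evenly distributed between lower and upper."""
    rank = q * total                             # position of the requested quantile
    fraction = (rank - count_below) / bucket_count
    return lower + (upper - lower) * fraction

# All 10000 observations sit in the 300ms-450ms bucket (sharp spike at 320ms):
p95 = interpolate_quantile(0.95, lower=0.3, upper=0.45,
                           count_below=0, bucket_count=10000, total=10000)
print(round(p95, 4))  # 0.4425, i.e. 442.5ms, although the true value is near 320ms
```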
@@ -213,8 +209,18 @@ Two rules of thumb:

1. If you need to aggregate, choose histograms.

- 2. Otherwise, choose a histogram if you need accuracy in the
- dimension of the observed values and you have an idea in which
- ranges of observed values you are interested in. Choose a summary
- if you need accuracy in the dimension of φ, no matter in which
- ranges of observed values the quantile will end up.
+ 2. Otherwise, choose a histogram if you have an idea of the range
+ and distribution of values that will be observed. Choose a
+ summary if you need an accurate quantile, no matter what the
+ range and distribution of the values is.
+
+
+ ## What can I do if my client library does not support the metric type I need?
+
+ Implement it! [Code contributions are welcome](/community/). In
+ general, we expect histograms to be more urgently needed than
+ summaries. Histograms are also easier to implement in a client
+ library, so we recommend to implement histograms first, if in
+ doubt. The reason why some libraries offer summaries but not
+ histograms (the Ruby client and the legacy Java client) is that
+ histograms are a more recent feature of Prometheus.