@@ -15,11 +15,10 @@ me. Pull Requests are welcome!
- [Cleaning Up](#cleaning-up)
- [Prometheus Overview](#prometheus-overview)
- [Push vs Pull](#push-vs-pull)
- - [Metrics Endpoint](#metrics-endpoint)
- - [Time Series and Data Points](#time-series-and-data-points)
+ - [Metrics Endpoint](#metrics-endpoint)
- [Duplicate Metrics Names?](#duplicate-metrics-names)
- [Monitoring Uptime](#monitoring-uptime)
- - [A Basic Uptime Alert](#a-basic-uptime-alert)
+ - [A Basic Uptime Alert](#a-basic-uptime-alert)
- [Instrumenting Your Applications](#instrumenting-your-applications)
- [Measuring Request Durations](#measuring-request-durations)
- [Quantile Estimation Errors](#quantile-estimation-errors)
@@ -124,7 +123,7 @@ other tools in the monitoring space regarding scope, data model, and storage.
Now, if the application doesn't push metrics to the metrics server, how do
the application's metrics end up in Prometheus?

- #### Metrics Endpoint
+ ### Metrics Endpoint

Applications expose metrics to Prometheus via a _metrics endpoint_. To see how
this works, let's start everything by running `docker-compose up -d` if you
@@ -199,8 +198,6 @@ In this snippet alone we can notice a few interesting things:
But how does this text-based response turn into data points in a time series
database?

- ### Time Series and Data Points
-
The best way to understand this is by running a few simple queries.

Open the Prometheus UI at <http://localhost:9090/graph>, type
@@ -231,7 +228,7 @@ Prometheus UI):
| Element | Value |
|---------|-------|
- | process_resident_memory_bytes{instance="grafana:3000",job="grafana"} | 40861696@1530461477.446 43298816@1530461482.447 43778048@1530461487.451 44785664@1530461492.447 44785664@1530461497.447 45043712@1530461502.448 45043712@1530461507.448 45301760@1530461512.451 45301760@1530461517.448 45301760@1530461522.448 45895680@1530461527.448 45895680@1530461532.447 |
+ | process_resident_memory_bytes{instance="grafana:3000",job="grafana"} | 40861696@1530461477.446<br/> 43298816@1530461482.447<br/> 43778048@1530461487.451<br/> 44785664@1530461492.447<br/> 44785664@1530461497.447<br/> 45043712@1530461502.448<br/> 45043712@1530461507.448<br/> 45301760@1530461512.451<br/> 45301760@1530461517.448<br/> 45301760@1530461522.448<br/> 45895680@1530461527.448<br/> 45895680@1530461532.447 |
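Each entry in that result is a raw sample of the form `value@timestamp`: the gauge value in bytes, followed by the Unix timestamp (in seconds) at which it was scraped. As a quick aside, and only as a sketch rather than a workshop step, the same data can be selected with either an instant vector or a range vector in the _Console_ tab:

```
# Instant vector: the most recent sample of each matching time series
process_resident_memory_bytes

# Range vector: every raw sample from the last minute, listed as value@timestamp pairs
process_resident_memory_bytes[1m]
```

Notice that consecutive samples are roughly five seconds apart, which is simply how often this target is being scraped.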
### Duplicate Metrics Names?
@@ -247,7 +244,8 @@ Prometheus, and our sample application all export a gauge metric under the
same name. However, did you notice in the previous plot that somehow we were
able to get a separate time series from each application?

- Quoting the [documentation](https://prometheus.io/docs/concepts/jobs_instances/):
+ Quoting the
+ [documentation](https://prometheus.io/docs/concepts/jobs_instances/):

> In Prometheus terms, an endpoint you can scrape is called an **instance**,
> usually corresponding to a single process. A collection of instances with
@@ -264,8 +262,8 @@ exposing this metric, we can see three lines in that plot.
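Those `instance` and `job` labels are also what you use to narrow a query down to a single application. As a small illustration (not a required step; the label values are the ones we saw above for the Grafana target):

```
# One time series per instance/job combination exporting this metric
process_resident_memory_bytes

# Only the series scraped from the Grafana target
process_resident_memory_bytes{job="grafana"}
process_resident_memory_bytes{instance="grafana:3000"}
```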
### Monitoring Uptime

- For each instance scrape, Prometheus stores a `up` metric with the value `1` when
- the instance is healthy, i.e. reachable, or `0` if the scrape failed.
+ For each instance scrape, Prometheus stores an `up` metric with the value `1`
+ when the instance is healthy, i.e. reachable, or `0` if the scrape failed.

Try plotting the query `up` in the Prometheus UI.
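Beyond eyeballing that plot, `up` combines nicely with comparison operators and the `*_over_time` functions. A couple of illustrative queries (a sketch, not part of the workshop steps):

```
# Targets whose most recent scrape failed
up == 0

# Fraction of successful scrapes per target over the last hour (a rough uptime ratio)
avg_over_time(up[1h])
```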
@@ -291,7 +289,7 @@ handles usage (in %) for all targets? **Tip:** the metric names end with
---

- ##### A Basic Uptime Alert
+ #### A Basic Uptime Alert

We don't want to keep staring at dashboards on a big TV screen all day
to be able to quickly detect issues in our applications; after all, we have
@@ -307,7 +305,8 @@ OpsGenie). It also takes care of silencing and inhibition of alerts.
Configuring Alertmanager to send alerts to PagerDuty, or Slack, or whatever,
is out of the scope of this workshop, but we can still play around with alerts.

- Let's define our first alerting rule in `config/prometheus/prometheus.rules.yml`:
+ Let's define our first alerting rule in
+ `config/prometheus/prometheus.rules.yml`:

```yaml
# Uptime alerting rule
@@ -368,7 +367,8 @@ We can measure request durations with
[percentiles](https://en.wikipedia.org/wiki/Quantile) or
[averages](https://en.wikipedia.org/wiki/Arithmetic_mean). However,
it's not recommended to rely on averages to track request durations because
- averages can be very misleading (see the [References](#references) for a few posts on the pitfalls of averages and how percentiles can help).
+ averages can be very misleading (see the [References](#references) for a few
+ posts on the pitfalls of averages and how percentiles can help).

In Prometheus, we can generate percentiles with summaries or histograms.
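To make the comparison concrete, here is roughly what the two approaches look like in PromQL. The metric name `http_request_duration_seconds` is only a placeholder for this sketch, not necessarily what our sample application exports:

```
# Histogram: the 99th percentile is estimated at query time from the bucket counters
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Summary: quantiles are precomputed by the client and exposed as labelled series
http_request_duration_seconds{quantile="0.99"}
```

The key difference is where the estimation happens: summaries ship ready-made quantiles from the application, while histograms leave it to `histogram_quantile()` at query time.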
@@ -471,7 +471,8 @@ The result of these queries may seem surprising.
The first thing to notice is how the average response time fails to
communicate the actual behavior of the response duration distribution
(avg: 50ms; p99: 1s); the second is how the 99th percentile reported by the
- the summary (1s) is quite different than the one estimated by the `histogram_quantile()` function (~2.2s). How can this be?
+ summary (1s) is quite different from the one estimated by the
+ `histogram_quantile()` function (~2.2s). How can this be?

#### Quantile Estimation Errors