From 2306721fe880181e2a187f348a4ec1bcc301d20e Mon Sep 17 00:00:00 2001
From: Ruud van Asseldonk
Date: Wed, 22 May 2024 15:44:44 +0200
Subject: [PATCH] Make a start on the monitoring best practices

---
 src/SUMMARY.md                  |  2 +-
 src/node-software/monitoring.md | 64 ++++++++++++++++++++++++++++++---
 2 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index f2892cb..55c9128 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -19,5 +19,5 @@
  - [Communication channels]()
  - [Runtime environment]()
  - [Hardware requirements]()
- - [Monitoring]()
+ - [Monitoring](node-software/monitoring.md)
  - [Operator interface]()
diff --git a/src/node-software/monitoring.md b/src/node-software/monitoring.md
index 8ff470c..8141a88 100644
--- a/src/node-software/monitoring.md
+++ b/src/node-software/monitoring.md
@@ -1,8 +1,62 @@
 # Monitoring
 
-> This chapter is not ready yet, check back later.
+As [we described previously](../chorus-one/monitoring-alerting.md),
+at Chorus One we use [Prometheus][prometheus] for monitoring and alerting.
+This is the industry-standard monitoring protocol
+that is supported by most software we run.
 
-* Expose Prometheus metrics.
-* Follow Prometheus best practices around metric naming.
-* Have a way to disable telemetry.
-* We do not grant anybody SSH access to our infrastructure, period.
+[prometheus]: https://prometheus.io/
+
+## Prometheus
+
+#### Expose Prometheus metrics. {.p1 #expose-prometheus-metrics}
+
+To be able to monitor the node software,
+Prometheus needs a target to scrape.
+See [the Prometheus documentation][prometheus-instrumenting]
+for how to instrument your application.
+If your daemon already includes an RPC server,
+adding a `/metrics` endpoint there is usually the easiest way to go about it.
+Alternatively, a dedicated metrics port works fine too.
+
+While the set of metrics is of course application-specific,
+blockchain networks generally have a concept of the _block height_.
+Note that unless the block height is for a finalized fork,
+block height is generally a [gauge][prometheus-gauge]
+and not a [counter][prometheus-counter].
+
+[prometheus-instrumenting]: https://prometheus.io/docs/practices/instrumentation/
+[prometheus-gauge]: https://prometheus.io/docs/concepts/metric_types/#gauge
+[prometheus-counter]: https://prometheus.io/docs/concepts/metric_types/#counter
+
+#### Expose metrics privately. {.p1 #expose-metrics-privately}
+
+While _we_ want to scrape metrics,
+we don’t want to expose confidential information to third parties.
+It should be possible for the HTTP server that serves the `/metrics` endpoint
+to listen on a network interface that is not Internet-exposed.
+
+#### Respect Prometheus metric and label naming standards. {.p3 #respect-prometheus-standards}
+
+Prometheus [has an official standard for naming metrics and labels][prometheus-naming].
+Following the standard ensures that metrics are self-explanatory and easy to use,
+and that our alerting configuration is consistent and uniform. In particular:
+
+ * Prefix the metric with the name of your application.
+ * Metrics should use base units (bytes and seconds, not kilobytes or milliseconds).
+ * Metric names should have a suffix explaining the unit,
+   in plural (`_seconds`, `_bytes`).
+ * Accumulating counters should end in `_total`.
+
+[prometheus-naming]: https://prometheus.io/docs/practices/naming/
+
+
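
The three recommendations in this patch (expose a `/metrics` endpoint, keep it off the public Internet, and follow the naming standard) can be sketched together in a few lines. The example below is only an illustration and is not part of the patch: the application name `examplenode`, the metric names, the `STATE` values, and the helper functions are all hypothetical. It uses only the Python standard library and writes the Prometheus text exposition format by hand; a real daemon would more likely use an existing client library.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical application name, used as the prefix of every metric.
APP = "examplenode"

# Hypothetical values; a real node would read these from its own state.
STATE = {"block_height": 1234, "rpc_requests_total": 42, "database_size_bytes": 1048576}


def render_metrics(block_height: int, rpc_requests_total: int, database_size_bytes: int) -> str:
    """Render the metrics in the Prometheus text exposition format."""
    lines = [
        # Block height can go down on a reorg, so it is a gauge, not a counter.
        f"# HELP {APP}_block_height Height of the latest block seen by this node.",
        f"# TYPE {APP}_block_height gauge",
        f"{APP}_block_height {block_height}",
        # Accumulating counters end in _total.
        f"# HELP {APP}_rpc_requests_total RPC requests handled since startup.",
        f"# TYPE {APP}_rpc_requests_total counter",
        f"{APP}_rpc_requests_total {rpc_requests_total}",
        # Base unit (bytes) with a plural unit suffix.
        f"# HELP {APP}_database_size_bytes Size of the on-disk database.",
        f"# TYPE {APP}_database_size_bytes gauge",
        f"{APP}_database_size_bytes {database_size_bytes}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(**STATE).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep the example quiet; a real daemon would log requests.


def start_metrics_server(host: str = "127.0.0.1", port: int = 0) -> ThreadingHTTPServer:
    """Serve /metrics on a configurable address; loopback by default, port 0 picks a free port."""
    server = ThreadingHTTPServer((host, port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_metrics_server()
    port = server.server_address[1]
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics") as response:
        print(response.read().decode("utf-8"))
    server.shutdown()
```

Making the listen address a parameter is the point of the sketch: an operator can bind the endpoint to loopback or to a private interface, and Prometheus scrapes it from inside that network.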