Make a start on the monitoring best practices

ChorusOne · May 22, 2024 · 2306721 · 2306721
1 parent 0fd7667
commit 2306721
Show file tree

Hide file tree

Showing 2 changed files with 60 additions and 6 deletions.
diff --git a/src/SUMMARY.md b/src/SUMMARY.md
@@ -19,5 +19,5 @@
    - [Communication channels]() <!-- node-software/communication-channels.md -->
    - [Runtime environment]() <!-- node-software/runtime-environment.md -->
    - [Hardware requirements]() <!-- node-software/hardware-requirements.md -->
-   - [Monitoring]() <!-- (node-software/monitoring.md) -->
+   - [Monitoring](node-software/monitoring.md)
    - [Operator interface]() <!-- (node-software/operator-interface.md) -->
diff --git a/src/node-software/monitoring.md b/src/node-software/monitoring.md
@@ -1,8 +1,62 @@
 # Monitoring
 
-> This chapter is not ready yet, check back later.
+As [we described previously](../chorus-one/monitoring-alerting.md),
+at Chorus One we use [Prometheus][prometheus] for monitoring and alerting.
+This is the industry-standard monitoring protocol
+that is supported by most software we run.
 
-* Expose Prometheus metrics.
-* Follow Prometheus best practices around metric naming.
-* Have a way to disable telemetry.
-* We do not grant anybody SSH access to our infrastructure, period.
+[prometheus]: https://prometheus.io/
+
+## Prometheus
+
+#### Expose Prometheus metrics. {.p1 #expose-prometheus-metrics}
+
+To be able to monitor the node software,
+Prometheus needs a target to scrape.
+See [the Prometheus documentation][prometheus-instrumenting]
+for how to instrument your application.
+If your daemon already includes an RPC server,
+adding a `/metrics` endpoint there is usually the easiest way to go about it.
+Alternatively, a dedicated metrics port works fine too.
+
+While the set of metrics is of course application-specific,
+blockchain networks generally have a concept of the _block height_.
+Note that unless the block height is for a finalized fork,
+block height is generally a [gauge][prometheus-gauge]
+and not a [counter][prometheus-counter].
+
+[prometheus-instrumenting]: https://prometheus.io/docs/practices/instrumentation/
+[prometheus-gauge]: https://prometheus.io/docs/concepts/metric_types/#gauge
+[prometheus-counter]: https://prometheus.io/docs/concepts/metric_types/#counter
+
+#### Expose metrics privately. {.p1 #expose-metrics-privately}
+
+While _we_ want to scape metrics,
+we don’t want to expose confidential information to third parties.
+It should be possible for the http server that serves the `/metrics` endpoint
+to listen on a network interface that is not Internet-exposed.
+
+#### Respect Prometheus metric and label naming standards. {.p3 #respect-prometheus-standards}
+
+Prometheus [has an official standard for naming metrics and labels][prometheus-naming].
+Following the standard ensures that metrics are self-explanatory and easy to use,
+and that our alerting configuration is consistent and uniform. In particular:
+
+ * Prefix the metric with the name of your application.
+ * Metrics should use base units (bytes and seconds, not kilobytes or milliseconds).
+ * Metric names should have a suffix explaining the unit,
+   in plural (`_seconds`, `_bytes`).
+ * Accumulating counters should end in `_total`.
+
+[prometheus-naming]: https://prometheus.io/docs/practices/naming/
+
+<!-- TODO: Finish this section
+
+## Telementry
+
+TODO.
+* Have a way to disable.
+* We are fine to share on incentivized testnets.
+* We do not grant SSH access to our infrastructure, period.
+
+-->