Address feedback about monitoring and release engineering #2

Merged · 3 commits · Jun 19, 2024
106 changes: 106 additions & 0 deletions src/node-software/monitoring.md
but we don’t want to expose confidential information to third parties.
It should be possible for the http server that serves the `/metrics` endpoint
to listen on a network interface that is not Internet-exposed.
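
A minimal sketch of what that could look like,
assuming Go and the `prometheus/client_golang` library
(the flag name and default port here are made up for illustration):

```go
package main

import (
	"flag"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Hypothetical flag: the default keeps /metrics on loopback, and the
	// operator can point it at any internal, non-public interface instead.
	listenAddr := flag.String(
		"metrics-listen-addr", "127.0.0.1:9184",
		"listen address for the Prometheus /metrics endpoint")
	flag.Parse()

	http.Handle("/metrics", promhttp.Handler())
	// Only the configured address is served; the node's Internet-facing
	// p2p and RPC listeners are unaffected.
	log.Fatal(http.ListenAndServe(*listenAddr, nil))
}
```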

#### Ensure that metrics are relevant and named appropriately. {.p1 #metrics-are-relevant}

For new projects, you will naturally only add metrics that measure something relevant.
For projects that fork existing node software, however,
we have encountered forks that kept exposing metrics that were no longer meaningful,
or that still exposed them under the name of the original software.
Just as clear but incorrect error messages are worse than vague error messages,
misleading metrics are more harmful than having no metrics at all:
“maybe the metrics are lying to us”
is far down the list of possible causes we consider when troubleshooting.

#### Respect Prometheus metric and label naming standards. {.p3 #respect-prometheus-standards}

Prometheus [has an official standard for naming metrics and labels][prometheus-naming].

[prometheus-naming]: https://prometheus.io/docs/practices/naming/
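
To make this concrete:
the linked guidelines include, among other things,
a single application prefix, base units (seconds, bytes) with the unit as a name suffix,
and a `_total` suffix for counters.
A sketch in Go with `client_golang`, using a made-up `mynode` prefix:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Counters get a _total suffix.
	BlocksProduced = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "mynode_blocks_produced_total",
		Help: "Number of blocks produced since the node started.",
	})
	// Durations use the base unit (seconds), with the unit as a suffix.
	BlockProcessingSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "mynode_block_processing_duration_seconds",
		Help: "Time spent processing a single block.",
	})
)
```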

#### If you expose system metrics, provide a way to disable them. {.p3 #system-metrics-can-be-disabled}

We already run the [Prometheus node exporter][node-exporter] on our hosts.
Exposing that same information from the node software unnecessarily bloats `/metrics` responses,
which puts strain on our bandwidth and storage,
and collecting the information can make the `/metrics` endpoint slow.

[node-exporter]: https://prometheus.io/docs/guides/node-exporter/
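
One possible way to honor this,
sketched in Go with `client_golang`
(the function and the flag it mentions are hypothetical),
is to keep process- and runtime-level collectors behind an opt-in switch on a custom registry:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

// NewRegistry builds the registry backing the /metrics endpoint.
// System-level collectors are only added when explicitly enabled,
// e.g. via a hypothetical --enable-system-metrics flag.
func NewRegistry(enableSystemMetrics bool) *prometheus.Registry {
	registry := prometheus.NewRegistry()
	// Application-specific metrics would be registered here unconditionally.
	if enableSystemMetrics {
		registry.MustRegister(
			collectors.NewGoCollector(),
			collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
		)
	}
	return registry
}
```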

#### Expose the node software version as a metric. {.p3 #version-metric}

For automating rollouts,
for monitoring manual rollouts,
and for observability and troubleshooting in general,
it is useful for us to be able to identify at runtime which version is running.
When you run a single instance this is easy to track externally,
but when you run a dozen nodes,
it’s easy to lose track of which version runs where.
A version metric
(with value `1` and the version string as a label)
is one of the most convenient ways to expose this information.
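
A minimal sketch of this pattern in Go with `client_golang`;
the metric and label names are only a suggestion:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// buildInfo follows the common "info metric" pattern: the value is
// always 1 and the interesting data lives in the labels.
var buildInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "mynode_build_info",
	Help: "Build information about the running node software.",
}, []string{"version"})

func init() {
	prometheus.MustRegister(buildInfo)
	// In a real binary the version string would be injected at build time;
	// the literal here is only for illustration.
	buildInfo.WithLabelValues("1.2.3").Set(1)
}
```

On the `/metrics` endpoint this shows up as something like `mynode_build_info{version="1.2.3"} 1`,
which can also be joined against other series in queries.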

#### Expose the validator identity as a metric. {.p3 #identity-metric}

As with the version,
when managing multiple nodes
it is useful to know at runtime which identity (address or pubkey) runs where,
and a convenient place to expose this is again a Prometheus metric.
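
The same info-metric pattern works for the identity;
again, the names below are hypothetical:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// validatorIdentity uses the same info-metric pattern: a constant value
// of 1, with the data in a label.
var validatorIdentity = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "mynode_validator_identity",
	Help: "The validator identity (address or pubkey) this node runs with.",
}, []string{"identity"})

func init() {
	prometheus.MustRegister(validatorIdentity)
	// The identity would come from the node's keystore or configuration.
	validatorIdentity.WithLabelValues("example-pubkey").Set(1)
}
```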

<!--
TODO: It should *also* be part of the RPC,
cross-reference that after I write the chapter about RPC interface.
-->

## Health

#### Expose an endpoint for health checks. {.p2 #health-endpoint}

For automating restarts and failover,
and for load balancing across RPC nodes,
it is useful to have an endpoint where the node software
reports its own view on whether it is healthy and in sync with the network.
A convenient way to do this is with a `/health` or `/status` http endpoint
on the RPC interface.

Ideally the application should respond on that endpoint
even during the startup phase and report startup progress there.
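
A minimal sketch of such an endpoint in Go;
the JSON fields and the exact status-code policy are invented for illustration,
not a schema we require:

```go
package health

import (
	"encoding/json"
	"net/http"
)

// Status is the node's own view of its health; the fields are invented
// for illustration.
type Status struct {
	Healthy      bool   `json:"healthy"`
	Synced       bool   `json:"synced"`
	Phase        string `json:"phase"` // e.g. "starting", "catching_up", "synced"
	BlocksBehind uint64 `json:"blocks_behind"`
}

// Handler serves /health. It responds even during startup (reporting
// progress in the body) and encodes health in the status code, so load
// balancers and orchestrators can act on it without parsing JSON.
func Handler(current func() Status) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		status := current()
		code := http.StatusOK
		if !status.Healthy || !status.Synced {
			code = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(status)
	})
}
```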

## On-chain metrics

It is essential to have metrics exposed by the node software,
but this can only give us a _local_ view.
[We need to have a _global_ view as well.][monitoring-global]
For example,
a validator may be performing its duties
(such as producing blocks, voting, or attesting),
but end up in a minority network partition
that causes the majority of the network to view the validator as delinquent.

When information about a validator is stored on-chain,
there is a single source of truth about whether the validator performed its duties,
and that fact becomes finalized through consensus.
For example,
for networks that have a known leader assigned to every slot,
whether the block was produced or not is a property of the chain
that all honest nodes agree on.
Some networks additionally store heartbeats or consensus votes on-chain.

We need a way to monitor those on-chain events
to measure our own performance.
This can be built into the node software
(so we can run multiple nodes that monitor each other),
or it can be an external tool that connects to an RPC node
and exposes Prometheus metrics about on-chain events.

[monitoring-global]: ../chorus-one/monitoring-alerting.md#local-and-global-views

#### Provide a way to monitor on-chain metrics. {.p3 #on-chain-monitoring}

Ideally,
we would have Prometheus metrics
about whether a validator identity has been performing its duties,
exposed from an independent place that is not that validator itself.
For most networks such exporters are standalone applications,
but integrating this functionality into the node software can also work.
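
As a sketch of the standalone-exporter approach
(the RPC call, metric names, endpoint, and polling interval below are all hypothetical
and would be network-specific):

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Gauges describing on-chain facts about a validator, observed through
// an RPC node that is independent of the validator itself.
var (
	slotsAssigned = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "chainexporter_validator_slots_assigned",
		Help: "Slots assigned to the validator in the current epoch, per the chain.",
	}, []string{"identity"})
	slotsFilled = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "chainexporter_validator_slots_filled",
		Help: "Assigned slots in which a block was produced, per the chain.",
	}, []string{"identity"})
)

// queryChain stands in for a network-specific RPC call; it is not a real API.
func queryChain(rpcURL, identity string) (assigned, filled float64, err error) {
	// ... perform the RPC request against rpcURL here ...
	return 0, 0, nil
}

func main() {
	prometheus.MustRegister(slotsAssigned, slotsFilled)

	rpcURL := "http://localhost:8545"   // hypothetical RPC endpoint
	identity := "example-validator-key" // hypothetical identity

	go func() {
		for range time.Tick(30 * time.Second) {
			assigned, filled, err := queryChain(rpcURL, identity)
			if err != nil {
				log.Printf("querying chain: %v", err)
				continue
			}
			slotsAssigned.WithLabelValues(identity).Set(assigned)
			slotsFilled.WithLabelValues(identity).Set(filled)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("127.0.0.1:9800", nil))
}
```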

Good monitoring and observability tools are a public good
that benefits all validators.
Observability is a core requirement for us,
but we realize that it may not be a top priority for node software authors.
We are happy to contribute here
and to work with you upstream to improve or develop
open source monitoring solutions
that benefit the wider ecosystem.

## Telemetry

We understand that node software authors
26 changes: 26 additions & 0 deletions src/node-software/release-engineering.md
is an invite-only Discord channel
where many kinds of announcements are shared,
in addition to release announcements.

#### Clearly mark breaking changes. {.p2 #mark-breaking-changes}

When we update to a new version,
we need to know whether any additional action is required from us.
For example, when command-line flags are renamed or removed,
or when the schema of a configuration file changes,
the new node software will fail to start
unless we update our configuration.
To minimize downtime,
we would rather learn about such changes _before_ we perform the update.
Even when the node is able to start,
changes to e.g. metric names or the RPC API still affect us.

Ideally, breaking changes are part of [a changelog](#keep-a-changelog),
clearly highlighted so that they stand out from ordinary changes.
If you don’t keep a changelog,
you can list breaking changes on e.g. the GitHub releases page,
or in the release announcement itself.

#### Clearly announce deadlines. {.p2 #mark-deadlines}

When an update has a deadline,
for example for a hard fork,
clearly state that deadline.
When possible, include both a date/time and a block height,
as well as a URL for where the update is coordinated.
Make sure to [publish the release far enough ahead of the deadline](#publish-headroom).

#### Keep a changelog. {.p3 #keep-a-changelog}

For us node operators,
the first thing we wonder when we see a new release is: