Address feedback about monitoring and release engineering #2

Merged · 3 commits · Jun 19, 2024
106 changes: 106 additions & 0 deletions src/node-software/monitoring.md
but we don’t want to expose confidential information to third parties.
It should be possible for the http server that serves the `/metrics` endpoint
to listen on a network interface that is not Internet-exposed.
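
A minimal sketch of what that could look like,
assuming Go and the `prometheus/client_golang` library
(the flag name and default port here are made up for illustration):

```go
package main

import (
	"flag"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Hypothetical flag: the default keeps /metrics on loopback, and the
	// operator can point it at any internal, non-public interface instead.
	listenAddr := flag.String(
		"metrics-listen-addr", "127.0.0.1:9184",
		"listen address for the Prometheus /metrics endpoint")
	flag.Parse()

	http.Handle("/metrics", promhttp.Handler())
	// Only the configured address is served; the node's Internet-facing
	// p2p and RPC listeners are unaffected.
	log.Fatal(http.ListenAndServe(*listenAddr, nil))
}
```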

#### Ensure that metrics are relevant and named appropriately. {.p1 #metrics-are-relevant}

For new projects, you will naturally only add metrics that measure something relevant.
For projects that fork existing node software, however,
we have encountered forks that kept exposing metrics that were no longer meaningful,
or that still exposed them under the name of the original software.
Just as clear but incorrect error messages are worse than vague error messages,
misleading metrics are more harmful than having no metrics at all:
“maybe the metrics are lying to us”
is far down the list of possible causes we consider when troubleshooting.

#### Respect Prometheus metric and label naming standards. {.p3 #respect-prometheus-standards}

Prometheus [has an official standard for naming metrics and labels][prometheus-naming].

[prometheus-naming]: https://prometheus.io/docs/practices/naming/
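
To make this concrete:
the linked guidelines include, among other things,
a single application prefix, base units (seconds, bytes) with the unit as a name suffix,
and a `_total` suffix for counters.
A sketch in Go with `client_golang`, using a made-up `mynode` prefix:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Counters get a _total suffix.
	BlocksProduced = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "mynode_blocks_produced_total",
		Help: "Number of blocks produced since the node started.",
	})
	// Durations use the base unit (seconds), with the unit as a suffix.
	BlockProcessingSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "mynode_block_processing_duration_seconds",
		Help: "Time spent processing a single block.",
	})
)
```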

#### If you expose system metrics, provide a way to disable them. {.p3 #system-metrics-can-be-disabled}

We already run the [Prometheus node exporter][node-exporter] on our hosts.
Exposing that same information from the node software unnecessarily bloats `/metrics` responses,
which puts strain on our bandwidth and storage,
and collecting the information can make the `/metrics` endpoint slow.

[node-exporter]: https://prometheus.io/docs/guides/node-exporter/
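
One possible way to honor this,
sketched in Go with `client_golang`
(the function and the flag it mentions are hypothetical),
is to keep process- and runtime-level collectors behind an opt-in switch on a custom registry:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

// NewRegistry builds the registry backing the /metrics endpoint.
// System-level collectors are only added when explicitly enabled,
// e.g. via a hypothetical --enable-system-metrics flag.
func NewRegistry(enableSystemMetrics bool) *prometheus.Registry {
	registry := prometheus.NewRegistry()
	// Application-specific metrics would be registered here unconditionally.
	if enableSystemMetrics {
		registry.MustRegister(
			collectors.NewGoCollector(),
			collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
		)
	}
	return registry
}
```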

#### Expose the node software version as a metric. {.p3 #version-metric}

For automating rollouts,
for monitoring manual rollouts,
and for observability and troubleshooting in general,
it is useful for us to be able to identify at runtime which version is running.
When you run a single instance this is easy to track externally,
but when you run a dozen nodes,
it’s easy to lose track of which version runs where.
A version metric
(with value `1` and the version string as a label)
is one of the most convenient ways to expose this information.
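
A minimal sketch of this pattern in Go with `client_golang`;
the metric and label names are only a suggestion:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// buildInfo follows the common "info metric" pattern: the value is
// always 1 and the interesting data lives in the labels.
var buildInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "mynode_build_info",
	Help: "Build information about the running node software.",
}, []string{"version"})

func init() {
	prometheus.MustRegister(buildInfo)
	// In a real binary the version string would be injected at build time;
	// the literal here is only for illustration.
	buildInfo.WithLabelValues("1.2.3").Set(1)
}
```

On the `/metrics` endpoint this shows up as something like `mynode_build_info{version="1.2.3"} 1`,
which can also be joined against other series in queries.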

#### Expose the validator identity as a metric. {.p3 #identity-metric}

As with the version,
when managing multiple nodes
it is useful to know at runtime which identity (address or pubkey) runs where,
and a convenient place to expose this is again a Prometheus metric.
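
The same info-metric pattern works for the identity;
again, the names below are hypothetical:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// validatorIdentity uses the same info-metric pattern: a constant value
// of 1, with the data in a label.
var validatorIdentity = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "mynode_validator_identity",
	Help: "The validator identity (address or pubkey) this node runs with.",
}, []string{"identity"})

func init() {
	prometheus.MustRegister(validatorIdentity)
	// The identity would come from the node's keystore or configuration.
	validatorIdentity.WithLabelValues("example-pubkey").Set(1)
}
```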

<!--
TODO: It should *also* be part of the RPC,
cross-reference that after I write the chapter about RPC interface.
-->

## Health

#### Expose an endpoint for health checks. {.p2 #health-endpoint}

For automating restarts and failover,
and for load balancing across RPC nodes,
it is useful to have an endpoint where the node software
reports its own view on whether it is healthy and in sync with the network.
A convenient way to do this is with a `/health` or `/status` http endpoint
on the RPC interface.

Ideally the application should respond on that endpoint
even during the startup phase and report startup progress there.
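
A minimal sketch of such an endpoint in Go;
the JSON fields and the exact status-code policy are invented for illustration,
not a schema we require:

```go
package health

import (
	"encoding/json"
	"net/http"
)

// Status is the node's own view of its health; the fields are invented
// for illustration.
type Status struct {
	Healthy      bool   `json:"healthy"`
	Synced       bool   `json:"synced"`
	Phase        string `json:"phase"` // e.g. "starting", "catching_up", "synced"
	BlocksBehind uint64 `json:"blocks_behind"`
}

// Handler serves /health. It responds even during startup (reporting
// progress in the body) and encodes health in the status code, so load
// balancers and orchestrators can act on it without parsing JSON.
func Handler(current func() Status) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		status := current()
		code := http.StatusOK
		if !status.Healthy || !status.Synced {
			code = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(status)
	})
}
```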

## On-chain metrics

It is essential to have metrics exposed by the node software,
but this can only give us a _local_ view.
[We need to have a _global_ view as well.][monitoring-global]
For example,
a validator may be performing its duties
(such as producing blocks, voting, or attesting),
but end up in a minority network partition
that causes the majority of the network to view the validator as delinquent.

When information about a validator is stored on-chain,
there is a single source of truth about whether the validator performed its duties,
and that fact becomes finalized through consensus.
For example,
for networks that have a known leader assigned to every slot,
whether the block was produced or not is a property of the chain
that all honest nodes agree on.
Some networks additionally store heartbeats or consensus votes on-chain.

We need a way to monitor those on-chain events
to measure our own performance.
This can be built into the node software
(so we can run multiple nodes that monitor each other),
or it can be an external tool that connects to an RPC node
and exposes Prometheus metrics about on-chain events.

[monitoring-global]: ../chorus-one/monitoring-alerting.md#local-and-global-views

#### Provide a way to monitor on-chain metrics. {.p3 #on-chain-monitoring}

Ideally,
we would have Prometheus metrics
about whether a validator identity has been performing its duties,
exposed from an independent place that is not that validator itself.
For most networks such exporters are standalone applications,
but integrating this functionality into the node software can also work.
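
As a sketch of the standalone-exporter approach
(the RPC call, metric names, endpoint, and polling interval below are all hypothetical
and would be network-specific):

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Gauges describing on-chain facts about a validator, observed through
// an RPC node that is independent of the validator itself.
var (
	slotsAssigned = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "chainexporter_validator_slots_assigned",
		Help: "Slots assigned to the validator in the current epoch, per the chain.",
	}, []string{"identity"})
	slotsFilled = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "chainexporter_validator_slots_filled",
		Help: "Assigned slots in which a block was produced, per the chain.",
	}, []string{"identity"})
)

// queryChain stands in for a network-specific RPC call; it is not a real API.
func queryChain(rpcURL, identity string) (assigned, filled float64, err error) {
	// ... perform the RPC request against rpcURL here ...
	return 0, 0, nil
}

func main() {
	prometheus.MustRegister(slotsAssigned, slotsFilled)

	rpcURL := "http://localhost:8545"   // hypothetical RPC endpoint
	identity := "example-validator-key" // hypothetical identity

	go func() {
		for range time.Tick(30 * time.Second) {
			assigned, filled, err := queryChain(rpcURL, identity)
			if err != nil {
				log.Printf("querying chain: %v", err)
				continue
			}
			slotsAssigned.WithLabelValues(identity).Set(assigned)
			slotsFilled.WithLabelValues(identity).Set(filled)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("127.0.0.1:9800", nil))
}
```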

Good monitoring and observability tools are a public good
that benefits all validators.
Observability is a core requirement for us,
but we realize that it may not be a top priority for node software authors.
We are happy to contribute here
and to work with you upstream to improve or develop
open source monitoring solutions
that benefit the wider ecosystem.

## Telemetry

We understand that node software authors
26 changes: 26 additions & 0 deletions src/node-software/release-engineering.md
is an invite-only Discord channel
where many kinds of announcements are shared,
in addition to release announcements.

#### Clearly mark breaking changes. {.p2 #mark-breaking-changes}

When we update to a new version,
we need to know whether any additional action is required from us.
For example, when command-line flags are renamed or removed,
or when the schema of a configuration file changes,
the new node software will fail to start
unless we update our configuration.
To minimize downtime,
we would rather learn about such changes _before_ we perform the update.
Even when the node is able to start,
changes to e.g. metric names or the RPC API still affect us.

Ideally, breaking changes are part of [a changelog](#keep-a-changelog),
clearly highlighted so that they stand out from ordinary changes.
If you don’t keep a changelog,
you can list breaking changes on e.g. the GitHub releases page,
or in the release announcement itself.

#### Clearly announce deadlines. {.p2 #mark-deadlines}

When an update has a deadline,
for example for a hard fork,
clearly state that deadline.
When possible, include both a date/time and a block height,
as well as a URL for where the update is coordinated.
Make sure to [publish the release far enough ahead of the deadline](#publish-headroom).

#### Keep a changelog. {.p3 #keep-a-changelog}

For us node operators,
the first thing we wonder when we see a new release is: