Finish first version of monitoring chapter
ruuda committed May 24, 2024
1 parent 2306721 commit be5c9db
Showing 2 changed files with 45 additions and 16 deletions.
2 changes: 1 addition & 1 deletion src/SUMMARY.md
@@ -16,8 +16,8 @@
- [Software development best practices](node-software/development-practices.md)
- [Build process](node-software/build-process.md)
- [Release engineering](node-software/release-engineering.md)
- [Monitoring](node-software/monitoring.md)
- [Communication channels]() <!-- node-software/communication-channels.md -->
- [Runtime environment]() <!-- node-software/runtime-environment.md -->
- [Hardware requirements]() <!-- node-software/hardware-requirements.md -->
- [Monitoring](node-software/monitoring.md)
- [Operator interface]() <!-- (node-software/operator-interface.md) -->
59 changes: 44 additions & 15 deletions src/node-software/monitoring.md
@@ -1,10 +1,17 @@
# Monitoring

As [we described previously](../chorus-one/monitoring-alerting.md),
at Chorus One we use [Prometheus][prometheus] for monitoring and alerting.
we use [Prometheus][prometheus] for monitoring and alerting.
This is the industry-standard monitoring protocol
that is supported by most software we run.

Exposing metrics is essential for any blockchain project.
Without it, the node software is a black box to us,
and the only thing we could observe is whether the process is still running,
which is not the same as being healthy.
We need to know what’s going on _inside_ that process,
and the standard way of doing that is through logs and Prometheus metrics.

[prometheus]: https://prometheus.io/

## Prometheus
@@ -21,7 +28,7 @@ Alternatively, a dedicated metrics port works fine too.

While the set of metrics is of course application-specific,
blockchain networks generally have a concept of the _block height_.
Note that unless the block height is for a finalized fork,
Unless the block height is for a finalized fork,
block height is generally a [gauge][prometheus-gauge]
and not a [counter][prometheus-counter].
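
To make the gauge-versus-counter distinction concrete, here is a minimal sketch of how a block height could be rendered in the Prometheus text exposition format. The function name `render_block_height` and the application prefix `examplenode` are hypothetical and chosen only for illustration.

```python
def render_block_height(app_prefix: str, height: int) -> str:
    """Render a block height sample in the Prometheus text exposition format.

    The metric is declared as a gauge because the height of a non-finalized
    fork can go down again, for example after a reorg or a rollback.
    """
    name = f"{app_prefix}_block_height"
    lines = [
        f"# HELP {name} Height of the latest block known to this node.",
        f"# TYPE {name} gauge",
        f"{name} {height}",
    ]
    # The exposition format requires the last line to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "examplenode" is a placeholder application prefix.
    print(render_block_height("examplenode", 100), end="")
```

Declaring the metric as a counter would tell Prometheus that the value only ever increases, so a decrease would be interpreted as a counter reset rather than as a switch to a shorter fork.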

@@ -31,16 +38,17 @@ and not a [counter][prometheus-counter].

#### Expose metrics privately. {.p1 #expose-metrics-privately}

While _we_ want to scrape metrics,
we don’t want to expose confidential information to third parties.
We need to scrape metrics internally,
but we don’t want to expose confidential information to third parties.
It should be possible for the HTTP server that serves the `/metrics` endpoint
to listen on a network interface that is not Internet-exposed.
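
As an illustration of a configurable listen address, the following sketch uses only the Python standard library to serve `/metrics` on the loopback interface by default. The names `make_metrics_server`, `MetricsHandler`, and the `examplenode_up` metric are hypothetical; real node software would take the address from its own configuration.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Placeholder metrics payload; a real node would render its current metrics.
METRICS_BODY = b"# TYPE examplenode_up gauge\nexamplenode_up 1\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the /metrics path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(METRICS_BODY)))
        self.end_headers()
        self.wfile.write(METRICS_BODY)

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def make_metrics_server(listen_host="127.0.0.1", listen_port=0):
    """Create a metrics server bound to the given interface.

    The default host is the loopback interface, so the endpoint is not
    reachable from other machines unless the operator opts in. Port 0 asks
    the operating system for a free port; a deployment would pin a port.
    """
    return ThreadingHTTPServer((listen_host, listen_port), MetricsHandler)


if __name__ == "__main__":
    server = make_metrics_server(listen_port=9100)
    server.serve_forever()
```

The important design point is that the bind address is a parameter, so an operator can choose a private interface instead of having the server listen on all interfaces.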

#### Respect Prometheus metric and label naming standards. {.p3 #respect-prometheus-standards}

Prometheus [has an official standard for naming metrics and labels][prometheus-naming].
Following the standard ensures that metrics are self-explanatory and easy to use,
and that our alerting configuration is consistent and uniform. In particular:
and enables us to write alerting configuration that is consistent and uniform.
In particular:

* Prefix the metric with the name of your application.
* Metrics should use base units (bytes and seconds, not kilobytes or milliseconds).
@@ -50,13 +58,34 @@ and that our alerting configuration is consistent and uniform. In particular:

[prometheus-naming]: https://prometheus.io/docs/practices/naming/
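
The two naming rules visible above (application prefix, base units) can be checked mechanically. The sketch below is a hypothetical helper, not part of any Prometheus library; the list of non-base units is a small illustrative sample, and the character pattern is the one Prometheus documents for metric names.

```python
import re

# Character set that Prometheus documents for metric names.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

# Illustrative sample of non-base units; not an exhaustive list.
NON_BASE_UNITS = ("milliseconds", "microseconds", "kilobytes", "megabytes")


def check_metric_name(name: str, app_prefix: str) -> list[str]:
    """Return a list of naming problems; an empty list means none were found."""
    problems = []
    if not METRIC_NAME_RE.fullmatch(name):
        problems.append("contains characters that are not valid in a metric name")
    if not name.startswith(app_prefix + "_"):
        problems.append(f"missing application prefix '{app_prefix}_'")
    # Compare whole underscore-separated words, so "bytes" in a longer
    # word does not trigger a false positive.
    words = name.split("_")
    for unit in NON_BASE_UNITS:
        if unit in words:
            problems.append(f"uses non-base unit '{unit}'")
    return problems


if __name__ == "__main__":
    for candidate in ("examplenode_request_duration_seconds",
                      "request_latency_milliseconds"):
        print(candidate, check_metric_name(candidate, "examplenode"))
```

A check like this could run in a test suite so that naming regressions are caught before a release rather than in an operator's alerting configuration.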

<!-- TODO: Finish this section
## Telemetry
TODO.
* Have a way to disable.
* We are fine to share on incentivized testnets.
* We do not grant SSH access to our infrastructure, period.
-->
## Telemetry

We understand that node software authors
need visibility into how their software runs to inform development
— that is the reason we are publishing this network handbook in the first place.
However, we are subject to legal and compliance requirements,
which mean that we cannot always allow software to phone home.
In particular,
in some cases we are under non-disclosure agreements.

On incentivized testnets we are happy to share telemetry data.
In these cases we only operate our own identity,
and the risk of telemetry exposing confidential information is low.
For mainnets we do not allow telemetry data to be shared.

#### Ensure telemetry can be disabled. {.p2 #telemetry-can-be-disabled}

As described above,
we cannot share some confidential information for legal and compliance reasons.
The easiest way to prevent inadvertently exposing confidential information
is to expose as little information as possible.
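
One possible shape for such a switch is sketched below. The environment variable `EXAMPLENODE_TELEMETRY` and both function names are hypothetical, and defaulting to "off" is an assumption of this sketch rather than a requirement stated above; the requirement is only that an operator can turn telemetry off.

```python
import os


def telemetry_enabled(environ=None) -> bool:
    """Return True only if the operator explicitly enabled telemetry.

    EXAMPLENODE_TELEMETRY is a hypothetical setting used for illustration.
    """
    env = os.environ if environ is None else environ
    value = env.get("EXAMPLENODE_TELEMETRY", "off")
    return value.strip().lower() in ("1", "true", "on")


def maybe_send_telemetry(payload: dict, send, environ=None) -> bool:
    """Call send(payload) if telemetry is enabled; report whether it was sent."""
    if not telemetry_enabled(environ):
        return False
    send(payload)
    return True


if __name__ == "__main__":
    sent = []
    maybe_send_telemetry({"version": "1.0"}, sent.append,
                         {"EXAMPLENODE_TELEMETRY": "off"})
    print(sent)
```

Routing every telemetry call through a single gate like `maybe_send_telemetry` makes it easier to verify that disabling telemetry really stops all outbound reporting.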

## Troubleshooting

In case of bugs that are difficult to reproduce,
we are happy to work with you by sharing relevant information and logs,
trying patches, etc.
**Under no circumstances
does Chorus One grant access to our infrastructure to third parties.**
We definitely do not grant SSH access or other forms of remote access.
If we did,
we would not be able to guarantee the integrity of our infrastructure.
