diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index 55c9128..1686b48 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -16,8 +16,8 @@
  - [Software development best practices](node-software/development-practices.md)
  - [Build process](node-software/build-process.md)
  - [Release engineering](node-software/release-engineering.md)
+ - [Monitoring](node-software/monitoring.md)
  - [Communication channels]()
  - [Runtime environment]()
  - [Hardware requirements]()
- - [Monitoring](node-software/monitoring.md)
  - [Operator interface]()
diff --git a/src/node-software/monitoring.md b/src/node-software/monitoring.md
index 8141a88..f0be6a3 100644
--- a/src/node-software/monitoring.md
+++ b/src/node-software/monitoring.md
@@ -1,10 +1,17 @@
 # Monitoring
 
 As [we described previously](../chorus-one/monitoring-alerting.md),
-at Chorus One we use [Prometheus][prometheus] for monitoring and alerting.
+we use [Prometheus][prometheus] for monitoring and alerting.
 This is the industry-standard monitoring protocol
 that is supported by most software we run.
 
+Exposing metrics is essential for any blockchain project.
+Without it, the node software is a black box to us,
+and the only thing we could observe is whether the process is still running,
+which is not the same as being healthy.
+We need to know what’s going on _inside_ that process,
+and the standard way of doing that is through logs and Prometheus metrics.
+
 [prometheus]: https://prometheus.io/
 
 ## Prometheus
@@ -21,7 +28,7 @@ Alternatively, a dedicated metrics port works fine too.
 
 While the set of metrics is of course application-specific,
 blockchain networks generally have a concept of the _block height_.
-Note that unless the block height is for a finalized fork,
+Unless the block height is for a finalized fork,
 block height is generally a [gauge][prometheus-gauge]
 and not a [counter][prometheus-counter].
 
@@ -31,8 +38,8 @@ and not a [counter][prometheus-counter].
 
 #### Expose metrics privately. {.p1 #expose-metrics-privately}
-While _we_ want to scape metrics,
-we don’t want to expose confidential information to third parties.
+We need to scrape metrics internally,
+but we don’t want to expose confidential information to third parties.
 It should be possible for the http server that serves the `/metrics` endpoint
 to listen on a network interface that is not Internet-exposed.
 
@@ -40,7 +47,8 @@ to listen on a network interface that is not Internet-exposed.
 Prometheus [has an official standard for naming metrics and labels][prometheus-naming].
 Following the standard ensures that metrics are self-explanatory and easy to use,
-and that our alerting configuration is consistent and uniform. In particular:
+and enables us to write alerting configuration that is consistent and uniform.
+In particular:
 
 * Prefix the metric with the name of your application.
 * Metrics should use base units (bytes and seconds, not kilobytes or milliseconds).
@@ -50,13 +58,34 @@ and that our alerting configuration is consistent and uniform. In particular:
 
 [prometheus-naming]: https://prometheus.io/docs/practices/naming/
 
-
+## Telemetry
+
+We understand that node software authors
+need visibility into how their software runs to inform development
+— that is the reason we are publishing this network handbook in the first place.
+However, we are subject to legal and compliance requirements,
+which mean that we cannot always allow software to phone home.
+In particular,
+in some cases we are under non-disclosure agreements.
+
+On incentivized testnets we are happy to share telemetry data.
+In these cases we only operate our own identity,
+and the risk of telemetry exposing confidential information is low.
+For mainnets we do not allow telemetry data to be shared.
+
+#### Ensure telemetry can be disabled. {.p2 #telemetry-can-be-disabled}
+As described above,
+there is some confidential information that we cannot share for legal and compliance reasons.
+The easiest way to prevent inadvertently exposing confidential information
+is to expose as little information as possible.
+
+## Troubleshooting
+
+In case of bugs that are difficult to reproduce,
+we are happy to work with you by sharing relevant information and logs,
+trying patches, etc.
+**Under no circumstances
+does Chorus One grant access to our infrastructure to third parties.**
+We definitely do not grant SSH access or other forms of remote access.
+If we did,
+we would not be able to guarantee the integrity of our infrastructure.
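
Review note on the Prometheus requirements in this diff: the points about a `/metrics` endpoint on a non-Internet-exposed interface, application-prefixed names in base units, and block height being a gauge could be illustrated with a rough sketch like the one below. It uses only the Python standard library and is not taken from the handbook. The `examplechain` prefix, the three metric names, the placeholder values in `STATE`, and the `make_server` helper are all hypothetical and exist only for illustration.

```python
"""Minimal sketch of a privately exposed Prometheus /metrics endpoint (stdlib only)."""
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical application name, used as the prefix of every metric.
APP = "examplechain"

# Hypothetical placeholder values; a real node would read these from its own state.
STATE = {"block_height": 1200, "blocks_processed": 57, "database_size_bytes": 4096}


def render_metrics(block_height: int, blocks_processed: int, database_size_bytes: int) -> str:
    """Render three metrics in the Prometheus text exposition format."""
    lines = [
        # The height of a non-finalized fork can go down, so it is a gauge.
        f"# HELP {APP}_block_height Height of the current chain tip.",
        f"# TYPE {APP}_block_height gauge",
        f"{APP}_block_height {block_height}",
        # Counters only ever increase and carry the _total suffix.
        f"# HELP {APP}_blocks_processed_total Blocks processed since start.",
        f"# TYPE {APP}_blocks_processed_total counter",
        f"{APP}_blocks_processed_total {blocks_processed}",
        # Base units: bytes, not kilobytes.
        f"# HELP {APP}_database_size_bytes Size of the on-disk database.",
        f"# TYPE {APP}_database_size_bytes gauge",
        f"{APP}_database_size_bytes {database_size_bytes}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on /metrics and nothing else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(**STATE).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging to keep the sketch quiet.
        pass


def make_server(host: str = "127.0.0.1", port: int = 0) -> ThreadingHTTPServer:
    """Bind to the loopback interface by default, so the endpoint is not Internet-exposed.

    port=0 asks the operating system to pick a free port.
    """
    return ThreadingHTTPServer((host, port), MetricsHandler)
```

Calling `make_server(port=9184).serve_forever()` would serve the endpoint on loopback only; the port number is arbitrary. In practice an operator would want the listen address to be configurable, so that it can be set to a private network interface instead of loopback.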