Add detail to tmpnet metrics documentation #2854

Merged 4 commits on Mar 26, 2024
103 changes: 81 additions & 22 deletions tests/fixture/tmpnet/README.md
@@ -234,38 +234,75 @@ The process details of a node are written by avalanchego to
process, the URI of the node's API, and the address other nodes can
use to bootstrap themselves (aka staking address).

## Metrics

### Prometheus configuration

When nodes are started, prometheus configuration for each node is
written to `~/.tmpnet/prometheus/file_sd_configs/` with a filename of
`[network uuid]-[node id].json`. Prometheus can be configured to
scrape the nodes as per the following example:

```yaml
scrape_configs:
  - job_name: "avalanchego"
    metrics_path: "/ext/metrics"
    file_sd_configs:
      - files:
          - '/home/me/.tmpnet/prometheus/file_sd_configs/*.yaml'
```

## Monitoring

Monitoring is an essential part of understanding the workings of a
distributed system such as avalanchego. The tmpnet fixture enables
collection of logs and metrics from temporary networks to a monitoring
stack (prometheus+loki+grafana) to enable results to be analyzed and
shared.

### Example usage

```bash
# Start prometheus to collect metrics
PROMETHEUS_ID=<id> PROMETHEUS_PASSWORD=<password> ./scripts/run_prometheus.sh

# Start promtail to collect logs
LOKI_ID=<id> LOKI_PASSWORD=<password> ./scripts/run_promtail.sh

# Network start emits link to grafana displaying collected logs and metrics
./build/tmpnetctl start-network
```

### Viewing metrics
### Metrics collection

When a node is started, configuration enabling collection of metrics
from the node is written to
`~/.tmpnet/prometheus/file_sd_configs/[network uuid]-[node id].json`.

The `scripts/run_prometheus.sh` script starts prometheus in agent mode
configured to scrape metrics from configured nodes and forward the
metrics to a persistent prometheus instance. The script requires that
the `PROMETHEUS_ID` and `PROMETHEUS_PASSWORD` env vars be set. By
default the prometheus instance at
https://prometheus-experimental.avax-dev.network will be targeted and
this can be overridden via the `PROMETHEUS_URL` env var.
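
The behavior described above can be approximated with a small config generator. This is only a sketch and not the contents of `scripts/run_prometheus.sh`: the function name `write_prometheus_agent_config` is hypothetical, and the `/api/v1/write` remote-write path and the use of basic auth are assumptions about how the persistent instance is exposed. The `job_name` and `metrics_path` values are taken from the earlier scrape example.

```bash
# Hypothetical helper: writes a prometheus config that scrapes the nodes
# discovered via tmpnet's file_sd_configs directory and forwards the
# samples to a remote prometheus instance.
write_prometheus_agent_config() {
  local config_path="$1"
  local sd_dir="${HOME}/.tmpnet/prometheus/file_sd_configs"
  # Default target from the README, overridable via PROMETHEUS_URL.
  local url="${PROMETHEUS_URL:-https://prometheus-experimental.avax-dev.network}"

  # Both credentials are required; abort with a message if either is unset.
  : "${PROMETHEUS_ID:?PROMETHEUS_ID must be set}"
  : "${PROMETHEUS_PASSWORD:?PROMETHEUS_PASSWORD must be set}"

  cat > "${config_path}" <<EOF
scrape_configs:
  - job_name: "avalanchego"
    metrics_path: "/ext/metrics"
    file_sd_configs:
      - files:
          - '${sd_dir}/*.json'

# Assumption: the remote instance accepts remote write at /api/v1/write
# and authenticates with basic auth.
remote_write:
  - url: "${url}/api/v1/write"
    basic_auth:
      username: "${PROMETHEUS_ID}"
      password: "${PROMETHEUS_PASSWORD}"
EOF
}
```

A generated file could then be passed to prometheus via `--config.file`. On prometheus 2.x, agent mode is enabled with `--enable-feature=agent`. Check the actual script for the exact invocation it uses.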

### Log collection

When a network is started with `tmpnet`, a grafana link for the
network's metrics will be emitted.
Node logs are stored at `~/.tmpnet/networks/[network id]/[node
id]/logs` by default, and can optionally be forwarded to loki with
promtail.

The metrics emitted by temporary networks configured with tmpnet will
have the following labels applied:
When a node is started, promtail configuration enabling
collection of logs for the node is written to
`~/.tmpnet/promtail/file_sd_configs/[network
uuid]-[node id].json`.

The `scripts/run_promtail.sh` script starts promtail configured to
collect logs from configured nodes and forward the results to loki. The
script requires that the `LOKI_ID` and `LOKI_PASSWORD` env vars be
set. By default the loki instance at
https://loki-experimental.avax-dev.network will be targeted and this
can be overridden via the `LOKI_URL` env var.
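
Before forwarding logs to loki, it can be useful to confirm that nodes are writing logs where promtail is expected to find them. The following is a minimal sketch that assumes the default layout described above (`[network dir]/[node id]/logs`). `list_node_logs` is a hypothetical helper and not part of tmpnet.

```bash
# Hypothetical helper: prints every log file found under the logs/
# directory of each node in the given network directory.
list_node_logs() {
  local network_dir="$1"
  local log_file
  for log_file in "${network_dir}"/*/logs/*; do
    # An unmatched glob stays literal, so check that the path is a real file.
    if [ -f "${log_file}" ]; then
      printf '%s\n' "${log_file}"
    fi
  done
}
```

For example, `list_node_logs ~/.tmpnet/networks/[network id]` prints one path per log file, or nothing if no node has written logs yet.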

### Labels

The logs and metrics collected for temporary networks will have the
following labels applied:

- `network_uuid`
  - uniquely identifies a network across hosts
- `node_id`
- `is_ephemeral_node`
  - 'ephemeral' nodes are expected to run for only a fraction of the
    life of a network
- `network_owner`
  - an arbitrary string that can be used to differentiate results
    when a CI job runs more than one network

When a tmpnet network runs as part of github CI, the following
When a network runs as part of a GitHub CI job, the following
additional labels will be applied:

- `gh_repo`
@@ -274,3 +311,25 @@ additional labels will be applied:
- `gh_run_number`
- `gh_run_attempt`
- `gh_job_id`

These labels are sourced from GitHub Actions' `github` context as per
https://docs.github.com/en/actions/learn-github-actions/contexts#github-context.
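
Because the same labels are applied to both logs and metrics, one label selector can be used to filter results in prometheus and loki queries. The sketch below builds such a selector from `key=value` pairs. `build_selector` is a hypothetical helper, and the label values shown in the usage note are placeholders rather than values tmpnet is known to emit.

```bash
# Hypothetical helper: turns key=value arguments into a label selector
# of the form {key="value",...} as used by prometheus and loki queries.
build_selector() {
  local selector="" pair key value
  for pair in "$@"; do
    key="${pair%%=*}"    # text before the first '='
    value="${pair#*=}"   # text after the first '='
    # Add a comma separator only when the selector is already non-empty.
    selector="${selector}${selector:+,}${key}=\"${value}\""
  done
  printf '{%s}\n' "${selector}"
}
```

For example, `build_selector network_uuid=abc is_ephemeral_node=false` prints `{network_uuid="abc",is_ephemeral_node="false"}`.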

### Viewing

#### Local networks

When a network is started with tmpnet, a link to the [default grafana
instance](https://grafana-experimental.avax-dev.network) will be
emitted. The dashboards will only be populated if prometheus and
promtail are running locally (as per previous sections) to collect
metrics and logs.

#### CI

Collection of logs and metrics is enabled for CI jobs that use
tmpnet. Each job will execute a step titled `Notify of metrics
availability` that emits a link to grafana parameterized to show results
for the job. Additional links to grafana parameterized to show results
for individual networks will appear in the logs displaying the start of
those networks.