This repository was archived by the owner on Dec 13, 2023. It is now read-only.

Commit 04de28d

Remove hand-written section, extend include and update for Metrics v2
1 parent 059c323 commit 04de28d

File tree

2 files changed: +43 −217 lines changed

3.8/http/administration-and-monitoring-metrics.md

Lines changed: 4 additions & 214 deletions
```diff
@@ -5,234 +5,24 @@ title: arangod Server Metrics
 ---
 # ArangoDB Server Metrics
 
-_arangod_ exports metrics which can be used to monitor the healthiness and
-performance of the system. Out of all exposed metrics the most relevant ones
-are highlighted below. In addition, the thresholds for alerts are described.
+_arangod_ exports metrics in Prometheus format which can be used to monitor
+the healthiness and performance of the system. The thresholds for alerts are also described.
 
 {% hint 'warning' %}
 The list of exposed metrics is subject to change in every minor version.
 While they should stay backwards compatible for the most part, some metrics are
 coupled to specific internals that may be replaced by other mechanisms in the
 future.
-
-The monitoring recommendations below are limited to those metrics that are
-considered future-proof. If you set up your monitoring to use the
-recommendations described here, you can safely upgrade to new versions.
 {% endhint %}
 
```
```diff
-## Cluster Health
-
-This group of metrics is used to measure how healthy the cluster processes
-are and whether they can communicate properly with one another.
-
-### Heartbeats
-
-Heartbeats are a core mechanism in ArangoDB clusters to define the liveliness
-of servers. Every server sends heartbeats to the Agency; if too many heartbeats
-are skipped or cannot be delivered in time, the server is declared dead and a
-failover of the data is triggered.
-By default we expect at least 1 heartbeat per second.
-If a server does not deliver 5 heartbeats in a row (5 seconds without a single
-heartbeat), it is considered dead.
-
-**Metric**
-- `arangodb_heartbeat_send_time_msec`:
-  The time a single heartbeat took to be delivered.
-- `arangodb_heartbeat_failures`:
-  Number of heartbeats which this server failed to deliver.
-
-**Exposed by**
-Coordinator, DB-Server
-
-**Threshold**
-- For `arangodb_heartbeat_send_time_msec`:
-  - Depending on your network latency, we typically expect this to be somewhere
-    below 100ms.
-  - Below 1000ms is still acceptable but not great.
-  - 1000ms - 3000ms is considered critical, but the cluster should still operate;
-    consider contacting our support.
-  - Above 3000ms, expect outages! If any heartbeat fails to be delivered, the
-    server will be flagged as dead and a failover will be triggered. With this
-    timing the failovers will most likely stack up and cause more trouble.
-
-- For `arangodb_heartbeat_failures`:
-  - Typically this should be 0.
-  - Any other value here typically indicates a network hiccup.
-  - If this is constantly growing, the server is somehow undergoing a
-    network split.
-
-**Troubleshoot**
-
-Heartbeats are precious and are sent on the fastest possible path internally. If
-they slow down or cannot be delivered, this can in almost all cases be attributed
-to network issues. If this value is constantly high, please make sure the latency
-between your cluster machines and all Agents is low; it is a lower bound for the
-values achievable here. If the latency is already low, the network might be
-overloaded.
-
```
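The heartbeat thresholds removed above map directly onto alerting levels. A minimal sketch of that mapping (the level names are ours, not ArangoDB's; the numeric boundaries are the ones documented):

```python
def heartbeat_alert_level(send_time_msec):
    """Map an arangodb_heartbeat_send_time_msec observation to an alert level."""
    if send_time_msec < 100:
        return "ok"          # expected with typical network latency
    if send_time_msec < 1000:
        return "acceptable"  # still acceptable but not great
    if send_time_msec <= 3000:
        return "critical"    # cluster still operates; consider contacting support
    return "outage"          # failovers are likely to be triggered and stack up

print(heartbeat_alert_level(80), heartbeat_alert_level(2500))
```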
```diff
-## Agency Plan Sync on DB-Servers
-
-In order to update the data definitions on DB-Servers from the definitions stored
-in the Agency, DB-Servers run a repeated job called Agency Plan Sync. Timings for
-collection and database creation are strongly correlated to the overall runtime
-of this job.
-
-**Metric**
-- `arangodb_maintenance_agency_sync_runtime_msec`:
-  Histogram containing the runtimes of individual runs.
-- `arangodb_maintenance_agency_sync_accum_runtime_msec`:
-  The accumulated runtime of all runs.
-
-**Exposed by**
-DB-Server
-
-**Threshold**
-- For `arangodb_maintenance_agency_sync_runtime_msec`:
-  - This should not exceed 1000ms.
-
-**Troubleshoot**
-
-If the Agency Plan Sync becomes the bottleneck of database and collection
-distribution, consider reducing the number of databases and collections.
-
```
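Because `arangodb_maintenance_agency_sync_accum_runtime_msec` only ever increases, the average runtime per run between two scrapes can be derived from deltas. A sketch, assuming you also track the number of completed runs between scrapes (for a Prometheus histogram this would typically come from its `_count` series):

```python
def avg_sync_runtime_msec(accum_prev, accum_curr, runs_prev, runs_curr):
    """Average Agency Plan Sync runtime between two scrapes, in milliseconds."""
    runs = runs_curr - runs_prev
    if runs <= 0:
        return None  # no runs completed between the two scrapes
    return (accum_curr - accum_prev) / runs

# Example: 5 runs accumulated 900 ms of runtime since the previous scrape
print(avg_sync_runtime_msec(12_000, 12_900, 340, 345))  # → 180.0
```

An average approaching the documented 1000 ms limit would be the signal to reduce the number of databases and collections.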
```diff
-### Shard Distribution
-
-Allows insight into the shard distribution in the cluster and the state of
-replication of the data.
-
-**Metric**
-- `arangodb_shards_out_of_sync`:
-  Number of shards not replicated with their required replication factor
-  (for which this server is the leader)
-- `arangodb_shards_total_count`:
-  Number of shards located on this server (leader _and_ follower shards)
-- `arangodb_shards_leader_count`:
-  Number of shards for which this server is the leader.
-- `arangodb_shards_not_replicated`:
-  Number of shards that are not replicated, i.e. this data is at risk as there
-  is no other copy available.
-
-**Exposed by**
-DB-Server
-
-**Threshold**
-- For `arangodb_shards_out_of_sync`:
-  - Eventually all shards should be in sync, making this value zero.
-  - It can increase when new collections are created or servers are rotated.
-- For `arangodb_shards_total_count` and `arangodb_shards_leader_count`:
-  - These values should be roughly equal for all servers.
-- For `arangodb_shards_not_replicated`:
-  - This value _should_ be zero at all times. If not, you currently have a
-    single point of failure and data is at risk. Please contact our support team.
-  - This can happen if you lose 1 DB-Server with `replicationFactor` 2, if
-    you lose 2 DB-Servers with `replicationFactor` 3, and so on. In these cases
-    the system will try to heal itself, if enough healthy servers remain.
-
-**Troubleshoot**
-
-The distribution of shards should be roughly equal. If not, please consider
-rebalancing shards.
-
```
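"Roughly equal" can be checked mechanically once the per-server gauges are scraped. A sketch, assuming you already have `arangodb_shards_leader_count` and `arangodb_shards_not_replicated` per DB-Server (the 20% spread tolerance is our own choice, not a documented threshold):

```python
def shard_distribution_report(leader_counts, not_replicated, tolerance=0.2):
    """leader_counts: {server: arangodb_shards_leader_count},
    not_replicated: {server: arangodb_shards_not_replicated}.
    Returns a list of human-readable problems (empty when all is well)."""
    problems = []
    if any(v > 0 for v in not_replicated.values()):
        problems.append("data at risk: shards without any replica exist")
    lo, hi = min(leader_counts.values()), max(leader_counts.values())
    if hi > 0 and (hi - lo) / hi > tolerance:
        problems.append("leader shards unevenly distributed; consider rebalancing")
    return problems

print(shard_distribution_report(
    {"dbserver1": 100, "dbserver2": 40, "dbserver3": 95},
    {"dbserver1": 0, "dbserver2": 0, "dbserver3": 0},
))
```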
```diff
-### Scheduler
-
-The Scheduler is responsible for managing growing workloads and distributing
-tasks across the available threads. Whenever more work is available than the
-system can handle, it adjusts the number of threads. The Scheduler maintains an
-internal queue for tasks ready for execution. A constantly growing queue is a
-clear sign of the system reaching its limits.
-
-**Metric**
-- `arangodb_scheduler_queue_length`:
-  Length of the internal task queue.
-- `arangodb_scheduler_awake_threads`:
-  Number of actively working threads.
-- `arangodb_scheduler_num_worker_threads`:
-  Total number of currently available threads.
-
-**Exposed by**
-Coordinator, DB-Server, Agents
-
-**Threshold**
-- For `arangodb_scheduler_queue_length`:
-  - Typically this should be 0.
-  - A non-zero queue length is not a problem as long as it eventually
-    becomes smaller again. This can happen, for example, during load spikes.
-  - A longer queue results in higher latencies, as requests need to
-    wait longer before they are executed.
-  - If the queue runs full, you will eventually get a `queue full` error.
-- For `arangodb_scheduler_num_worker_threads` and
-  `arangodb_scheduler_awake_threads`:
-  - They should increase as load increases.
-  - If the queue length is non-zero for more than a minute, you _should_ see
-    `arangodb_scheduler_awake_threads == arangodb_scheduler_num_worker_threads`.
-    If not, consider contacting our support.
-
-**Troubleshoot**
-
-Queuing requests results in higher latency. If your queue is constantly
-growing, you should consider scaling your system according to your needs.
-Remember to rebalance shards if you scale up DB-Servers.
-
-**Metric**
-- `arangodb_scheduler_queue_full_failures`:
-  Number of times a request/task could not be added to the scheduler queue
-  because the queue was full. If this happens, the corresponding request will
-  be answered with an HTTP 503 ("Service Unavailable") response.
-
-**Exposed by**
-Coordinator, DB-Server, Agents
-
-**Threshold**
-- For `arangodb_scheduler_queue_full_failures`:
-  - This should be 0, as dropping requests is an extremely undesirable event.
-
-**Troubleshoot**
-
-If the number of queue full failures is greater than zero and growing over
-time, it indicates that the server (or one of the servers in a cluster) is
-overloaded and cannot keep up with the workload. There are many potential
-reasons for this, e.g. servers with too little capacity, spiky workloads,
-or network connectivity issues. Whenever this problem occurs, it requires
-further detailed analysis of the root cause.
-
```
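The removed scheduler rule ("queue non-zero for more than a minute, with all workers awake") can be sketched as a check over a window of scrapes. Each sample here is a tuple of the three scheduler gauges; the one-minute window comes from the text, the rest of the shape is our assumption:

```python
def scheduler_saturated(samples):
    """samples: list of (queue_length, awake_threads, num_worker_threads)
    tuples covering at least one minute of scrapes. True when the queue
    never drained and the thread pool was already fully awake throughout."""
    if not samples:
        return False
    queue_never_empty = all(q > 0 for q, _, _ in samples)
    all_threads_awake = all(awake == workers for _, awake, workers in samples)
    return queue_never_empty and all_threads_awake

# One minute of scrapes: queue keeps growing, all 64 workers awake
window = [(120, 64, 64), (180, 64, 64), (250, 64, 64)]
print(scheduler_saturated(window))  # → True
```

A `True` result is the cue to scale the deployment (and rebalance shards afterwards if DB-Servers were added); a saturated queue without all threads awake would instead be a case for contacting support, per the text above.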
```diff
-### Supervision
-
-The supervision is an integral part of the cluster and runs on the leading
-Agent. It is responsible for handling MoveShard jobs and server failures.
-It is intended to run every second, thus its runtime _should_ be below one second.
-
-**Metric**
-- `arangodb_agency_supervision_runtime_msec`:
-  Time in ms of a single supervision run.
-- `arangodb_agency_supervision_runtime_wait_for_replication_msec`:
-  Time the supervision has to wait for its decisions to be committed.
-
-**Exposed by**
-Agents
-
-**Threshold**
-- For `arangodb_agency_supervision_runtime_msec`:
-  - This value should stay below 1000ms. When a DB-Server is rotated,
-    single runs can have a much higher runtime, but this should not be
-    the case in general.
-
-  This value will only increase for the leading Agent.
-
-**Troubleshoot**
-
-If the supervision is not able to run approximately once per second, cluster
-resilience is affected. Please consider contacting our support.
-
```
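The supervision threshold tolerates isolated slow runs (e.g. during a DB-Server rotation) but not a generally elevated runtime. One way to encode that, as a sketch (the 10% spike budget is our assumption, not a documented value):

```python
def supervision_runtime_alarm(runtimes_msec, limit_msec=1000, spike_budget=0.1):
    """Alarm only when more than spike_budget of recent supervision runs
    (arangodb_agency_supervision_runtime_msec) exceed the 1000 ms limit."""
    if not runtimes_msec:
        return False
    slow = sum(1 for r in runtimes_msec if r > limit_msec)
    return slow / len(runtimes_msec) > spike_budget

# A single slow run during a server rotation does not raise an alarm:
print(supervision_runtime_alarm(
    [120, 95, 4500, 110, 130, 90, 105, 98, 101, 99, 100]))  # → False
```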
```diff
 ## Metrics API v2
 
+{% docublock get_admin_metrics_v2 %}
+
 {% include metrics.md %}
 
 ## Metrics API (deprecated)
 
-{% hint 'warning' %}
-The endpoint `GET /_api/metrics` is deprecated from v3.8.0 on and will be
-removed in a future version. Please switch to `GET /_api/metrics/v2`.
-{% endhint %}
-
 <!-- js/actions/api-system.js -->
 {% docublock get_admin_metrics %}
```
_includes/metrics.md

Lines changed: 39 additions & 3 deletions
```diff
@@ -1,3 +1,39 @@
-{% for m in site.data.allMetrics -%}
-- {{ m.description }}
-{% endfor %}
+{% assign groups = site.data.allMetrics | group_by:"category" -%}
+{% for group in groups -%}
+### {{ group.name }}
+
+{% for metric in group.items -%}
+<strong>{{ metric.help }}</strong>
+
+`{{ metric.name }}`
+
+{{ metric.description }}
+
+{% if metric.introducedIn or metric.renamedFrom -%}
+<small>
+{%- if metric.introducedIn %}Introduced in: v{{ metric.introducedIn }}{% endif -%}
+{% if metric.introducedIn and metric.renamedFrom -%}. {% endif -%}
+{% if metric.renamedFrom -%}Renamed from: `{{ metric.renamedFrom }}`{% endif -%}
+</small>
+{%- endif %}
+
+| Type | Unit | Complexity | Exposed by |
+|:-----|:-----|:-----------|:-----------|
+| {{ metric.type }} | {{ metric.unit }} | {{ metric.complexity }} | {{ metric.exposedBy | capitalize_components | join_natural }} |
+
+{% if metric.threshold -%}
+**Threshold:**
+{{ metric.threshold }}
+{% endif -%}
+
+{% if metric.troubleshoot -%}
+**Troubleshoot:**
+{{ metric.troubleshoot }}
+{% endif -%}
+
+{% if forloop.last %}{% else %}
+---
+{% endif -%}
+
+{% endfor -%}
+{% endfor -%}
```
