@@ -5,234 +5,24 @@ title: arangod Server Metrics
---
# ArangoDB Server Metrics

- _arangod_ exports metrics which can be used to monitor the health and
- performance of the system. Out of all exposed metrics, the most relevant ones
- are highlighted below. In addition, the thresholds for alerts are described.
+ _arangod_ exports metrics in Prometheus format which can be used to monitor
+ the health and performance of the system. The thresholds for alerts are also described.

{% hint 'warning' %}
The list of exposed metrics is subject to change in every minor version.
While they should stay backwards compatible for the most part, some metrics are
coupled to specific internals that may be replaced by other mechanisms in the
future.
-
- The monitoring recommendations below are limited to those metrics that are
- considered future-proof. If you set up your monitoring to use the
- recommendations described here, you can safely upgrade to new versions.
{% endhint %}

- ## Cluster Health
-
- This group of metrics is used to measure how healthy the cluster processes
- are and whether they can communicate properly with one another.
-
- ### Heartbeats
-
- Heartbeats are a core mechanism in ArangoDB clusters to determine the
- liveness of servers. Every server sends heartbeats to the agency; if too many
- heartbeats are skipped or cannot be delivered in time, the server is declared
- dead and a failover of its data is triggered.
- By default, we expect at least 1 heartbeat per second.
- If a server does not deliver 5 heartbeats in a row (5 seconds without a
- single heartbeat), it is considered dead.
-
- **Metric**
- - `arangodb_heartbeat_send_time_msec`:
-   The time a single heartbeat took to be delivered.
- - `arangodb_heartbeat_failures`:
-   Number of heartbeats that this server failed to deliver.
-
- **Exposed by**
- Coordinator, DB-Server
-
- **Threshold**
- - For `arangodb_heartbeat_send_time_msec`:
-   - Depending on your network latency, we typically expect this to be
-     somewhere below 100ms.
-   - Below 1000ms is still acceptable, but not great.
-   - 1000ms to 3000ms is considered critical, but the cluster should still
-     operate; consider contacting our support.
-   - Above 3000ms, expect outages! If heartbeats fail to be delivered in time,
-     the server is flagged as dead and a failover is triggered. With these
-     timings, failovers will most likely stack up and cause more trouble.
- - For `arangodb_heartbeat_failures`:
-   - Typically this should be 0.
-   - Any other value here typically indicates a network hiccup.
-   - If this value grows constantly, the server is most likely undergoing a
-     network split.
-
- **Troubleshoot**
-
- Heartbeats are precious and are sent on the fastest possible path internally.
- If they slow down or cannot be delivered, this can in almost all cases be
- attributed to network issues. If this value is constantly high, make sure the
- latency between your cluster machines and all agents is low, as it is a lower
- bound for the values achieved here. If the latency is low but heartbeats are
- still slow, the network might be overloaded.
-
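
Based on these thresholds, alerting can be set up in Prometheus. The following
is a minimal sketch, assuming Prometheus scrapes these metrics and that
`arangodb_heartbeat_send_time_msec` is exposed as a histogram (i.e. with
`_bucket` series); alert names, durations, and severities are illustrative:

```yaml
groups:
  - name: arangodb-heartbeats
    rules:
      - alert: ArangoDBHeartbeatSlow
        # 95th percentile of heartbeat delivery time above 1000ms for 5 minutes
        expr: |
          histogram_quantile(0.95,
            rate(arangodb_heartbeat_send_time_msec_bucket[5m])) > 1000
        for: 5m
        labels:
          severity: warning
      - alert: ArangoDBHeartbeatFailures
        # Any failed heartbeat delivery within the last 5 minutes
        expr: increase(arangodb_heartbeat_failures[5m]) > 0
        labels:
          severity: critical
```
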
- ### Agency Plan Sync on DB-Servers
-
- To update the data definitions on DB-Servers from the definitions stored in
- the agency, DB-Servers have a repeated job called Agency Plan Sync. Timings
- for collection and database creation are strongly correlated with the overall
- runtime of this job.
-
- **Metric**
- - `arangodb_maintenance_agency_sync_runtime_msec`:
-   Histogram containing the runtimes of individual runs.
- - `arangodb_maintenance_agency_sync_accum_runtime_msec`:
-   The accumulated runtime of all runs.
-
- **Exposed by**
- DB-Server
-
- **Threshold**
- - For `arangodb_maintenance_agency_sync_runtime_msec`:
-   - This should not exceed 1000ms.
-
- **Troubleshoot**
-
- If the Agency Plan Sync becomes the bottleneck of database and collection
- distribution, you should consider reducing the number of databases and
- collections.
-
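
The same pattern applies to the Agency Plan Sync threshold; the metric is
described as a histogram above, so a rule sketch could look like this (names
and durations are again illustrative):

```yaml
groups:
  - name: arangodb-agency-plan-sync
    rules:
      - alert: ArangoDBAgencyPlanSyncSlow
        # 95th percentile of Agency Plan Sync runtimes above 1000ms
        expr: |
          histogram_quantile(0.95,
            rate(arangodb_maintenance_agency_sync_runtime_msec_bucket[5m])) > 1000
        for: 10m
        labels:
          severity: warning
```
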
- ### Shard Distribution
-
- Allows insight into the shard distribution in the cluster and the state of
- replication of the data.
-
- **Metric**
- - `arangodb_shards_out_of_sync`:
-   Number of shards not replicated with their required replication factor
-   (for which this server is the leader)
- - `arangodb_shards_total_count`:
-   Number of shards located on this server (leader _and_ follower shards)
- - `arangodb_shards_leader_count`:
-   Number of shards for which this server is the leader.
- - `arangodb_shards_not_replicated`:
-   Number of shards that are not replicated, i.e. this data is at risk as
-   there is no other copy available.
-
- **Exposed by**
- DB-Server
-
- **Threshold**
- - For `arangodb_shards_out_of_sync`:
-   - Eventually all shards should be in sync and this value equal to zero.
-   - It can increase when new collections are created or servers are rotated.
- - For `arangodb_shards_total_count` and `arangodb_shards_leader_count`:
-   - These values should be roughly equal for all servers.
- - For `arangodb_shards_not_replicated`:
-   - This value _should_ be zero at all times. If not, you currently have a
-     single point of failure and data is at risk. Please contact our support
-     team.
-   - This can happen if you lose 1 DB-Server with `replicationFactor` 2, if
-     you lose 2 DB-Servers with `replicationFactor` 3, and so on. In these
-     cases, the system tries to heal itself if enough healthy servers remain.
-
- **Troubleshoot**
-
- The distribution of shards should be roughly equal. If not, please consider
- rebalancing the shards.
-
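
The shard metrics are plain per-server counts, so the thresholds above
translate directly into rules. A sketch (the 30-minute tolerance for
out-of-sync shards is an illustrative choice, not an official recommendation):

```yaml
groups:
  - name: arangodb-shards
    rules:
      - alert: ArangoDBShardsNotReplicated
        # Unreplicated data is a single point of failure; this should be zero
        expr: arangodb_shards_not_replicated > 0
        labels:
          severity: critical
      - alert: ArangoDBShardsOutOfSync
        # Shards may be out of sync temporarily (collection creation, server
        # rotation), but should eventually return to zero
        expr: arangodb_shards_out_of_sync > 0
        for: 30m
        labels:
          severity: warning
```
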
- ### Scheduler
-
- The Scheduler is responsible for managing growing workloads and distributing
- tasks across the available threads. Whenever more work is available than the
- system can handle, it adjusts the number of threads. The Scheduler maintains
- an internal queue for tasks ready for execution. A constantly growing queue
- is a clear sign that the system is reaching its limits.
-
- **Metric**
- - `arangodb_scheduler_queue_length`:
-   Length of the internal task queue.
- - `arangodb_scheduler_awake_threads`:
-   Number of actively working threads.
- - `arangodb_scheduler_num_worker_threads`:
-   Total number of currently available threads.
-
- **Exposed by**
- Coordinator, DB-Server, Agents
-
- **Threshold**
- - For `arangodb_scheduler_queue_length`:
-   - Typically this should be 0.
-   - A non-zero queue length is not a problem as long as it eventually
-     becomes smaller again. This can happen for example during load spikes.
-   - A longer queue results in higher latencies, as requests need to wait
-     longer before they are executed.
-   - If the queue runs full, you will eventually get a `queue full` error.
- - For `arangodb_scheduler_num_worker_threads` and
-   `arangodb_scheduler_awake_threads`:
-   - They should increase as the load increases.
-   - If the queue length is non-zero for more than a minute, you _should_ see
-     `arangodb_scheduler_awake_threads == arangodb_scheduler_num_worker_threads`.
-     If not, consider contacting our support.
-
- **Troubleshoot**
-
- Queuing requests results in higher latency. If your queue is constantly
- growing, you should consider scaling your system according to your needs.
- Remember to rebalance shards if you scale up DB-Servers.
-
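
A rule sketch covering both conditions described above, a queue that does not
drain and a saturated thread pool (durations are illustrative):

```yaml
groups:
  - name: arangodb-scheduler
    rules:
      - alert: ArangoDBSchedulerQueueNotDraining
        # Requests have been waiting in the queue for 10 minutes straight
        expr: arangodb_scheduler_queue_length > 0
        for: 10m
        labels:
          severity: warning
      - alert: ArangoDBSchedulerSaturated
        # Queue is non-empty although all worker threads are already awake
        expr: |
          (arangodb_scheduler_queue_length > 0)
          and (arangodb_scheduler_awake_threads
               == arangodb_scheduler_num_worker_threads)
        for: 5m
        labels:
          severity: critical
```
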
- **Metric**
- - `arangodb_scheduler_queue_full_failures`:
-   Number of times a request/task could not be added to the scheduler queue
-   because the queue was full. If this happens, the corresponding request is
-   answered with an HTTP 503 ("Service Unavailable") response.
-
- **Exposed by**
- Coordinator, DB-Server, Agents
-
- **Threshold**
- - For `arangodb_scheduler_queue_full_failures`:
-   - This should be 0, as dropping requests is an extremely undesirable event.
-
- **Troubleshoot**
-
- If the number of queue full failures is greater than zero and even growing
- over time, it indicates that the server (or one of the servers in a cluster)
- is overloaded and cannot keep up with the workload. There are many potential
- reasons for this, e.g. servers with too little capacity, spiky workloads,
- or network connectivity issues. Whenever this problem occurs, it requires
- further detailed analysis of the root cause.
-
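
Since any dropped request is undesirable, an alert can simply fire on any
increase of this counter; a minimal sketch:

```yaml
groups:
  - name: arangodb-queue-full
    rules:
      - alert: ArangoDBQueueFullFailures
        # Requests were rejected with HTTP 503 within the last 5 minutes
        expr: increase(arangodb_scheduler_queue_full_failures[5m]) > 0
        labels:
          severity: critical
```
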
- ### Supervision
-
- The supervision is an integral part of the cluster and runs on the leading
- agent. It is responsible for handling MoveShard jobs and server failures.
- It is intended to run every second, thus its runtime _should_ be below one
- second.
-
- **Metric**
- - `arangodb_agency_supervision_runtime_msec`:
-   Time in ms of a single supervision run.
- - `arangodb_agency_supervision_runtime_wait_for_replication_msec`:
-   Time the supervision has to wait for its decisions to be committed.
-
- **Exposed by**
- Agents
-
- **Threshold**
- - For `arangodb_agency_supervision_runtime_msec`:
-   - This value should stay below 1000ms. When a DB-Server is rotated, single
-     runs can have a much higher runtime, but this should not be the case in
-     general.
-
- This value only increases for the leading agent.
-
- **Troubleshoot**
-
- If the supervision is not able to run approximately once per second, cluster
- resilience is affected. Please consider contacting our support.
-
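
A sketch for the supervision threshold, assuming
`arangodb_agency_supervision_runtime_msec` is exposed as a histogram; since the
value only increases on the leading agent, the rule effectively targets that
agent:

```yaml
groups:
  - name: arangodb-supervision
    rules:
      - alert: ArangoDBSupervisionSlow
        # Sustained supervision runtimes close to one second affect cluster
        # resilience
        expr: |
          histogram_quantile(0.95,
            rate(arangodb_agency_supervision_runtime_msec_bucket[10m])) > 1000
        for: 15m
        labels:
          severity: warning
```
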
## Metrics API v2

+ {% docublock get_admin_metrics_v2 %}
+
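
For example, a Prometheus scrape job for this endpoint could look as follows
(a sketch; the target address is a placeholder and authentication depends on
your deployment):

```yaml
scrape_configs:
  - job_name: arangodb
    # Endpoint serving the metrics in Prometheus text format
    metrics_path: /_admin/metrics/v2
    static_configs:
      - targets: ["localhost:8529"]  # placeholder arangod address
    # With authentication enabled, a JWT can be passed as a bearer token:
    # authorization:
    #   type: Bearer
    #   credentials: <jwt>
```
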
{% include metrics.md %}

## Metrics API (deprecated)

- {% hint 'warning' %}
- The endpoint `GET /_api/metrics` is deprecated from v3.8.0 on and will be
- removed in a future version. Please switch to `GET /_api/metrics/v2`.
- {% endhint %}
-
<!-- js/actions/api-system.js -->
{% docublock get_admin_metrics %}