Commit 690c491

initial dump of cluster admin chapter
1 parent fb6af97 commit 690c491

6 files changed: +685 −15 lines

300_Aggregations/110_docvalues.asciidoc

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-
+[[doc_values]]
=== Doc Values

The default data structure for field data is called _paged-bytes_, and it is

500_Cluster_Admin.asciidoc

Lines changed: 20 additions & 14 deletions
@@ -1,21 +1,27 @@
[[cluster-admin]]
-== Cluster management and monitoring (TODO)
+== Cluster management and monitoring
+include::500_Cluster_Admin/10_intro.asciidoc[]

-This chapter discusses how cluster management and monitoring.
+include::500_Cluster_Admin/15_marvel.asciidoc[]

-=== Cluster health
-.
+include::500_Cluster_Admin/20_health.asciidoc[]

+include::500_Cluster_Admin/30_node_stats.asciidoc[]

-=== Cluster settings
-.
-
-
-=== Nodes stats
-.
-
-
-=== Nodes info
-.
+- management
+- cluster settings
+- dynamically changing logging
+- index settings
+

+- monitoring
+- marvel
+- cluster health
+- cluster stats
+- node stats / node info
+- rejections
+- index stats
+- hot threads
+- pending tasks

+- cat api
500_Cluster_Admin/10_intro.asciidoc

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
Elasticsearch is often deployed as a cluster of nodes. There are a variety of
APIs that let you manage and monitor the cluster itself, rather than interact
with the data stored within the cluster.

As with most functionality in Elasticsearch, there is an over-arching design goal
that tasks should be performed through an API rather than by modifying static
configuration files. This becomes especially important as your cluster scales.
Even with a provisioning system (Puppet, Chef, Ansible, etc.), a single HTTP API call
is often simpler than pushing new configurations to hundreds of physical machines.

To that end, this chapter discusses the various APIs that allow you to
dynamically tweak, tune, and configure your cluster. We will also cover a
host of APIs that provide statistics about the cluster itself so that you can
monitor for health and performance.
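
As a sketch of that philosophy, a cluster-wide change boils down to a single request
against the cluster update-settings endpoint. The particular setting shown here
(`discovery.zen.minimum_master_nodes`) is purely illustrative and not part of the
original text:

[source,js]
----
PUT /_cluster/settings
{
    "transient" : {
        "discovery.zen.minimum_master_nodes" : 2 <1>
    }
}
----
<1> Any dynamically updatable setting could go here; `transient` values last until the cluster restarts.

One call like this changes the setting on every node at once, with no configuration
files to edit and no processes to restart.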
500_Cluster_Admin/15_marvel.asciidoc

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
=== Marvel for Monitoring

At the very beginning of the book (<<marvel>>) we encouraged you to install
Marvel, a management and monitoring tool for Elasticsearch, because it would enable
interactive code samples throughout the book.

If you didn't install Marvel then, we encourage you to install it now. This
chapter will introduce a large number of APIs that emit an even larger number
of statistics. These stats track everything from heap memory usage and garbage
collection counts to open file descriptors. These statistics are invaluable
for debugging a misbehaving cluster.

The problem is that these APIs provide a single data point -- the statistic
_right now_. Often you'll want to see historical data too, so that you can
plot a trend. Knowing memory usage at this instant is helpful, but knowing
memory usage _over time_ is much more useful.

Furthermore, the output of these APIs can get truly hairy as your cluster grows.
Once you have a dozen nodes, let alone a hundred, reading through stacks of JSON
becomes very tedious.

Marvel periodically polls these APIs and stores the data back in Elasticsearch.
This allows Marvel to query and aggregate the metrics, then provide interactive
graphs in your browser. There are no proprietary statistics that Marvel exposes;
it uses the same stats APIs that are accessible to you. But it does greatly
simplify the collection and graphing of those statistics.

Marvel is free to use in development, so you should definitely try it out!
500_Cluster_Admin/20_health.asciidoc

Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
=== Cluster Health

An Elasticsearch cluster may consist of a single node with a single index. Or it
may have a hundred data nodes, three dedicated masters, a few dozen client nodes
...all operating on a thousand indices (and tens of thousands of shards).

No matter the scale of the cluster, you'll want a quick way to assess the status
of your cluster. The _Cluster Health_ API fills that role. You can think of it
as a ten-thousand-foot view of your cluster. It can reassure you that everything
is alright, or alert you to a problem somewhere in your cluster.

Let's execute a Cluster Health request and see what the response looks like:

[source,bash]
----
GET _cluster/health
----

Like other APIs in Elasticsearch, Cluster Health will return a JSON response.
This makes it convenient to parse for automation and alerting. The response
contains some critical information about your cluster:

[source,js]
----
{
   "cluster_name": "elasticsearch_zach",
   "status": "green",
   "timed_out": false,
   "number_of_nodes": 1,
   "number_of_data_nodes": 1,
   "active_primary_shards": 10,
   "active_shards": 10,
   "relocating_shards": 0,
   "initializing_shards": 0,
   "unassigned_shards": 0
}
----

The most important piece of information in the response is the `"status"` field.
The status may be one of three values:

- *Green:* all primary and replica shards are allocated. Your cluster is 100%
operational.
- *Yellow:* all primary shards are allocated, but at least one replica is missing.
No data is missing, so search results will still be complete. However, your
high availability is compromised to some degree. If _more_ shards disappear, you
might lose data. Think of Yellow as a warning that should prompt investigation.
- *Red:* at least one primary shard (and all of its replicas) is missing. This means
that you are missing data: searches will return partial results, and indexing
into that shard will return an exception.

The Green/Yellow/Red status is a great way to glance at your cluster and understand
what's going on. The rest of the metrics give you a general summary of your cluster:

- `number_of_nodes` and `number_of_data_nodes` are fairly self-descriptive.
- `active_primary_shards` is the number of primary shards in your cluster. This
is an aggregate total across all indices.
- `active_shards` is an aggregate total of _all_ shards across all indices, which
includes replica shards.
- `relocating_shards` shows the number of shards that are currently moving from
one node to another node. This number is often zero, but can increase when
Elasticsearch decides a cluster is not properly balanced, a new node is added,
a node is taken down, etc.
- `initializing_shards` is a count of shards that are being freshly created. For
example, when you first create an index, the shards will all briefly reside in
the "initializing" state. This is typically a transient event, and shards shouldn't
linger in "initializing" too long. You may also see initializing shards when a
node is first restarted: as shards are loaded from disk, they start as "initializing".
- `unassigned_shards` are shards that exist in the cluster state, but cannot be
found in the cluster itself. A common source of unassigned shards is unassigned
replicas. For example, an index with 5 shards and 1 replica will have 5 unassigned
replicas in a single-node cluster. Unassigned shards will also be present if your
cluster is red (since primaries are missing).
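
Because the response is compact JSON, it is easy to fold into an automated check.
The following is only a minimal sketch of such a check; the host/port and the
alerting behavior are assumptions, not part of the original text:

[source,bash]
----
#!/usr/bin/env bash
# Sketch of a health check: fetch cluster health (assumes a node listening on
# localhost:9200) and complain if the status is anything other than green.
status=$(curl -s 'localhost:9200/_cluster/health' | sed -e 's/.*"status":"\([^"]*\)".*/\1/')

if [ "$status" != "green" ]; then
    echo "Cluster status is ${status} -- time to investigate" >&2
    exit 1
fi
----

In practice you would feed this into whatever monitoring system you already run;
the point is simply that the `"status"` field is trivially machine-readable.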

==== Drilling deeper: finding problematic indices

Imagine something goes wrong one day, and you notice that your cluster health
looks like this:

[source,js]
----
{
   "cluster_name": "elasticsearch_zach",
   "status": "red",
   "timed_out": false,
   "number_of_nodes": 8,
   "number_of_data_nodes": 8,
   "active_primary_shards": 90,
   "active_shards": 180,
   "relocating_shards": 0,
   "initializing_shards": 0,
   "unassigned_shards": 20
}
----

OK, so what can we deduce from this health status? Well, our cluster is Red,
which means we are missing data (primary + replicas). We know our cluster has
ten nodes, but see only eight data nodes listed in the health. Two of our nodes
have gone missing. We also see that there are 20 unassigned shards.

That's about all the information we can glean. The nature of those missing
shards is still a mystery. Are we missing 20 indices with one primary shard each?
One index with 20 primary shards? Ten indices with one primary + one replica?
Which index?

To answer these questions, we need to ask Cluster Health for a little more
information by using the `level` parameter:

[source,bash]
----
GET _cluster/health?level=indices
----

This parameter makes the Cluster Health API add a list of the indices in our
cluster, along with details about each of those indices (status, number of shards,
unassigned shards, etc.):

[source,js]
----
{
   "cluster_name": "elasticsearch_zach",
   "status": "red",
   "timed_out": false,
   "number_of_nodes": 8,
   "number_of_data_nodes": 8,
   "active_primary_shards": 90,
   "active_shards": 180,
   "relocating_shards": 0,
   "initializing_shards": 0,
   "unassigned_shards": 20,
   "indices": {
      "v1": {
         "status": "green",
         "number_of_shards": 10,
         "number_of_replicas": 1,
         "active_primary_shards": 10,
         "active_shards": 20,
         "relocating_shards": 0,
         "initializing_shards": 0,
         "unassigned_shards": 0
      },
      "v2": {
         "status": "red", <1>
         "number_of_shards": 10,
         "number_of_replicas": 1,
         "active_primary_shards": 0,
         "active_shards": 0,
         "relocating_shards": 0,
         "initializing_shards": 0,
         "unassigned_shards": 20 <2>
      },
      "v3": {
         "status": "green",
         "number_of_shards": 10,
         "number_of_replicas": 1,
         "active_primary_shards": 10,
         "active_shards": 20,
         "relocating_shards": 0,
         "initializing_shards": 0,
         "unassigned_shards": 0
      },
      ....
   }
}
----
<1> We can now see that the `v2` index is the one that has made the cluster Red.
<2> And it becomes clear that all 20 missing shards are from this index.

Once we ask for the indices output, it becomes immediately clear which index is
having problems: the `v2` index. We also see that the index has 10 primary shards
and one replica, and that all 20 shards are missing. Presumably these 20 shards
were on the two nodes that are missing from our cluster.

The `level` parameter accepts one more option:

[source,bash]
----
GET _cluster/health?level=shards
----

The `shards` option will provide a very verbose output, which lists the status
and location of every shard inside every index. This output is sometimes useful,
but its verbosity can make it difficult to work with. Once you know the index
that is having problems, other APIs that we discuss in this chapter will tend
to be more helpful.
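
As a sketch, assuming the health endpoint can be scoped by placing an index name
in the URL (an assumption here, not shown in the text above), you could limit the
shard-level output to just the problem index from the example:

[source,bash]
----
GET _cluster/health/v2?level=shards
----

That keeps the verbose shard listing confined to the one index you actually care about.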

==== Blocking for status changes

The Cluster Health API has another neat trick that is very useful when building
unit and integration tests, or automated scripts that work with Elasticsearch.
You can specify a `wait_for_status` parameter, which will make the call block
until the status is satisfied. For example:

[source,bash]
----
GET _cluster/health?wait_for_status=green
----

This call will block (that is, not return control to your program) until the cluster
health has turned green, meaning all primary and replica shards have been allocated.
This is very important for automated scripts and tests.

If you create an index, Elasticsearch must broadcast the change in cluster state
to all nodes. Those nodes must initialize those new shards, then respond to the
master that the shards are Started. This process is very fast, but due to network
latency may take 10-20ms.

If you have an automated script that A) creates an index and then B) immediately
attempts to index a document, this operation may fail, since the index has not
been fully initialized yet. The time between A) and B) will likely be <1ms...
not nearly enough time to account for network latency.

Rather than sleeping, just have your script/test call the cluster health with
a `wait_for_status` parameter. As soon as the index is fully created, the cluster
health will change to Green, the call returns control to your script, and you may
begin indexing.

Valid options are `green`, `yellow` and `red`. The call will return when the
requested status (or one "higher") is reached. For example, if you request `yellow`,
a status change to `yellow` or `green` will unblock the call.
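
As a sketch of that pattern (the host, index name, and document shown here are
assumptions made for illustration):

[source,bash]
----
# Create an index, then block until at least all primaries are allocated
# (yellow) before indexing the first document. Host, index name, and the
# document itself are illustrative assumptions.
curl -XPUT 'localhost:9200/my_index'

curl 'localhost:9200/_cluster/health?wait_for_status=yellow'

curl -XPUT 'localhost:9200/my_index/my_type/1' -d '{ "title" : "first document" }'
----

Waiting for `yellow` rather than `green` is often the pragmatic choice here, since
a single-node development cluster with replicas enabled may never reach green at all.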
