|
| 1 | + |
| 2 | +=== Cluster Health |
| 3 | + |
| 4 | +An Elasticsearch cluster may consist of a single node with a single index. Or it |
| 5 | +may have a hundred data nodes, three dedicated masters, and a few dozen client nodes, |
| 6 | +all operating on a thousand indices (and tens of thousands of shards). |
| 7 | + |
| 8 | +No matter the scale of the cluster, you'll want a quick way to assess the status |
| 9 | +of your cluster. The _Cluster Health_ API fills that role. You can think of it |
| 10 | +as a ten-thousand foot view of your cluster. It can reassure you that everything |
| 11 | +is alright, or alert you to a problem somewhere in your cluster. |
| 12 | + |
| 13 | +Let's execute a Cluster Health request and see what the response looks like: |
| 14 | + |
| 15 | +[source,bash] |
| 16 | +---- |
| 17 | +GET _cluster/health |
| 18 | +---- |
| 19 | + |
| 20 | +Like other APIs in Elasticsearch, Cluster Health will return a JSON response. |
| 21 | +This makes it convenient to parse for automation and alerting. The response |
| 22 | +contains some critical information about your cluster: |
| 23 | + |
| 24 | +[source,js] |
| 25 | +---- |
| 26 | +{ |
| 27 | + "cluster_name": "elasticsearch_zach", |
| 28 | + "status": "green", |
| 29 | + "timed_out": false, |
| 30 | + "number_of_nodes": 1, |
| 31 | + "number_of_data_nodes": 1, |
| 32 | + "active_primary_shards": 10, |
| 33 | + "active_shards": 10, |
| 34 | + "relocating_shards": 0, |
| 35 | + "initializing_shards": 0, |
| 36 | + "unassigned_shards": 0 |
| 37 | +} |
| 38 | +---- |
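
Because the response is plain JSON, turning it into an alert takes only a few lines. The following is a minimal sketch, not part of any Elasticsearch client: the `SEVERITY` mapping and the `alert_level` function are hypothetical names chosen for illustration, and the sample response is the one shown above.

```python
import json

# Sample response copied from the example above.
RESPONSE = """
{
  "cluster_name": "elasticsearch_zach",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 10,
  "active_shards": 10,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0
}
"""

# Hypothetical mapping from cluster status to an alert severity.
SEVERITY = {"green": "ok", "yellow": "warning", "red": "critical"}


def alert_level(raw_response):
    """Parse a Cluster Health response and return an alert severity."""
    health = json.loads(raw_response)
    # Fall back to "unknown" if the status is not one of the three values.
    return SEVERITY.get(health["status"], "unknown")


print(alert_level(RESPONSE))
```

In a real monitoring script the raw response would come from an HTTP call to `_cluster/health` rather than from a string constant.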
| 39 | + |
| 40 | +The most important piece of information in the response is the `"status"` field. |
| 41 | +The status may be one of three values: |
| 42 | + |
| 43 | +- *Green:* all primary and replica shards are allocated. Your cluster is 100% |
| 44 | +operational. |
| 45 | +- *Yellow:* all primary shards are allocated, but at least one replica is missing. |
| 46 | +No data is missing, so search results will still be complete. However, your |
| 47 | +high availability is compromised to some degree. If _more_ shards disappear, you |
| 48 | +might lose data. Think of Yellow as a warning that should prompt investigation. |
| 49 | +- *Red:* at least one primary shard (and all of its replicas) is missing. This means |
| 50 | +that you are missing data: searches will return partial results, and indexing |
| 51 | +into that shard will return an exception. |
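
The three definitions above amount to a simple rule over shard allocation. The sketch below restates that rule in Python purely as an illustration. It is not how Elasticsearch computes the status internally, and the `(is_primary, is_assigned)` tuple representation is an assumption made for this example.

```python
def cluster_status(shards):
    """Derive green/yellow/red from an iterable of (is_primary, is_assigned).

    Hypothetical representation: one tuple per shard copy in the cluster.
    """
    status = "green"
    for is_primary, is_assigned in shards:
        if is_assigned:
            continue
        if is_primary:
            # Any missing primary means missing data.
            return "red"
        # A missing replica only degrades high availability.
        status = "yellow"
    return status


# One primary and one replica, both allocated.
print(cluster_status([(True, True), (False, True)]))
# The replica is unassigned.
print(cluster_status([(True, True), (False, False)]))
# The primary is unassigned.
print(cluster_status([(True, False), (False, False)]))
```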
| 52 | + |
| 53 | +The Green/Yellow/Red status is a great way to glance at your cluster and understand |
| 54 | +what's going on. The rest of the metrics give you a general summary of your cluster: |
| 55 | + |
| 56 | +- `number_of_nodes` and `number_of_data_nodes` are fairly self-descriptive. |
| 57 | +- `active_primary_shards` is the number of primary shards in your cluster. This |
| 58 | +is an aggregate total across all indices. |
| 59 | +- `active_shards` is an aggregate total of _all_ shards across all indices, which |
| 60 | +includes replica shards. |
| 61 | +- `relocating_shards` shows the number of shards that are currently moving from |
| 62 | +one node to another node. This number is often zero, but can increase when |
| 63 | +Elasticsearch decides a cluster is not properly balanced, a new node is added, |
| 64 | +a node is taken down, etc. |
| 65 | +- `initializing_shards` is a count of shards that are being freshly created. For |
| 66 | +example, when you first create an index, the shards will all briefly reside in |
| 67 | +"initializing" state. This is typically a transient event and shards shouldn't |
| 68 | +linger in "initializing" too long. You may also see initializing shards when a |
| 69 | +node is first restarted: as shards are loaded from disk, they start as "initializing". |
| 70 | +- `unassigned_shards` counts shards that exist in the cluster state, but cannot be |
| 71 | +found in the cluster itself. A common source of unassigned shards is unassigned |
| 72 | +replicas. For example, an index with 5 shards and 1 replica will have 5 unassigned |
| 73 | +replicas in a single-node cluster. Unassigned shards will also be present if your |
| 74 | +cluster is red (since primaries are missing). |
| 75 | + |
| 76 | +==== Drilling deeper: finding problematic indices |
| 77 | + |
| 78 | +Imagine something goes wrong one day, and you notice that your cluster health |
| 79 | +looks like this: |
| 80 | + |
| 81 | +[source,js] |
| 82 | +---- |
| 83 | +{ |
| 84 | + "cluster_name": "elasticsearch_zach", |
| 85 | + "status": "red", |
| 86 | + "timed_out": false, |
| 87 | + "number_of_nodes": 8, |
| 88 | + "number_of_data_nodes": 8, |
| 89 | + "active_primary_shards": 90, |
| 90 | + "active_shards": 180, |
| 91 | + "relocating_shards": 0, |
| 92 | + "initializing_shards": 0, |
| 93 | + "unassigned_shards": 20 |
| 94 | +} |
| 95 | +---- |
| 96 | + |
| 97 | +OK, so what can we deduce from this health status? Well, our cluster is Red, |
| 98 | +which means we are missing data (primary + replicas). We know our cluster has |
| 99 | +10 nodes, but we see only 8 data nodes listed in the health. Two of our nodes |
| 100 | +have gone missing. We see that there are 20 unassigned shards. |
| 101 | + |
| 102 | +That's about all the information we can glean. The nature of those missing |
| 103 | +shards is still a mystery. Are we missing 20 indices with one primary shard each? |
| 104 | +One index with 20 primary shards? Ten indices with one primary + one replica? |
| 105 | +Which index? |
| 106 | + |
| 107 | +To answer these questions, we need to ask the Cluster Health API for a little more |
| 108 | +information by using the `level` parameter. |
| 109 | + |
| 110 | +[source,bash] |
| 111 | +---- |
| 112 | +GET _cluster/health?level=indices |
| 113 | +---- |
| 114 | + |
| 115 | +This parameter makes the Cluster Health API add a list of the indices in our |
| 116 | +cluster and details about each of those indices (status, number of shards, |
| 117 | +unassigned shards, etc.): |
| 118 | + |
| 119 | +[source,js] |
| 120 | +---- |
| 121 | +{ |
| 122 | + "cluster_name": "elasticsearch_zach", |
| 123 | + "status": "red", |
| 124 | + "timed_out": false, |
| 125 | + "number_of_nodes": 8, |
| 126 | + "number_of_data_nodes": 8, |
| 127 | + "active_primary_shards": 90, |
| 128 | + "active_shards": 180, |
| 129 | + "relocating_shards": 0, |
| 130 | + "initializing_shards": 0, |
| 131 | + "unassigned_shards": 20, |
| 132 | + "indices": { |
| 133 | + "v1": { |
| 134 | + "status": "green", |
| 135 | + "number_of_shards": 10, |
| 136 | + "number_of_replicas": 1, |
| 137 | + "active_primary_shards": 10, |
| 138 | + "active_shards": 20, |
| 139 | + "relocating_shards": 0, |
| 140 | + "initializing_shards": 0, |
| 141 | + "unassigned_shards": 0 |
| 142 | + }, |
| 143 | + "v2": { |
| 144 | + "status": "red", <1> |
| 145 | + "number_of_shards": 10, |
| 146 | + "number_of_replicas": 1, |
| 147 | + "active_primary_shards": 0, |
| 148 | + "active_shards": 0, |
| 149 | + "relocating_shards": 0, |
| 150 | + "initializing_shards": 0, |
| 151 | + "unassigned_shards": 20 <2> |
| 152 | + }, |
| 153 | + "v3": { |
| 154 | + "status": "green", |
| 155 | + "number_of_shards": 10, |
| 156 | + "number_of_replicas": 1, |
| 157 | + "active_primary_shards": 10, |
| 158 | + "active_shards": 20, |
| 159 | + "relocating_shards": 0, |
| 160 | + "initializing_shards": 0, |
| 161 | + "unassigned_shards": 0 |
| 162 | + }, |
| 163 | + .... |
| 164 | + } |
| 165 | +} |
| 166 | +---- |
| 167 | +<1> We can now see that the `v2` index is the index which has made the cluster Red |
| 168 | +<2> And it becomes clear that all 20 missing shards are from this index |
| 169 | + |
| 170 | +Once we ask for the indices output, it becomes immediately clear which index is |
| 171 | +having problems: the `v2` index. We also see that the index has 10 primary shards |
| 172 | +and one replica, and that all 20 shards are missing. Presumably these 20 shards |
| 173 | +were on the two nodes that are missing from our cluster. |
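
Scanning a long `indices` section by eye gets tedious on a large cluster. As a rough sketch, a script can filter the parsed response down to the indices that are not green. The `problem_indices` function and the trimmed `SAMPLE` dictionary are hypothetical. Only the field names (`indices`, `status`, `unassigned_shards`) come from the response shown above.

```python
# Trimmed version of the level=indices response shown above.
SAMPLE = {
    "status": "red",
    "indices": {
        "v1": {"status": "green", "unassigned_shards": 0},
        "v2": {"status": "red", "unassigned_shards": 20},
        "v3": {"status": "green", "unassigned_shards": 0},
    },
}


def problem_indices(health):
    """Return (name, status, unassigned_shards) for every non-green index."""
    rows = [
        (name, info["status"], info["unassigned_shards"])
        for name, info in health.get("indices", {}).items()
        if info["status"] != "green"
    ]
    # List red indices first, then those with the most unassigned shards.
    order = {"red": 0, "yellow": 1}
    return sorted(rows, key=lambda row: (order.get(row[1], 2), -row[2]))


for name, status, unassigned in problem_indices(SAMPLE):
    print(name, status, unassigned)
```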
| 174 | + |
| 175 | +The `level` parameter accepts one more option: |
| 176 | + |
| 177 | +[source,bash] |
| 178 | +---- |
| 179 | +GET _cluster/health?level=shards |
| 180 | +---- |
| 181 | + |
| 182 | +The `shards` option provides very verbose output, which lists the status |
| 183 | +and location of every shard inside every index. This output is sometimes useful, |
| 184 | +but due to its verbosity it can be difficult to work with. Once you know the index |
| 185 | +that is having problems, other APIs that we discuss in this chapter will tend |
| 186 | +to be more helpful. |
| 187 | + |
| 188 | +==== Blocking for status changes |
| 189 | + |
| 190 | +The Cluster Health API has another neat trick which is very useful when building |
| 191 | +unit and integration tests, or automated scripts that work with Elasticsearch. |
| 192 | +You can specify a `wait_for_status` parameter, which will make the call block |
| 193 | +until the status is satisfied. For example: |
| 194 | + |
| 195 | +[source,bash] |
| 196 | +---- |
| 197 | +GET _cluster/health?wait_for_status=green |
| 198 | +---- |
| 199 | + |
| 200 | +This call will block (i.e., not return control to your program) until the cluster |
| 201 | +health has turned green, meaning all primary + replica shards have been allocated. |
| 202 | +This is very important for automated scripts and tests. |
| 203 | + |
| 204 | +If you create an index, Elasticsearch must broadcast the change in cluster state |
| 205 | +to all nodes. Those nodes must initialize those new shards, then respond to the |
| 206 | +master that the shards are Started. This process is very fast, but due to network |
| 207 | +latency may take 10-20ms. |
| 208 | + |
| 209 | +If you have an automated script that A) creates an index and then B) immediately |
| 210 | +attempts to index a document, this operation may fail since the index has not |
| 211 | +been fully initialized yet. The time between A) and B) will likely be <1ms, |
| 212 | +not nearly enough time to account for network latency. |
| 213 | + |
| 214 | +Rather than sleeping, just have your script/test call the cluster health with |
| 215 | +a `wait_for_status` parameter. As soon as the index is fully created, the cluster |
| 216 | +health will change to Green, the call returns control to your script, and you may |
| 217 | +begin indexing. |
| 218 | + |
| 219 | +Valid options are `green`, `yellow` and `red`. The call will return when the |
| 220 | +requested status (or one "higher") is reached. E.g. if you request `yellow`, |
| 221 | +a status change to `yellow` or `green` will unblock the call. |
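
The "requested status or higher" rule can be written as a simple ordering. The sketch below is illustrative only: the `RANK` table, `unblocks`, and `health_url` are hypothetical helpers, and the base URL is an assumed local node. The example also passes the API's `timeout` parameter, so that a script does not wait indefinitely on a status that is never reached. If the timeout expires first, the response should come back with `timed_out` set to `true`.

```python
from urllib.parse import urlencode

# Ordering of the three statuses, from worst to best.
RANK = {"red": 0, "yellow": 1, "green": 2}


def unblocks(current, requested):
    """True if a cluster in `current` status satisfies wait_for_status=`requested`."""
    return RANK[current] >= RANK[requested]


def health_url(base, wait_for_status, timeout="30s"):
    """Build a Cluster Health URL that blocks until the status is reached."""
    query = urlencode({"wait_for_status": wait_for_status, "timeout": timeout})
    return f"{base}/_cluster/health?{query}"


print(unblocks("green", "yellow"))
print(unblocks("red", "yellow"))
# Assumed address of a local node; adjust for your cluster.
print(health_url("http://localhost:9200", "green"))
```

A test script would issue a GET request to the generated URL after creating an index, then check both `status` and `timed_out` in the response before indexing documents.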
| 222 | + |