Improve pending indexing metrics and back pressure

Currently indexing back pressure is limited to the size of the write queue. This does not effectively reflect the amount of outstanding indexing work for a node. We would like to add new mechanisms which better reflect the amount of outstanding work.

## Target 7.9

### Indexing metrics and back pressure

In 7.9 we are adding metrics about the number of indexing request bytes outstanding at each point in the indexing process (coordinating, primary, and replication). These metrics will be exposed in the node stats API. Additionally, we will introduce a new setting, `indexing_pressure.memory.limit`, which caps the number of indexing bytes that may be outstanding on a node. This setting will default to 10% of the heap. Once 10% of a node's heap is consumed by outstanding indexing bytes, we will start rejecting new coordinating and primary requests.

Since a failed replication operation can fail a replica, we will assign a higher limit of 1.5X `indexing_pressure.memory.limit` to replication bytes. Only replication bytes can trigger this higher limit. So if replication bytes increase to high levels, the node will stop accepting new coordinating and primary operations until the replication workload has dropped.
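The accounting described above can be sketched roughly as follows. This is a minimal illustration and not the Elasticsearch implementation: the class, method, and exception names are hypothetical, and it assumes that coordinating and primary requests are checked against all outstanding bytes (coordinating, primary, and replica), while replica requests are checked only against replica bytes.

```python
class IndexingPressureRejected(Exception):
    """Raised when accepting an operation would exceed a byte limit."""


class IndexingPressure:
    # Hypothetical sketch of the limits described in this issue.
    def __init__(self, heap_bytes, limit_percent=10):
        # Assumed default: 10% of the heap for indexing_pressure.memory.limit.
        self.limit = heap_bytes * limit_percent // 100
        # Replication bytes get 1.5X the limit.
        self.replica_limit = self.limit * 3 // 2
        self.coordinating_and_primary = 0
        self.replica = 0

    def mark_coordinating_or_primary(self, size):
        # New coordinating/primary work is rejected once all outstanding
        # bytes, including replication bytes, would exceed the limit.
        if self.coordinating_and_primary + self.replica + size > self.limit:
            raise IndexingPressureRejected("coordinating/primary limit exceeded")
        self.coordinating_and_primary += size

    def mark_replica(self, size):
        # Only replication bytes count toward the higher 1.5X limit.
        if self.replica + size > self.replica_limit:
            raise IndexingPressureRejected("replica limit exceeded")
        self.replica += size

    def release_coordinating_or_primary(self, size):
        self.coordinating_and_primary -= size

    def release_replica(self, size):
        self.replica -= size
```

With a 1000-byte heap in this sketch, the limit is 100 bytes and the replica limit is 150 bytes, so 120 outstanding replica bytes are accepted but block any new primary operation until they are released.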

- [x] Add metrics about outstanding indexing request bytes at the coordinating, primary, and replica levels (bulk shard, bulk operations, resync, etc) (#57573)
- [x] Use metrics to reject new operations when too many bytes are outstanding (#58885)
- [x] Expose metrics to node stats API (#59247)
- [x] Add tests at the REST layer (#59247)
- [x] Increase write queue size (#59464)
- [x] Update release docs with write queue size change (#59464)
- [x] Consider separating coordinating and primary bytes at the metrics level (#59487)
- [x] Bring #59464 to 8.0 without release docs (#59559)
- [x] Add documentation (#59456)
- [x] Add unit level tests to ensure metrics and rejection logic are correct (#60150)


#### 7.9 Node stats API with human readable enabled

```
      "indexing_pressure": {
        "memory": {
          "current": {
            "combined_coordinating_and_primary": "0b",
            "combined_coordinating_and_primary_in_bytes": 0,
            "coordinating": "0b",
            "coordinating_in_bytes": 0,
            "primary": "0b",
            "primary_in_bytes": 0,
            "replica": "0b",
            "replica_in_bytes": 0,
            "all": "0b",
            "all_in_bytes": 0
          },
          "total": {
            "combined_coordinating_and_primary": "8.1kb",
            "combined_coordinating_and_primary_in_bytes": 8325,
            "coordinating": "8.1kb",
            "coordinating_in_bytes": 8325,
            "primary": "10.4kb",
            "primary_in_bytes": 10725,
            "replica": "0b",
            "replica_in_bytes": 0,
            "all": "8.1kb",
            "all_in_bytes": 8325,
            "coordinating_rejections": 0,
            "primary_rejections": 0,
            "replica_rejections": 0
          }
        }
      }
```
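As a rough illustration of how a client might consume this section of the response, the sketch below summarizes an `indexing_pressure` object that has already been parsed into a Python dict. The function name `summarize` and the `limit_bytes` argument are assumptions made for this example; the field names are taken from the output above, and the byte limit must be supplied by the caller because it does not appear in this part of the response.

```python
def summarize(indexing_pressure, limit_bytes):
    """Summarize one node's indexing pressure stats (hypothetical helper)."""
    memory = indexing_pressure["memory"]
    current = memory["current"]
    total = memory["total"]

    # Outstanding work right now: coordinating/primary plus replica bytes.
    outstanding = (
        current["combined_coordinating_and_primary_in_bytes"]
        + current["replica_in_bytes"]
    )
    # Rejections accumulated since the node started.
    rejections = (
        total["coordinating_rejections"]
        + total["primary_rejections"]
        + total["replica_rejections"]
    )
    return {
        "outstanding_bytes": outstanding,
        "fraction_of_limit": outstanding / limit_bytes,
        "rejections": rejections,
    }


# Trimmed sample using the field names from the output above.
sample = {
    "memory": {
        "current": {
            "combined_coordinating_and_primary_in_bytes": 0,
            "replica_in_bytes": 0,
        },
        "total": {
            "coordinating_rejections": 0,
            "primary_rejections": 0,
            "replica_rejections": 0,
        },
    }
}

# limit_bytes here is an arbitrary illustrative value (100 MiB).
summary = summarize(sample, limit_bytes=100 * 1024 * 1024)
```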

### Replication Retries

To reduce the chance that a transient disruption fails a replica, we will enable replication retries at the primary level. When a replication operation fails because of a connection error, a circuit breaker exception, a rejection, etc., the primary will retry until the new timeout setting (`indices.replication.retry_timeout`) is exhausted.

- [x] Enable replication retries (#55633)
- [ ] Add documentation.
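The retry behavior described in this section can be sketched as a loop bounded by a deadline. This is only an illustration under stated assumptions: `replicate_with_retries`, `TransientReplicationError`, and the exponential backoff parameters are hypothetical and are not taken from the Elasticsearch code; only the idea of retrying until a timeout such as `indices.replication.retry_timeout` elapses comes from the description above.

```python
import time


class TransientReplicationError(Exception):
    """Stand-in for connection errors, circuit breaking, rejections, etc."""


def replicate_with_retries(send, retry_timeout, initial_delay=0.05,
                           max_delay=1.0, clock=time.monotonic,
                           sleep=time.sleep):
    """Call send() until it succeeds or retry_timeout seconds have elapsed."""
    deadline = clock() + retry_timeout
    delay = initial_delay
    while True:
        try:
            return send()
        except TransientReplicationError:
            remaining = deadline - clock()
            if remaining <= 0:
                # Timeout exhausted: surface the last failure to the caller.
                raise
            # Assumed backoff policy: double the delay up to max_delay,
            # never sleeping past the deadline.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay)
```

The `clock` and `sleep` parameters exist so that the loop can be driven by a fake clock in tests; non-transient errors are deliberately not caught and propagate immediately.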

## Target 7.10

- [ ] Evaluate mechanisms for back pressure related to the CPU cost of indexing

Improve pending indexing metrics and back pressure #59263

Tim-Brooks
opened on Jul 8, 2020
