Improved Request Circuit Breaker

In certain circumstances, the request circuit breaker is not blocking requests that are individually fine, but holistically a problem. For example, if you have an aggregation on a very high-cardinality field _and_ you allow the `shard_size` to become `Integer.MAX_VALUE` (either directly, or indirectly by setting it to `0`), then you can create a lot of CPU and network congestion (this is documented behavior).

On a per-request basis, this may be caught and safely blocked. However, for requests that manage to sneak in under the request threshold, I have come across scenarios where I can have multiple in-flight requests that manage to crash the node that handles the request.

In particular, I have seen a client node forced into OOM conditions due parallel aggregations with a lot of shards:

```
/my-index/_search
{
  "aggs": {
    "agg-name":{
      "terms": {
        "field" : "high-cardinality"
        "size" : 10,
        "shard_size" : 0  // THIS IS THE CULPRIT!
      },
      "aggs": {
        // ...
      }
    }
}
```

In this case, an individual shard response was _only_ ~70 MB, but there were many shards. Worse, other aggregations were in-flight at the same time. Eventually the memory became too much, causing the client node (in this case) to drop out due to OOM. I suspect that a similar problem could surface if a data node were forced to handle the initial request.

This is certainly not an easy problem to catch, nor will the solution to it be easy, but hopefully we can figure something out to combat the issue.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improved Request Circuit Breaker #11070

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Improved Request Circuit Breaker #11070

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions