Skip to content

Improved Request Circuit Breaker #11070

Closed
@pickypg

Description

@pickypg

In certain circumstances, the request circuit breaker is not blocking requests that are individually fine, but holistically a problem. For example, if you have an aggregation on a very high-cardinality field and you allow the shard_size to become Integer.MAX_VALUE (either directly, or indirectly by setting it to 0), then you can create a lot of CPU and network congestion (this is documented behavior).

On a per-request basis, this may be caught and safely blocked. However, for requests that manage to sneak in under the request threshold, I have come across scenarios where I can have multiple in-flight requests that manage to crash the node that handles the request.

In particular, I have seen a client node forced into OOM conditions due parallel aggregations with a lot of shards:

/my-index/_search
{
  "aggs": {
    "agg-name":{
      "terms": {
        "field" : "high-cardinality"
        "size" : 10,
        "shard_size" : 0  // THIS IS THE CULPRIT!
      },
      "aggs": {
        // ...
      }
    }
}

In this case, an individual shard response was only ~70 MB, but there were many shards. Worse, other aggregations were in-flight at the same time. Eventually the memory became too much, causing the client node (in this case) to drop out due to OOM. I suspect that a similar problem could surface if a data node were forced to handle the initial request.

This is certainly not an easy problem to catch, nor will the solution to it be easy, but hopefully we can figure something out to combat the issue.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions