
SearchShardFailure handling results in OOM #48910

Closed

Description

@altinp

This is either the same as or closely related to #27596, reported by @mattweber, which was hoped to be fixed in 6.1+ but was closed for lack of feedback in the end. It likely was not fixed.

  • ES 6.6.2 (and 5.6.x)
  • latest OpenJDK 1.8
  • CentOS 7
  • ES heap 31GB
  • log-level: info
    (This is at a client site and I don't have direct access to some details but can obtain if really needed.)

The problem is reproducible in various clusters, ranging from 8 to 50+ nodes, with thousands of shards. As long as a query fails (or, more precisely, is rejected) on each shard, e.g. because it trips too_many_clauses, an OOM will eventually occur when:
free_heap < 2 * serialized_query_string_size * num_shards
where serialized_query_string_size is at least as big as the user-submitted query.
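
For concreteness, a rough back-of-envelope check of that condition in Python (illustrative only, not ES code; the numbers are the ones from the 6.6 test later in this report):

    # Illustrative only: estimates whether collecting one
    # ShardSearchFailure per shard (two query-sized strings each)
    # can exhaust the remaining heap.
    MB = 1024 * 1024

    def failure_footprint_bytes(serialized_query_bytes, num_shards):
        # cause string + reason string, one failure per shard
        return 2 * serialized_query_bytes * num_shards

    footprint = failure_footprint_bytes(serialized_query_bytes=int(11.5 * MB),
                                        num_shards=1061)
    print(f"{footprint / MB / 1024:.1f} GB of failure strings")  # ~23.8 GB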

So this is a case of the remedy being worse than the disease: in ES 6.6 the terms query now allows up to 64K terms (and hence eventual clauses) by default, finally in line with the default terms-lookup limit. But if you write a query with between 1K and 64K separate term clauses, it is the protective mechanism itself that kills the coordinating node (just as in 5.6 with a terms query with more than 1K terms).

I have analyzed the heap dumps and can provide reports. The cause appears to be that ShardSearchFailure holds on to two error strings that each contain the whole pretty-printed query: one inside the cause (QueryShardException) and the other in the reason. In our ES 5.6 test these strings were about the same size as the user query; in the ES 6.6 test, a simple 4 MB user query (on-disk size) such as:

{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "from.email.keyword": "foo1"
          }
        }, 
        {
          "term": {
            "from.email.keyword": "foo2"
          }
...

led to an 11.5 MB in-heap size for cause and reason (each):

2 * 11.5 MB * 1061 shards ≈ 24 GB => OOM

Questions:

  • if the query trips a global limit, can't this be short-circuited at the query-coordinating node before it is sent out to all the data nodes (at least when that node is not coordinating-only)? If not, could the coordinating node detect that one of the incoming failures is of a global-tripwire type and drop the rest?
  • trying to return more than the first x bytes of the user query is clearly not useful
  • we have users and automated clients that can always submit (or programmatically generate!) queries with too many term clauses etc. (see the sketch below)
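
As a hedged illustration of that last point, here is a minimal Python sketch of such a client-side generator (the field name and term count are made up, modeled on the example query above):

    import json

    # Hypothetical generator for a pathological query: N separate
    # term clauses under a bool/should, mirroring the 4 MB example.
    def pathological_query(field="from.email.keyword", num_terms=50_000):
        return {
            "query": {
                "bool": {
                    "should": [
                        {"term": {field: f"foo{i}"}} for i in range(num_terms)
                    ]
                }
            }
        }

    body = json.dumps(pathological_query())
    print(f"{len(body) / 1024 / 1024:.1f} MB on the wire")
    # POSTing this to _search against enough shards trips
    # too_many_clauses on every shard and reproduces the pile-up.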

I will next submit screenshots and reports from the heap analysis.

NOTE: I don't think the issue is specific to logging the error; it lies in preparing the query response to return to the client. The response contains at least one shard failure per index, as can be seen when the query is limited to a small number of indexes/shards and returns without OOM.
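
To make that concrete, a small sketch (assuming the standard _search response shape; the helper name is mine) of where the duplicated text sits in a response that does come back:

    # Rough illustration: sums the query-sized strings carried by the
    # per-shard failures in a parsed _search response dict.
    def failure_text_bytes(search_response):
        total = 0
        for failure in search_response.get("_shards", {}).get("failures", []):
            reason = failure.get("reason", {})
            # The "reason" string embeds the full query text; a nested
            # "caused_by" (QueryShardException in our dumps), if present,
            # repeats it.
            total += len(reason.get("reason", ""))
            total += len(reason.get("caused_by", {}).get("reason", ""))
        return total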
