Description
Elasticsearch version: 1.7.3
JVM version: Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
OS version: Linux ip-XXXXXXX.ec2.internal 4.1.13-19.30.amzn1.x86_64 #1 SMP Fri Dec 11 03:42:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
3 data nodes (7GB heap each) and 3 dedicated master nodes.
1 index with 128 primary shards and 2 replicas; the index holds 600k documents.
Default configuration, except that the search thread pool size was reduced from the default of 13 to 5 threads. We also changed the search queue size from 1000 to 999 (to verify that we could still issue cluster-level settings updates during the test).
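For reference, both changes can be applied at runtime through the cluster update settings API; a request along these lines is roughly what we issued (host and port are placeholders for one of our nodes):

```
# Thread pool settings are dynamically updatable in ES 1.x via cluster settings.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "threadpool.search.size": 5,
    "threadpool.search.queue_size": 999
  }
}'
```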
Steps to reproduce:
We were hitting this issue in production and were able to successfully reproduce it in a staging environment by doing the following:
- spin up 1000 threads, each running the following query (the same query across all threads) in a tight loop (see the sketch after this list)
```
POST index_name/object/_search
{
  "size": 5000,
  "query": {
    "filtered": {
      "query": {
        "filtered": {
          "query": {
            "query_string": {
              "query": "(field1:value1 AND NOT field2:_PREFIX_*)",
              "lowercase_expanded_terms": false
            }
          },
          "filter": {
            "and": {
              "filters": [
                { "term": { "field3": "value2" } },
                { "term": { "type": "MTS" } },
                {
                  "or": {
                    "filters": [
                      { "term": { "active": true } },
                      {
                        "and": {
                          "filters": [
                            { "term": { "active": false } },
                            {
                              "range": {
                                "lastActive": {
                                  "from": 0,
                                  "to": null,
                                  "include_lower": true,
                                  "include_upper": true
                                },
                                "_cache": false
                              }
                            }
                          ]
                        }
                      }
                    ]
                  }
                }
              ]
            }
          }
        }
      },
      "filter": {
        "missing": {
          "field": "_deletedOnMs",
          "null_value": true,
          "existence": true
        }
      }
    }
  },
  "version": false,
  "_source": {
    "includes": [],
    "excludes": ["_props"]
  },
  "sort": [
    { "createdOnMs": { "order": "asc" } }
  ]
}
```
- those queries rapidly fill up the search queue and we start seeing thread pool rejections
- after about 5 minutes, all 3 data nodes enter a zombie state (i.e., they have OOM'ed) and are kicked out of the cluster
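Our actual load generator is a custom client, but a rough shell equivalent of the repro loop would look like this (host, index name, and query.json are placeholders; 1000 background loops stand in for the 1000 client threads):

```
# Sketch of the repro load: 1000 concurrent tight loops, each POSTing the
# search body shown above (saved in query.json) as fast as possible.
for i in $(seq 1 1000); do
  (
    while true; do
      curl -s -XPOST 'http://localhost:9200/index_name/object/_search' \
           -d @query.json -o /dev/null
    done
  ) &
done
wait
```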
Provide logs (if relevant):
Thread dumps:
https://gist.github.com/mahdibh/025d7a909475c43f9154e661c3ef839f
https://gist.github.com/mahdibh/b89a1d437467339a5c30eb416a96cfb5
Node log:
https://gist.github.com/mahdibh/320b515788400fb1560f2da1b1f2897f
Notes
All nodes in the cluster run into the same issue. We took a heap dump of one of the nodes while it was in this state: 41% of the shallow heap size is consumed by byte[] arrays, followed by long[] at 16% and char[] at 8%. We could privately share the heap dump if it helps figure out what's going on.
When this happens, the cluster becomes unusable. Only the master nodes respond to API calls, and the head plugin shows a blank list of nodes.
We can easily reproduce this internally; if there is anything we can do to provide more details, please let us know.