
Elasticsearch nodes run into OOM during sustained ThreadPoolRejections #18230

Closed
@mahdibh

Description


Elasticsearch version: 1.7.3
JVM version: Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
OS version: Linux ip-XXXXXXX.ec2.internal 4.1.13-19.30.amzn1.x86_64 #1 SMP Fri Dec 11 03:42:10 -- UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
3 data nodes (7GB heap), 3 dedicated master nodes.
1 index with 128 primary shards, 2 replicas. Index has 600k documents.
Otherwise default configuration, with the search thread pool size reduced from the default of 13 to 5 threads. We also changed the search queue size from 1000 to 999 (to verify that we could issue cluster-level settings updates during the test).
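For context, a minimal sketch (Python, using the requests library) of the kind of transient cluster-level settings update we mean; the host is hypothetical and the setting names assume the 1.x threadpool.* namespace, so treat this as illustrative rather than the exact commands we ran:

import requests

# Transient cluster-level settings update on ES 1.x (hypothetical localhost
# endpoint; setting names assume the 1.x "threadpool.*" namespace).
resp = requests.put(
    "http://localhost:9200/_cluster/settings",
    json={
        "transient": {
            "threadpool.search.size": 5,         # reduced from the default of 13
            "threadpool.search.queue_size": 999  # changed from 1000 as a sanity check
        }
    },
)
print(resp.status_code, resp.json())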

Steps to reproduce:
We were hitting this issue in production and were able to successfully reproduce it in a staging environment by doing the following:

  1. Spin up 1000 threads running the following query concurrently (the same query across all threads) in a tight loop; a minimal load-generator sketch is shown after these steps.

POST index_name/object/_search
{
  "size": 5000,
  "query": {
    "filtered": {
      "query": {
        "filtered": {
          "query": {
            "query_string": {
              "query": "(field1:value1 AND NOT field2:_PREFIX_*)",
              "lowercase_expanded_terms": false
            }
          },
          "filter": {
            "and": {
              "filters": [
                { "term": { "field3": "value2" } },
                { "term": { "type": "MTS" } },
                {
                  "or": {
                    "filters": [
                      { "term": { "active": true } },
                      {
                        "and": {
                          "filters": [
                            { "term": { "active": false } },
                            {
                              "range": {
                                "lastActive": {
                                  "from": 0,
                                  "to": null,
                                  "include_lower": true,
                                  "include_upper": true
                                },
                                "_cache": false
                              }
                            }
                          ]
                        }
                      }
                    ]
                  }
                }
              ]
            }
          }
        }
      },
      "filter": {
        "missing": {
          "field": "_deletedOnMs",
          "null_value": true,
          "existence": true
        }
      }
    }
  },
  "version": false,
  "_source": {
    "includes": [],
    "excludes": [ "_props" ]
  },
  "sort": [
    { "createdOnMs": { "order": "asc" } }
  ]
}

  2. Those queries rapidly fill up the search queue and we start getting thread pool rejections.
  3. After about 5 minutes, all 3 data nodes get into a zombie state (i.e. OOM'ed) and are kicked out of the cluster.
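For reference, a minimal load-generator sketch along these lines (Python with the requests library; the endpoint is hypothetical and the query body is trimmed, so this is not the exact script we used):

import json
import threading
import requests

ES_URL = "http://localhost:9200/index_name/object/_search"  # hypothetical endpoint

# Trimmed stand-in for the full query shown above.
QUERY = {
    "size": 5000,
    "query": {"query_string": {"query": "(field1:value1 AND NOT field2:_PREFIX_*)"}},
    "sort": [{"createdOnMs": {"order": "asc"}}],
}

def hammer():
    # Tight loop: re-issue the same search as fast as possible.
    while True:
        try:
            requests.post(ES_URL, data=json.dumps(QUERY), timeout=30)
        except requests.RequestException:
            pass  # rejections and timeouts are expected once the queue fills up

threads = [threading.Thread(target=hammer, daemon=True) for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()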

Provide logs (if relevant):
Thread dumps:
https://gist.github.com/mahdibh/025d7a909475c43f9154e661c3ef839f
https://gist.github.com/mahdibh/b89a1d437467339a5c30eb416a96cfb5

Node log:
https://gist.github.com/mahdibh/320b515788400fb1560f2da1b1f2897f

Notes
All nodes in the cluster run into the same issue. We took a heap dump of one of the nodes while it was in this state: 41% of the shallow heap size is consumed by byte arrays, followed by long[] at 16% and char[] at 8%. We could share the heap dump privately if that would help figure out what is going on.

When this happens, the cluster becomes unusable. Only the master nodes respond to API calls, and the head plugin shows an empty list of nodes.

We can easily reproduce this internally; if there is anything we can do to provide more details, please let us know.
