Description
This is either the same as, or closely related to, #27596 reported by @mattweber - which was hoped to be fixed in 6.1+ but was eventually closed for lack of feedback. It likely was not fixed.
- ES 6.6.2 (and 5.6.x)
- latest OpenJDK 1.8
- CentOS 7
- ES heap 31GB
- log-level: info
(This is at a client site and I don't have direct access to some details but can obtain if really needed.)
Problem is reproducible in various clusters, ranging from 8 to 50+ nodes, with thousands of shards. As long as a query fails (or, more precisely, is rejected) on each shard, e.g. because it trips `too_many_clauses`, an OOM will eventually occur when:

    free_heap < 2 * serialized_query_string_size * num_shards

where `serialized_query_string_size` is at least as big as the user-submitted query.
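As a rough back-of-the-envelope check, the condition can be sketched like this (the factor of 2 comes from the heap-dump analysis in this report; the helper name is mine, not an Elasticsearch API):

```python
def failure_heap_pressure(serialized_query_bytes: float, num_shards: int, copies: int = 2) -> float:
    """Estimate heap consumed by per-shard failure strings.

    copies=2 because each ShardSearchFailure holds the pretty-printed
    query twice: once in the cause and once in the reason.
    """
    return copies * serialized_query_bytes * num_shards

# The observed case: ~11.5 MB per failure string, 1061 shards
pressure = failure_heap_pressure(11.5 * 1024**2, 1061)
print(pressure / 1024**3)  # ~23.8 GiB, which exceeded the free heap
```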
So this is a case of the remedy being worse than the disease: in ES 6.6, the terms query now allows up to 64K terms (and hence eventual clauses) by default, finally in line with the default terms-lookup limit. But if you write a query with between 1K and 64K separate term clauses, it is the protective mechanism itself that kills the coordinating node (just as in 5.6 with a terms query with >1K terms).
I have analyzed the heap dumps and can provide reports. The reason seems to be that `ShardSearchFailure` holds on to two error strings that each contain the whole pretty-printed query: one inside the cause (`QueryShardException`) and the other in the `reason`. In our ES 5.6 test, these strings were about the same size as the user query; in the ES 6.6 test, a simple 4 MB user query (on-disk size) such as:
```json
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "from.email.keyword": "foo1"
          }
        },
        {
          "term": {
            "from.email.keyword": "foo2"
          }
        },
        ...
```
led to an 11.5 MB in-heap size for `cause` and `reason` (each):

    2 * 11.5 MB * 1061 shards ~= 24 GB => OOM
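For reproduction, a query of the shape shown above can be generated programmatically; a minimal sketch (the field name and `foo<N>` values are just the ones from the example):

```python
import json

def build_should_terms_query(field: str, num_terms: int) -> dict:
    """Build a bool/should query with one term clause per value."""
    clauses = [{"term": {field: f"foo{i}"}} for i in range(1, num_terms + 1)]
    return {"query": {"bool": {"should": clauses}}}

q = build_should_terms_query("from.email.keyword", 50_000)
body = json.dumps(q)
print(len(body) / 1024**2)  # a few MB on the wire for ~50K clauses
```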
Questions:
- If the query trips a global limit, can't it be short-circuited at the query-coordinating node before being sent out to all data nodes (at least when that is not a coordinating-only node)? If not, could the coordinator detect that one of the incoming failures is of a global-tripwire type and drop the rest?
- Returning more than the first x bytes of the user query in the error message is clearly not useful.
- We have users and automated clients that can always submit (or programmatically generate!) queries with too many term clauses, etc.
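Until this is addressed server-side, one workaround we are considering is a client-side guard that counts clauses before submission. A sketch (the 1024 threshold mirrors the pre-6.6 default; adjust to your cluster's `indices.query.bool.max_clause_count`, and note this only counts `term` clauses, not every clause type Lucene counts):

```python
MAX_CLAUSES = 1024  # mirrors the pre-6.6 default max_clause_count

def count_term_clauses(query) -> int:
    """Recursively count 'term' clauses in a query dict."""
    if isinstance(query, dict):
        return sum(
            (1 if key == "term" else 0) + count_term_clauses(value)
            for key, value in query.items()
        )
    if isinstance(query, list):
        return sum(count_term_clauses(item) for item in query)
    return 0

def check_query(query: dict) -> dict:
    """Reject the query client-side instead of letting every shard reject it."""
    n = count_term_clauses(query)
    if n > MAX_CLAUSES:
        raise ValueError(f"query has {n} term clauses, limit is {MAX_CLAUSES}")
    return query
```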
I will next submit screenshots and reports from the heap analysis.
NOTE: I don't think the issue is specific to logging the error; it lies in preparing the query response to return to the client. That response contains at least one shard failure per index, as can be seen when the query is limited to a small number of indexes/shards and returns without an OOM.