Description
7.7 (via this PR) added a better ability to cancel a search request. However, this resulted in a method to cancel the task being added to a collection on the context searcher. That collection is checked very frequently, and its size can grow without bound. The memory footprint is not the issue; the problem is the number of iterations over the collection during very long-running scroll searches, such as those used by reindex. In testing, this started to show an issue at around 50m documents, and search latency kept increasing as time went on.
Below is a test run reindexing 180m documents, which shows the increase in search latency and the decrease in search rate.
Hot threads will look similar to:
2.9% (29.3ms out of 1s) cpu usage by thread 'elasticsearch[node1][search][T#93]'
2/10 snapshots sharing following 20 elements
app//org.elasticsearch.search.internal.ContextIndexSearcher$MutableQueryTimeout.checkCancelled(ContextIndexSearcher.java:357)
app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:196)
app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:185)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:343)
app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:298)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:150)
app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:485)
app//org.elasticsearch.search.SearchService$$Lambda$5754/0x0000000801a8b040.get(Unknown Source)
app//org.elasticsearch.search.SearchService$$Lambda$5270/0x0000000801a2d040.get(Unknown Source)
app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)
app//org.elasticsearch.action.ActionRunnable$$Lambda$5092/0x00000008019a2840.accept(Unknown Source)
app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
java.base@14.0.1/java.lang.Thread.run(Thread.java:832)
This issue is fixed as of 7.10.0 by #61062 and #46523, which now re-create the searcher on each phase, even for scroll requests. This means the collection no longer grows without bound. The same test as above was run on 7.10.0 and did not show any signs of performance degradation.
For 7.7 through 7.9.x there is an easy workaround for this issue:
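To make the cost difference concrete, here is a minimal sketch in Python of the behaviour described above. It is a simplified model, not Elasticsearch code: the class and function names are hypothetical, and it assumes that each scroll phase registers one more cancellation check and then runs every registered check once.

```python
class CancellationChecksSketch:
    """Hypothetical stand-in for the collection of cancellation checks."""

    def __init__(self):
        self.checks = []  # callables that would raise if the task was cancelled

    def add(self, check):
        self.checks.append(check)

    def check_cancelled(self):
        """Run every registered check and return how many were run."""
        for check in self.checks:
            check()
        return len(self.checks)


def simulate_scroll(phases, reuse_searcher):
    """Count individual check invocations over a scroll of `phases` batches.

    reuse_searcher=True models 7.7 to 7.9.x (one collection for the whole scroll);
    reuse_searcher=False models 7.10.0 (a fresh searcher, and collection, per phase).
    """
    total = 0
    checks = CancellationChecksSketch()
    for _ in range(phases):
        if not reuse_searcher:
            checks = CancellationChecksSketch()
        # Assumption of this model: each phase registers one more check.
        checks.add(lambda: None)
        total += checks.check_cancelled()
    return total
```

Under these assumptions, `simulate_scroll(1000, reuse_searcher=True)` performs 500500 check invocations (quadratic in the number of phases), while `simulate_scroll(1000, reuse_searcher=False)` performs 1000 (linear), which is consistent with latency that keeps rising on long scrolls before 7.10.0.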
PUT _cluster/settings
{
  "persistent": {
    "search.low_level_cancellation": false
  }
}
This prevents that collection from being used at all (also tested to fix the issue).
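For anyone scripting the workaround rather than using the console, the same request could be built with Python's standard library roughly as follows. This is only a sketch: the helper names and the base URL `http://localhost:9200` are assumptions, and a secured cluster would additionally need credentials.

```python
import json
import urllib.request


def build_cancellation_setting_request(base_url, enabled):
    """Build (but do not send) the PUT _cluster/settings request from above."""
    body = {"persistent": {"search.low_level_cancellation": enabled}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def apply_cancellation_setting(base_url, enabled):
    """Send the request; requires a reachable cluster at base_url (assumed)."""
    request = build_cancellation_setting_request(base_url, enabled)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `apply_cancellation_setting("http://localhost:9200", False)` would apply the workaround. Passing `None` serializes to `null`, which in the cluster settings API removes the persistent override again, for example after upgrading to 7.10.0.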