Very large scroll search (i.e. reindex) can gradually slow down  #65780

Closed
@jakelandis


Since 7.7 (via this PR), search requests can be cancelled more reliably. However, that change adds a cancellation check for the task to a collection on the context searcher. That collection is checked very frequently, and its size can grow without bound. The memory footprint is not the issue; the problem is the number of iterations over the collection during very long-running scroll searches, such as those used by reindex. In testing, the problem started to show at around 50m documents, and search latency kept increasing as time went on.
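To make the cost pattern concrete, here is a toy model of the behaviour described above. It is not Elasticsearch code: `LongLivedSearcher`, `addCancellationCheck`, and `invocationsInLastBatch` are hypothetical names, and the sketch assumes that each scroll batch registers one more check on the same searcher and never removes it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Elasticsearch code) of a searcher that keeps every registered cancellation check. */
public class CancellationGrowthSketch {

    /** Hypothetical stand-in for the long-lived context searcher of a scroll. */
    static final class LongLivedSearcher {
        private final List<Runnable> cancellationChecks = new ArrayList<>();

        void addCancellationCheck(Runnable check) {
            cancellationChecks.add(check);
        }

        /** Runs every registered check; the cost is linear in the number registered so far. */
        void checkCancelled() {
            for (Runnable check : cancellationChecks) {
                check.run();
            }
        }
    }

    /** Returns how many check invocations the last of {@code batches} scroll batches performs. */
    static long invocationsInLastBatch(int batches, int checksPerBatch) {
        LongLivedSearcher searcher = new LongLivedSearcher();
        long[] counter = new long[1];
        long lastBatch = 0;
        for (int b = 0; b < batches; b++) {
            // Assumption: one more check is registered per batch and never removed.
            searcher.addCancellationCheck(() -> counter[0]++);
            long before = counter[0];
            for (int i = 0; i < checksPerBatch; i++) {
                searcher.checkCancelled();
            }
            lastBatch = counter[0] - before;
        }
        return lastBatch;
    }

    public static void main(String[] args) {
        System.out.println(invocationsInLastBatch(1, 10));    // first batch
        System.out.println(invocationsInLastBatch(1000, 10)); // 1000th batch
    }
}
```

In this model the work per batch grows linearly with the number of batches already served, which would match a latency that keeps climbing over a long reindex even though each individual check is cheap.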

Below is a test run of 180m documents being reindexed that shows the increase in search latency and the decrease in search rate.

(7.9.1) [image: search latency and search rate over the test run]

Hot threads will look similar to:

  2.9% (29.3ms out of 1s) cpu usage by thread 'elasticsearch[node1][search][T#93]'
     2/10 snapshots sharing following 20 elements
       app//org.elasticsearch.search.internal.ContextIndexSearcher$MutableQueryTimeout.checkCancelled(ContextIndexSearcher.java:357)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:196)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:185)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:343)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:298)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:150)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:485)
       app//org.elasticsearch.search.SearchService$$Lambda$5754/0x0000000801a8b040.get(Unknown Source)
       app//org.elasticsearch.search.SearchService$$Lambda$5270/0x0000000801a2d040.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$5092/0x00000008019a2840.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)

This issue is fixed as of 7.10.0 by #61062 and #46523, which now re-create the searcher on each phase, even for scroll requests. This means the collection no longer grows unbounded. The same test as above was run on 7.10.0 and did not show any signs of performance degradation.
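The effect of re-creating the searcher per phase can be sketched with a small model. This is only an illustration under the assumption that one check is registered per phase; `Searcher` and `checksRunInLastPhase` are hypothetical names, not the classes changed by the PRs above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy comparison (not Elasticsearch code): reused searcher versus one searcher per phase. */
public class PerPhaseSearcherSketch {

    /** Hypothetical searcher holding the cancellation checks registered on it. */
    static final class Searcher {
        private final List<Runnable> cancellationChecks = new ArrayList<>();

        void addCancellationCheck(Runnable check) {
            cancellationChecks.add(check);
        }

        /** Runs all registered checks and returns how many were run. */
        int checkCancelled() {
            for (Runnable check : cancellationChecks) {
                check.run();
            }
            return cancellationChecks.size();
        }
    }

    /** Returns the number of checks a single checkCancelled() call runs in the last phase. */
    static int checksRunInLastPhase(int phases, boolean recreatePerPhase) {
        Searcher searcher = new Searcher();
        int lastPhase = 0;
        for (int p = 0; p < phases; p++) {
            if (recreatePerPhase) {
                searcher = new Searcher(); // fresh searcher, so the list starts empty again
            }
            searcher.addCancellationCheck(() -> { }); // assumption: one check per phase
            lastPhase = searcher.checkCancelled();
        }
        return lastPhase;
    }

    public static void main(String[] args) {
        System.out.println(checksRunInLastPhase(50, false)); // reused searcher
        System.out.println(checksRunInLastPhase(50, true));  // searcher re-created per phase
    }
}
```

With a reused searcher the per-call cost tracks the number of phases so far; with a fresh searcher per phase it stays constant.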

For 7.7 through 7.9.x there is an easy workaround for this issue:

PUT _cluster/settings
{
  "persistent": {
    "search.low_level_cancellation" : false
  }
}

This prevents that collection from being used at all (also tested to fix the issue).
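If the setting has to be applied from code rather than from a console, the same request can be sent with the JDK's built-in HTTP client (Java 11+). This is a minimal sketch: the class name, the `buildRequest` helper, and the default base URL `http://localhost:9200` are assumptions, and it presumes a cluster without authentication or TLS.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Sketch: send the workaround above as a PUT to _cluster/settings. */
public class DisableLowLevelCancellation {

    // Same body as the console request above, as compact JSON.
    static final String BODY = "{\"persistent\":{\"search.low_level_cancellation\":false}}";

    /** Builds the PUT request; baseUrl is an assumption, e.g. http://localhost:9200. */
    static HttpRequest buildRequest(String baseUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(BODY))
                .build();
    }

    public static void main(String[] args) throws Exception {
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:9200";
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
        // Print the status and body so the acknowledgement can be inspected.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Because the setting is stored under `persistent`, it survives a full cluster restart and may be worth reverting after upgrading to a fixed version.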

Labels: :Distributed Indexing/Reindex, :Search/Search, >bug, Team:Distributed (Obsolete), Team:Search