
[Intra-SegmentConcurrentSearch] Change Collectors and CollectorManagers #18854

@expani

Description

Is your feature request related to a problem? Please describe

All the collectors written in OpenSearch today assume that the getLeafCollector(LeafReaderContext) method will only be called once per segment.

The code flow to search a LeafPartition (which can be an entire segment OR a partition of a segment) during ConcurrentSegmentSearch is as follows:

-------- [OpenSearch] --------------
ConcurrentQueryPhaseSearcher#searchWithCollectorManager
          < calls >
-------- [Lucene] --------------
IndexSearcher#search(Query query, CollectorManager<C, T> collectorManager)
          < calls >
IndexSearcher#search(Weight weight, CollectorManager<C, T> collectorManager, C firstCollector)
          < submits a runnable per Slice to thread pool > 
          < Runnable when executed calls >
-------- [OpenSearch] --------------
ContextIndexSearcher#search(LeafReaderContextPartition[] partitions, Weight weight, Collector collector)
          < The LeafReaderContextPartition[] is a leaf slice and executed by a single thread > 
          < Goes over every leaf partition and calls >
ContextIndexSearcher#searchLeaf(LeafReaderContext ctx, int minDocId, int maxDocId, Weight weight, Collector collector)

Within searchLeaf, the core operations of executing a query are performed:

  • Collector#getLeafCollector() : Fetches a leaf collector for the segment partition.
  • Weight#bulkScorer() : Fetch the scorer for this segment partition.
  • BulkScorer#score(leafCollector, LiveDocs, minDocId, maxDocId) : Collect all matching docIds from the given docId range.

With IntraSegmentConcurrentSearch, ContextIndexSearcher#searchLeaf can be called by 2 threads for the same segment, each with a different docId range.

This means the assumption that the getLeafCollector(LeafReaderContext) method will only be called once per segment IS BROKEN.

This task captures all the collectors that break because of this assumption and must be changed for IntraSegmentConcurrentSearch.
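To make the broken assumption concrete, here is a minimal sketch in plain Java with no Lucene dependency. The `Partition` class and `countLeafCollectorCalls` method are hypothetical stand-ins for LeafReaderContextPartition and the per-partition call to Collector#getLeafCollector inside searchLeaf; they are not OpenSearch or Lucene APIs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LeafCollectorCallCount {
    /** Hypothetical stand-in for a leaf partition: a segment ordinal plus a docId range. */
    static final class Partition {
        final int segmentOrd;
        final int minDocId;
        final int maxDocId;

        Partition(int segmentOrd, int minDocId, int maxDocId) {
            this.segmentOrd = segmentOrd;
            this.minDocId = minDocId;
            this.maxDocId = maxDocId;
        }
    }

    /** Counts how many times a leaf collector would be requested for each segment. */
    static Map<Integer, Integer> countLeafCollectorCalls(List<Partition> partitions) {
        Map<Integer, Integer> calls = new HashMap<>();
        for (Partition p : partitions) {
            // Stands in for Collector#getLeafCollector(ctx) being invoked once per partition.
            calls.merge(p.segmentOrd, 1, Integer::sum);
        }
        return calls;
    }

    public static void main(String[] args) {
        // Segment 0 is split into two docId ranges; segment 1 is searched whole.
        List<Partition> partitions = List.of(
            new Partition(0, 0, 500),
            new Partition(0, 500, 1000),
            new Partition(1, 0, 300));
        System.out.println(countLeafCollectorCalls(partitions));
    }
}
```

Any collector that initializes per-segment state in getLeafCollector would see that initialization run twice for segment 0 in this setup.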

Meta #18852

Describe the solution you'd like

The following Collectors/CollectorManagers are broken by IntraSegmentConcurrentSearch:

FilteredCollector

Used when post_filter is present in a search request; it is set by the QueryPhase.

With multiple partitions per segment, the post filter's matching docs are resolved once per partition, i.e. multiple times for the same segment. This needs to be handled.
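One possible way to avoid the repeated work, shown only as a sketch, is to share the filter result across partitions of the same segment. Everything below is an assumption for illustration: `PerSegmentFilterCache` is a hypothetical class, the segment is identified by a plain integer ordinal, and a `java.util.BitSet` stands in for the filter's matching docs. It is not how FilteredCollector is implemented today.

```java
import java.util.BitSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class PerSegmentFilterCache {
    // One entry per segment, shared by all partitions (and threads) of that segment.
    private final ConcurrentHashMap<Integer, BitSet> bitsBySegment = new ConcurrentHashMap<>();
    // Hypothetical callback that evaluates the post filter for a whole segment.
    private final Function<Integer, BitSet> filterEvaluator;

    PerSegmentFilterCache(Function<Integer, BitSet> filterEvaluator) {
        this.filterEvaluator = filterEvaluator;
    }

    /** Returns the filter's matching docs, evaluating the filter at most once per segment. */
    BitSet filterBits(int segmentOrd) {
        // computeIfAbsent is atomic per key, so concurrent partitions do not evaluate twice.
        return bitsBySegment.computeIfAbsent(segmentOrd, filterEvaluator);
    }

    public static void main(String[] args) {
        PerSegmentFilterCache cache = new PerSegmentFilterCache(ord -> {
            System.out.println("evaluating filter for segment " + ord);
            BitSet bits = new BitSet();
            bits.set(ord); // placeholder result
            return bits;
        });
        cache.filterBits(0); // evaluates
        cache.filterBits(0); // second partition of segment 0 reuses the cached bits
        cache.filterBits(1); // evaluates
    }
}
```

The trade-off in this sketch is memory: materializing the filter for the whole segment may cost more than re-evaluating it lazily per partition, so whether caching is worthwhile would need to be measured.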

TotalHitCountCollectorManager

When 2 threads process different partitions of the same segment concurrently, each of them records the segment's hit count in the case where the count can be obtained cheaply (i.e. Weight#count() != -1).

Because that count covers the whole segment, it is added once per partition, and the total hit count is duplicated when results from the different collectors are reduced, leading to wrong results.

Lucene introduced its own version of this collector manager, and it was recently changed specifically to handle IntraSegmentConcurrentSearch.

We need to do the same OR find ways to migrate towards Lucene's version of the CollectorManager.
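The duplication, and one possible guard against it, can be sketched in plain Java. The names here (`PartitionAwareHitCount`, `SegmentCounter`, `naiveTotal`, `guardedTotal`) are hypothetical, and `SegmentCounter` merely stands in for Weight#count(). This is not a description of Lucene's actual fix; it only illustrates the idea that a whole-segment count should be contributed once per segment rather than once per partition.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class PartitionAwareHitCount {
    /** Hypothetical stand-in for Weight#count(): whole-segment count, or -1 if unknown. */
    interface SegmentCounter {
        int count(int segmentOrd);
    }

    // Segments whose whole-segment count has already been contributed; safe across threads.
    private final Set<Integer> countedSegments = ConcurrentHashMap.newKeySet();

    /** Naive behavior: every partition adds the whole-segment count. */
    static int naiveTotal(SegmentCounter counter, int[] partitionSegmentOrds) {
        int total = 0;
        for (int ord : partitionSegmentOrds) {
            int c = counter.count(ord);
            if (c != -1) {
                total += c;
            }
        }
        return total;
    }

    /** Guarded behavior: only the first partition seen for a segment adds the count. */
    int guardedTotal(SegmentCounter counter, int[] partitionSegmentOrds) {
        int total = 0;
        for (int ord : partitionSegmentOrds) {
            int c = counter.count(ord);
            // Set#add returns true only for the first caller per segment.
            // The c == -1 case (collect doc by doc within the range) is not modeled here.
            if (c != -1 && countedSegments.add(ord)) {
                total += c;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        SegmentCounter counter = ord -> 100;  // every segment has 100 matches
        int[] partitions = {0, 0, 1};         // segment 0 is split into two partitions
        System.out.println(naiveTotal(counter, partitions));
        System.out.println(new PartitionAwareHitCount().guardedTotal(counter, partitions));
    }
}
```

With two segments of 100 matches each, the naive sum counts segment 0 twice, while the guarded sum counts each segment once.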

EarlyTerminatingCollectorManager

EarlyTerminatingCollector is used in different cases as follows :

  1. When terminate_after is specified. No handling is needed here because Concurrent Segment Search is disabled in that case.

  2. When track_total_hits is less than Integer.MAX_VALUE but not zero.

  3. In all other cases, it terminates during calls to getLeafCollector as the number of hits after which it should terminate is 0.

With ConcurrentSegmentSearch, we don't implement trackTotalHits accurately: the threshold applies per LeafSlice instead of per Shard, so we can end up collecting up to trackTotalHits * MaxLeafSliceCount hits. IntraSegmentConcurrentSearch will keep the same behavior; making it more accurate can be picked up as a separate issue.
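The over-collection can be illustrated with a small arithmetic sketch. `TrackTotalHitsBound` and its methods are hypothetical helpers, and the per-slice match counts are made-up inputs chosen only to show the effect of each slice terminating independently.

```java
final class TrackTotalHitsBound {
    /** Upper bound on collected hits when each slice applies the threshold on its own. */
    static long worstCaseCollected(int trackTotalHits, int maxLeafSliceCount) {
        return (long) trackTotalHits * maxLeafSliceCount;
    }

    /** Simulates slices that each stop collecting after `threshold` hits. */
    static long simulate(int threshold, int[] matchesPerSlice) {
        long collected = 0;
        for (int matches : matchesPerSlice) {
            // A slice never collects more than the threshold, nor more than it matches.
            collected += Math.min(matches, threshold);
        }
        return collected;
    }

    public static void main(String[] args) {
        int threshold = 1000;
        int[] matchesPerSlice = {5000, 5000, 200};
        System.out.println(simulate(threshold, matchesPerSlice));                  // 2200
        System.out.println(worstCaseCollected(threshold, matchesPerSlice.length)); // 3000
    }
}
```

A per-shard threshold would have stopped at 1000 hits; the per-slice threshold collects 2200 in this example and is bounded by 3000.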

Additional context

Custom collectors in the OpenSearch Project

https://github.com/search?q=org%3Aopensearch-project+%22implements+Collector+%22&type=code

Custom collector managers in the OpenSearch Project

https://github.com/search?q=org%3Aopensearch-project+%22implements%22+%22CollectorManager%22&type=code

Labels: Search, enhancement, lucene
