You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Add support for intra-segment search concurrency (#13542)
This commit introduces support for optionally creating slices that target leaf reader context partitions, which allow them to be searched concurrently. This is good to maximize resource usage when searching force-merged indices, or indices with rather big segments, by parallelizig search execution across subsets of segments being searched.
Note: this commit does not affect default generation of slices. Segments can be partitioned by overriding the `IndexSearcher#slices(List<LeafReaderContext>)` method to plug in ad-hoc slices creation. Moreover, the existing `IndexSearcher#slices` static method now creates segment partitions when the additional `allowSegmentsPartitions` argument is set to `true`.
The overall design of this change is based on the existing search concurrency support that is based on `LeafSlice` and `CollectorManager`. A new `LeafReaderContextPartition` abstraction is introduced, that holds a reference to a `LeafReaderContext` and the range of doc ids it targets. A `LeafSlice` noew targets segment partitions, each identified by a `LeafReaderContext` instance and a range of doc ids. It is possible for a partition to target a whole segment, and for partitions of different segments to be combined into the same leaf slices freely, hence searched by the same thread. It is not possible for multiple partitions of the same segment to be added to the same leaf slice.
Segment partitions are searched concurrently leveraging the existing `BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max)` method, that allows to score a specific subset of documents for a provided
`LeafCollector`, in place of the `BulkScorer#score(LeafCollector collector, Bits acceptDocs)` that would instead score all documents.
## Changes that require migration
The migrate guide has the following new clarifying items around the contract and breaking changes required to support intra-segment concurrency:
- `Collector#getLeafCollector` may be called multiple times for the same leaf across distinct `Collector` instances created by a `CollectorManager`. Logic that relies on `getLeafCollector` being called once per leaf per search needs updating.
- a `Scorer`, `ScorerSupplier` or `BulkScorer` may be requested multiple times for the same leaf
- `IndexSearcher#searchLeaf` change of signature to accept the range of doc ids
- `BulkScorer#score(LeafCollector, BitSet)` is removed in favour of `BulkScorer#score(LeafCollector, BitSet, int, int)`
- static `IndexSearcher#slices` method changed to take a last boolean argument that optionally enables the creation of segment partitions
- `TotalHitCountCollectorManager` now requires that an array of `LeafSlice`s, retrieved via `IndexSearcher#getSlices`,
is provided to its constructor
Note: `DrillSideways` is the only component that does not support intra-segment concurrency and needs considerable work to do so, due to its requirement that the entire set of docs in a segment gets scored in one go.
The default searcher slicing is not affected by this PR, but `LuceneTestCase` now randomly leverages intra-segment concurrency. An additional `newSearcher` method is added that takes a `Concurrency` enum as the last argument in place of the `useThreads` boolean flag. This is important to disable intra-segment concurrency for `DrillSideways` related tests that do support inter-segment concurrency but not intra-segment concurrency.
## Next step
While this change introduces support for intra-segment concurrency, it only sets up the foundations of it. There is still a performance penalty for queries that require segment-level computation ahead of time, such as points/range queries. This is an implementation limitation that we expect to improve in future releases, see #13745.
Additionally, we will need to decide what to do about the lack of support for intra-segment concurrency in `DrillSideways` before we can enable intra-segment slicing by default. See #13753 .
Closes#9721
Use `BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max)` instead. In order to score the
825
+
entire leaf, provide `0` as min and `DocIdSetIterator.NO_MORE_DOCS` as max. `BulkScorer` subclasses that override
826
+
such method need to instead override the method variant that takes the range of doc ids as well as arguments.
827
+
828
+
### CollectorManager#newCollector and Collector#getLeafCollector contract
829
+
830
+
With the introduction of intra-segment query concurrency support, multiple `LeafCollector`s may be requested for the
831
+
same `LeafReaderContext` via `Collector#getLeafCollector(LeafReaderContext)` across the different `Collector` instances
832
+
returned by multiple `CollectorManager#newCollector` calls. Any logic or computation that needs to happen
833
+
once per segment requires specific handling in the collector manager implementation. See `TotalHitCountCollectorManager`
834
+
as an example. Individual collectors don't need to be adapted as a specific `Collector` instance will still see a given
835
+
`LeafReaderContext` once, given that it is not possible to add more than one partition of the same segment to the same
836
+
leaf slice.
837
+
838
+
### Weight#scorer, Weight#bulkScorer and Weight#scorerSupplier contract
839
+
840
+
With the introduction of intra-segment query concurrency support, multiple `Scorer`s, `ScorerSupplier`s or `BulkScorer`s
841
+
may be requested for the same `LeafReaderContext` instance as part of a single search call. That may happen concurrently
842
+
from separate threads each searching a specific doc id range of the segment. `Weight` implementations that rely on the
843
+
assumption that a scorer, bulk scorer or scorer supplier for a given `LeafReaderContext` is requested once per search
844
+
need updating.
845
+
846
+
### Signature of IndexSearcher#searchLeaf changed
847
+
848
+
With the introduction of intra-segment query concurrency support, the `IndexSearcher#searchLeaf(LeafReaderContext ctx, Weight weight, Collector collector)`
849
+
method now accepts two additional int arguments to identify the min/max range of doc ids that will be searched in this
850
+
leaf partition`: IndexSearcher#searchLeaf(LeafReaderContext ctx, int minDocId, int maxDocId, Weight weight, Collector collector)`.
851
+
Subclasses of `IndexSearcher` that call or override the `searchLeaf` method need to be updated accordingly.
852
+
853
+
### Signature of static IndexSearch#slices method changed
854
+
855
+
The static `IndexSearcher#sslices(List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice)`
856
+
method now supports an additional 4th and last argument to optionally enable creating segment partitions:
857
+
`IndexSearcher#slices(List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice, boolean allowSegmentPartitions)`
858
+
859
+
### TotalHitCountCollectorManager constructor
860
+
861
+
`TotalHitCountCollectorManager` now requires that an array of `LeafSlice`s, retrieved via `IndexSearcher#getSlices`,
862
+
is provided to its constructor. Depending on whether segment partitions are present among slices, the manager can
863
+
optimize the type of collectors it creates and exposes via `newCollector`.
0 commit comments