
Commit fafd6af

Add support for intra-segment search concurrency (#13542)
This commit introduces support for optionally creating slices that target leaf reader context partitions, which allows them to be searched concurrently. This helps maximize resource usage when searching force-merged indices, or indices with rather big segments, by parallelizing search execution across subsets of the segments being searched.

Note: this commit does not affect default generation of slices. Segments can be partitioned by overriding the `IndexSearcher#slices(List<LeafReaderContext>)` method to plug in ad-hoc slices creation. Moreover, the existing `IndexSearcher#slices` static method now creates segment partitions when the additional `allowSegmentPartitions` argument is set to `true`.

The overall design of this change is based on the existing search concurrency support that is based on `LeafSlice` and `CollectorManager`. A new `LeafReaderContextPartition` abstraction is introduced, which holds a reference to a `LeafReaderContext` and the range of doc ids it targets. A `LeafSlice` now targets segment partitions, each identified by a `LeafReaderContext` instance and a range of doc ids. It is possible for a partition to target a whole segment, and for partitions of different segments to be combined freely into the same leaf slice, hence searched by the same thread. It is not possible for multiple partitions of the same segment to be added to the same leaf slice.

Segment partitions are searched concurrently by leveraging the existing `BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max)` method, which scores a specific subset of documents for a provided `LeafCollector`, in place of `BulkScorer#score(LeafCollector collector, Bits acceptDocs)`, which would instead score all documents.
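To make the partition idea above concrete, here is a minimal sketch of splitting one segment's doc id space into contiguous ranges. It deliberately does not use Lucene: `PartitionSketch`, `DocRange` and `partition` are hypothetical names that only mirror the "segment plus doc id range" shape of `LeafReaderContextPartition`, and the even-split policy is an assumption made for illustration, not necessarily what `IndexSearcher#slices` does. Ranges are treated as min-inclusive and max-exclusive.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionSketch {

  /** Hypothetical stand-in for a doc id range within one segment; not a Lucene class. */
  static final class DocRange {
    final int minDocId; // inclusive
    final int maxDocId; // exclusive

    DocRange(int minDocId, int maxDocId) {
      this.minDocId = minDocId;
      this.maxDocId = maxDocId;
    }
  }

  /** Splits [0, maxDoc) into at most numPartitions contiguous, non-overlapping ranges. */
  static List<DocRange> partition(int maxDoc, int numPartitions) {
    if (maxDoc <= 0 || numPartitions <= 0) {
      throw new IllegalArgumentException("maxDoc and numPartitions must be positive");
    }
    // Never create more partitions than there are documents.
    int count = Math.min(numPartitions, maxDoc);
    int base = maxDoc / count;
    int remainder = maxDoc % count;
    List<DocRange> ranges = new ArrayList<>(count);
    int start = 0;
    for (int i = 0; i < count; i++) {
      // The first `remainder` partitions take one extra document each.
      int size = base + (i < remainder ? 1 : 0);
      ranges.add(new DocRange(start, start + size));
      start += size;
    }
    return ranges;
  }

  public static void main(String[] args) {
    // A segment with 10 docs split three ways: [0, 4), [4, 7), [7, 10)
    for (DocRange r : partition(10, 3)) {
      System.out.println("[" + r.minDocId + ", " + r.maxDocId + ")");
    }
  }
}
```

Each range could then be assigned to a different slice, as long as no slice receives two ranges of the same segment.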
## Changes that require migration

The migrate guide has the following new clarifying items around the contract and breaking changes required to support intra-segment concurrency:

- `Collector#getLeafCollector` may be called multiple times for the same leaf across distinct `Collector` instances created by a `CollectorManager`. Logic that relies on `getLeafCollector` being called once per leaf per search needs updating.
- a `Scorer`, `ScorerSupplier` or `BulkScorer` may be requested multiple times for the same leaf
- the signature of `IndexSearcher#searchLeaf` changed to accept the range of doc ids
- `BulkScorer#score(LeafCollector, Bits)` is removed in favour of `BulkScorer#score(LeafCollector, Bits, int, int)`
- the static `IndexSearcher#slices` method now takes a trailing boolean argument that optionally enables the creation of segment partitions
- `TotalHitCountCollectorManager` now requires that an array of `LeafSlice`s, retrieved via `IndexSearcher#getSlices`, is provided to its constructor

Note: `DrillSideways` is the only component that does not support intra-segment concurrency and needs considerable work to do so, due to its requirement that the entire set of docs in a segment gets scored in one go.

The default searcher slicing is not affected by this PR, but `LuceneTestCase` now randomly leverages intra-segment concurrency. An additional `newSearcher` method is added that takes a `Concurrency` enum as the last argument in place of the `useThreads` boolean flag. This is important to disable intra-segment concurrency for `DrillSideways` related tests, which support inter-segment concurrency but not intra-segment concurrency.

## Next step

While this change introduces support for intra-segment concurrency, it only sets up the foundations of it. There is still a performance penalty for queries that require segment-level computation ahead of time, such as points/range queries. This is an implementation limitation that we expect to improve in future releases, see #13745.

Additionally, we will need to decide what to do about the lack of support for intra-segment concurrency in `DrillSideways` before we can enable intra-segment slicing by default. See #13753.

Closes #9721
1 parent 942065c commit fafd6af

37 files changed, +962 -216 lines changed

lucene/CHANGES.txt

Lines changed: 8 additions & 0 deletions
@@ -151,6 +151,14 @@ New Features
 * GITHUB#13592: Take advantage of the doc value skipper when it is primary sort in SortedNumericDocValuesRangeQuery
   and SortedSetDocValuesRangeQuery. (Ignacio Vera)

+* GITHUB#13542: Add initial support for intra-segment concurrency. IndexSearcher now supports searching across leaf
+  reader partitions concurrently. This is useful to max out available resource usage especially with force merged
+  indices or big segments. There is still a performance penalty for queries that require segment-level computation
+  ahead of time, such as points/range queries. This is an implementation limitation that we expect to improve in
+  future releases, and that's why intra-segment slicing is not enabled by default, but leveraged in tests when the
+  searcher is created via LuceneTestCase#newSearcher. Users may override IndexSearcher#slices(List) to optionally
+  create slices that target segment partitions. (Luca Cavanna)
+
 Improvements
 ---------------------

lucene/MIGRATE.md

Lines changed: 45 additions & 2 deletions
@@ -816,5 +816,48 @@ both `TopDocs` as well as facets results included in a reduced `FacetsCollector`

 ### `SearchWithCollectorTask` no longer supports the `collector.class` config parameter

-`collector.class` used to allow users to load a custom collector implementation. `collector.manager.class`
-replaces it by allowing users to load a custom collector manager instead.
+`collector.class` used to allow users to load a custom collector implementation. `collector.manager.class`
+replaces it by allowing users to load a custom collector manager instead.
+
+### BulkScorer#score(LeafCollector collector, Bits acceptDocs) removed
+
+Use `BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max)` instead. In order to score the
+entire leaf, provide `0` as min and `DocIdSetIterator.NO_MORE_DOCS` as max. `BulkScorer` subclasses that override
+the removed method need to instead override the variant that also takes the range of doc ids as arguments.
+
+### CollectorManager#newCollector and Collector#getLeafCollector contract
+
+With the introduction of intra-segment query concurrency support, multiple `LeafCollector`s may be requested for the
+same `LeafReaderContext` via `Collector#getLeafCollector(LeafReaderContext)` across the different `Collector` instances
+returned by multiple `CollectorManager#newCollector` calls. Any logic or computation that needs to happen
+once per segment requires specific handling in the collector manager implementation. See `TotalHitCountCollectorManager`
+as an example. Individual collectors don't need to be adapted as a specific `Collector` instance will still see a given
+`LeafReaderContext` once, given that it is not possible to add more than one partition of the same segment to the same
+leaf slice.
+
+### Weight#scorer, Weight#bulkScorer and Weight#scorerSupplier contract
+
+With the introduction of intra-segment query concurrency support, multiple `Scorer`s, `ScorerSupplier`s or `BulkScorer`s
+may be requested for the same `LeafReaderContext` instance as part of a single search call. That may happen concurrently
+from separate threads each searching a specific doc id range of the segment. `Weight` implementations that rely on the
+assumption that a scorer, bulk scorer or scorer supplier for a given `LeafReaderContext` is requested once per search
+need updating.
+
+### Signature of IndexSearcher#searchLeaf changed
+
+With the introduction of intra-segment query concurrency support, the `IndexSearcher#searchLeaf(LeafReaderContext ctx, Weight weight, Collector collector)`
+method now accepts two additional int arguments to identify the min/max range of doc ids that will be searched in this
+leaf partition: `IndexSearcher#searchLeaf(LeafReaderContext ctx, int minDocId, int maxDocId, Weight weight, Collector collector)`.
+Subclasses of `IndexSearcher` that call or override the `searchLeaf` method need to be updated accordingly.
+
+### Signature of static IndexSearcher#slices method changed
+
+The static `IndexSearcher#slices(List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice)`
+method now supports an additional 4th and last argument to optionally enable creating segment partitions:
+`IndexSearcher#slices(List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice, boolean allowSegmentPartitions)`
+
+### TotalHitCountCollectorManager constructor
+
+`TotalHitCountCollectorManager` now requires that an array of `LeafSlice`s, retrieved via `IndexSearcher#getSlices`,
+is provided to its constructor. Depending on whether segment partitions are present among slices, the manager can
+optimize the type of collectors it creates and exposes via `newCollector`.
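
The `BulkScorer#score` migration item above says that scoring the whole leaf is the same as scoring the range from `0` to `DocIdSetIterator.NO_MORE_DOCS`. The sketch below models that contract without depending on Lucene: `RangeScoringSketch` and its `score` method are hypothetical stand-ins, a sorted `int[]` plays the role of the matching documents, and a `List<Integer>` plays the role of the collector. The only fact borrowed from Lucene is that `NO_MORE_DOCS` equals `Integer.MAX_VALUE`.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeScoringSketch {

  // Mirrors the value of DocIdSetIterator.NO_MORE_DOCS.
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /**
   * Hypothetical model of range scoring: collects every match in [min, max) and returns the first
   * match that is on or after max, or NO_MORE_DOCS if there is none.
   */
  static int score(int[] sortedMatches, List<Integer> collected, int min, int max) {
    for (int doc : sortedMatches) {
      if (doc < min) {
        continue; // before the requested range
      }
      if (doc >= max) {
        return doc; // next match beyond the range
      }
      collected.add(doc);
    }
    return NO_MORE_DOCS;
  }

  public static void main(String[] args) {
    int[] matches = {1, 5, 8, 12};

    // Whole leaf: the replacement for the removed two-argument variant.
    List<Integer> whole = new ArrayList<>();
    score(matches, whole, 0, NO_MORE_DOCS);

    // Two partitions of the same leaf, as two threads might score them.
    List<Integer> split = new ArrayList<>();
    score(matches, split, 0, 6);
    score(matches, split, 6, NO_MORE_DOCS);

    System.out.println(whole.equals(split)); // prints "true"
  }
}
```

Because the ranges are min-inclusive and max-exclusive, adjacent partitions cover every document exactly once.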

lucene/core/src/java/org/apache/lucene/search/BulkScorer.java

Lines changed: 0 additions & 12 deletions
@@ -27,18 +27,6 @@
  */
 public abstract class BulkScorer {

-  /**
-   * Scores and collects all matching documents.
-   *
-   * @param collector The collector to which all matching documents are passed.
-   * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null}
-   *     if they are all allowed to match.
-   */
-  public void score(LeafCollector collector, Bits acceptDocs) throws IOException {
-    final int next = score(collector, acceptDocs, 0, DocIdSetIterator.NO_MORE_DOCS);
-    assert next == DocIdSetIterator.NO_MORE_DOCS;
-  }
-
   /**
    * Collects matching documents in a range and return an estimation of the next matching document
    * which is on or after {@code max}.

lucene/core/src/java/org/apache/lucene/search/CollectorManager.java

Lines changed: 7 additions & 0 deletions
@@ -18,6 +18,7 @@

 import java.io.IOException;
 import java.util.Collection;
+import org.apache.lucene.index.LeafReaderContext;

 /**
  * A manager of collectors. This class is useful to parallelize execution of search requests and has
@@ -31,6 +32,12 @@
  *       fully collected.
  * </ul>
  *
+ * <p><strong>Note:</strong> Multiple {@link LeafCollector}s may be requested for the same {@link
+ * LeafReaderContext} via {@link Collector#getLeafCollector(LeafReaderContext)} across the different
+ * {@link Collector}s returned by {@link #newCollector()}. Any computation or logic that needs to
+ * happen once per segment requires specific handling in the collector manager implementation,
+ * because the collection of an entire segment may be split across threads.
+ *
  * @see IndexSearcher#search(Query, CollectorManager)
  * @lucene.experimental
  */
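
The Javadoc note above asks collector managers to handle once-per-segment work themselves. One way to do that is to keep shared, thread-safe state in the manager and let only the first collector to reach a segment run the work. This is a sketch under assumptions, not how `TotalHitCountCollectorManager` is implemented: `OncePerSegmentSketch` and `startPartition` are hypothetical names, and a plain `Object` stands in for whatever identifies a segment.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class OncePerSegmentSketch {

  // Shared across all collectors handed out by this (hypothetical) manager.
  private final Set<Object> seenSegments = ConcurrentHashMap.newKeySet();
  private final AtomicInteger perSegmentRuns = new AtomicInteger();

  /**
   * Called when a collector starts on a partition of a segment. Returns true only for the first
   * caller per segment key, which is where once-per-segment work would run.
   */
  boolean startPartition(Object segmentKey) {
    // Set#add is atomic here: exactly one concurrent caller per key observes true.
    if (seenSegments.add(segmentKey)) {
      perSegmentRuns.incrementAndGet(); // placeholder for the once-per-segment computation
      return true;
    }
    return false;
  }

  int perSegmentRuns() {
    return perSegmentRuns.get();
  }

  public static void main(String[] args) throws InterruptedException {
    OncePerSegmentSketch manager = new OncePerSegmentSketch();
    Object segment = new Object();
    // Two threads, each searching a different partition of the same segment.
    Thread t1 = new Thread(() -> manager.startPartition(segment));
    Thread t2 = new Thread(() -> manager.startPartition(segment));
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    System.out.println(manager.perSegmentRuns()); // prints "1"
  }
}
```

Individual collectors stay unchanged in this design. Only the manager knows that several of its collectors may visit the same segment.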
