Overview
Some Lucene scorers, such as `DenseConjunctionBulkScorer`, can now pass a `DocIdStream` to the collector for bulk collection.
We want to research what changes are needed in aggregations to adopt this, and how much it can speed them up.
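For context, here is a minimal sketch (not actual OpenSearch aggregator code) of what adopting bulk collection looks like on the collector side: a counting `LeafCollector` that overrides `LeafCollector#collect(DocIdStream)` so that when a scorer passes a `DocIdStream`, the count is taken in a single `DocIdStream#count()` call instead of one `collect(int)` call per matching document.

```java
import java.io.IOException;

import org.apache.lucene.search.DocIdStream;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;

/**
 * Sketch only: a counting collector that takes advantage of bulk collection.
 * When a scorer such as DenseConjunctionBulkScorer passes a DocIdStream,
 * the count is computed in one call instead of one collect(int) per doc.
 */
final class CountingCollector implements LeafCollector {

    long count;

    @Override
    public void setScorer(Scorable scorer) throws IOException {
        // scores are not needed for a pure count
    }

    @Override
    public void collect(int doc) throws IOException {
        // per-document fallback: one virtual call per matching doc
        count++;
    }

    @Override
    public void collect(DocIdStream stream) throws IOException {
        // bulk path: let the stream count its matching docs in one call
        count += stream.count();
    }
}
```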
Here are some ideas:
- Experiment with the new Lucene API released in 10.3, `LeafCollector#collectRange` - we can enable the `DocValuesSkipper` and use the pre-aggregated min/max from the skipper to speed up Min/Max aggregations (see the sketch after this list).
  - @asimmahmood1 will try this out. Related issue: Adding logic for histogram aggregation using skiplist #19130
- Experiment with new Lucene APIs expected in 10.4, `NumericDocValues#longValues` and `DocIdStream#intoArray` - use these APIs in some OpenSearch aggregations and benchmark accordingly.
- Experiment with pushing `NumericDocValues#longValues` down to the Codec level in `Lucene90DocValuesProducer`. Theoretically this suits the dense case (all documents have values), where the underlying storage format can be read sequentially in bulk efficiently.
- Be aware of the cost of virtual calls and try to reduce it through bulk processing.
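Below is a minimal sketch of the skipper idea from the first bullet above. It is not the actual OpenSearch MaxAggregator code: the class and method names are illustrative, and it assumes the numeric field was indexed with doc-value skip lists so that `LeafReader#getDocValuesSkipper` returns a non-null skipper. Whenever an entire level-0 skipper block is dense and fully covered by the doc range being collected, the pre-aggregated `maxValue(0)` is used instead of reading each document's value.

```java
import java.io.IOException;

import org.apache.lucene.index.DocValuesSkipper;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

/**
 * Sketch only: compute the max of a numeric doc-values field over the doc
 * range [minDoc, maxDoc), using the DocValuesSkipper's pre-aggregated max
 * whenever a whole level-0 block is dense and fully covered by the range.
 */
final class SkipperBackedMax {

    static long maxValue(LeafReader reader, String field, int minDoc, int maxDoc) throws IOException {
        // assumes the field was indexed with a skip index (skipper != null)
        DocValuesSkipper skipper = reader.getDocValuesSkipper(field);
        NumericDocValues values = reader.getNumericDocValues(field);
        long max = Long.MIN_VALUE;
        int doc = minDoc;
        while (doc < maxDoc) {
            skipper.advance(doc);
            int blockMin = skipper.minDocID(0);
            int blockMax = skipper.maxDocID(0);
            boolean covered = blockMin >= doc && blockMax < maxDoc;
            boolean dense = skipper.docCount(0) == blockMax - blockMin + 1;
            if (covered && dense) {
                // whole block inside the range and every doc has a value:
                // use the pre-aggregated max instead of per-doc reads
                max = Math.max(max, skipper.maxValue(0));
                doc = blockMax + 1;
            } else {
                // fall back to reading this document's value directly
                if (values.advanceExact(doc)) {
                    max = Math.max(max, values.longValue());
                }
                doc++;
            }
        }
        return max;
    }
}
```

The same pattern would apply to a min aggregation via `minValue(0)`; a histogram aggregation (see #19130) could likewise consult the skipper's per-block doc counts.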
Work In Progress
- Handpicked related Lucene changes to a specific branch: https://github.com/bowenlan-amzn/lucene/commits/10.2.2-bulkcollect/
- Consume the Lucene branch, and use the new API in the OpenSearch MaxAggregator: https://github.com/bowenlan-amzn/OpenSearch/commits/bulkcollection/
- Benchmark on the `nyc_taxis` workload. The default path takes 33 ms while the bulk collection path takes 80 ms, so a deep dive is needed to understand this result; we were expecting a ~20% speedup.
Related Lucene Changes