[META] Use Lucene bulk collection API to speed up aggregation #19324

@bowenlan-amzn

Overview

Some Lucene scorers, such as DenseConjunctionBulkScorer, can now pass a DocIdStream to the collector for bulk collection.
We want to research what changes aggregations need in order to adopt this, and how much it can speed them up.
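
To make the idea concrete, here is a minimal, self-contained sketch of why a collector might benefit from receiving matches in bulk. The `DocStream` and `Collector` interfaces below are hypothetical stand-ins written for this illustration; they are not Lucene's `DocIdStream` or `LeafCollector`, and their method shapes are assumptions. The point is only that a collector that overrides the bulk entry point can answer some questions (here, a count) without one call per document.

```java
import java.util.function.IntConsumer;

public class BulkCollectSketch {

    /** Hypothetical stand-in for a stream of matching doc IDs (not Lucene's DocIdStream). */
    public interface DocStream {
        void forEach(IntConsumer consumer);
        int count();
    }

    /** Hypothetical stand-in for a per-segment collector (not Lucene's LeafCollector). */
    public interface Collector {
        void collect(int doc);

        // Default bulk path: fall back to one collect(int) call per doc.
        default void collect(DocStream stream) {
            stream.forEach(this::collect);
        }
    }

    /** Counts hits; overrides the bulk path so no per-doc call is needed. */
    public static final class CountingCollector implements Collector {
        public long count;

        @Override
        public void collect(int doc) {
            count++;
        }

        @Override
        public void collect(DocStream stream) {
            count += stream.count();
        }
    }

    /** A dense stream over the doc ID range [min, max). */
    public static DocStream rangeStream(int min, int max) {
        return new DocStream() {
            @Override
            public void forEach(IntConsumer consumer) {
                for (int doc = min; doc < max; doc++) {
                    consumer.accept(doc);
                }
            }

            @Override
            public int count() {
                return Math.max(0, max - min);
            }
        };
    }

    /** Per-doc path: one collect(int) call per matching doc. */
    public static long countPerDoc(int min, int max) {
        CountingCollector collector = new CountingCollector();
        rangeStream(min, max).forEach(collector::collect);
        return collector.count;
    }

    /** Bulk path: a single collect(DocStream) call for the whole batch. */
    public static long countBulk(int min, int max) {
        CountingCollector collector = new CountingCollector();
        collector.collect(rangeStream(min, max));
        return collector.count;
    }

    public static void main(String[] args) {
        System.out.println(countPerDoc(0, 1000) == countBulk(0, 1000));
    }
}
```

Both paths produce the same count; the open question for aggregations is which of them can express their work in the bulk form at all.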

Here are some ideas:

  • Experiment with the new Lucene API released in 10.3, LeafCollector#collectRange
  • Experiment with the new Lucene APIs that will probably land in 10.4, NumericDocValues#longValues and DocIdStream#intoArray
    • Use these APIs in some OpenSearch aggregations and benchmark accordingly
    • Experiment with pushing NumericDocValues#longValues down to the codec level in Lucene90DocValuesProducer. In theory this suits the dense case (all documents have values), where the underlying storage format can be read sequentially and efficiently in bulk.
  • Be aware of the cost of virtual calls and try to reduce it through bulk processing
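
As a rough illustration of the last two ideas, the sketch below contrasts per-document value lookups with a bulk read over a dense range. Everything in it is hypothetical: `DenseValues`, `getRange`, and `ArrayValues` are names invented for this example and do not reflect the actual signatures of `NumericDocValues#longValues` or the layout used by Lucene90DocValuesProducer. It assumes the dense case, where every doc in the range has a value and storage is contiguous.

```java
public class BulkValuesSketch {

    /** Hypothetical dense doc-values view: every doc has exactly one long value. */
    public interface DenseValues {
        long get(int doc);

        // Default bulk read: still one virtual get() call per doc.
        default void getRange(int min, int max, long[] dst) {
            for (int doc = min; doc < max; doc++) {
                dst[doc - min] = get(doc);
            }
        }
    }

    /** Array-backed storage that can serve a whole range with one sequential copy. */
    public static final class ArrayValues implements DenseValues {
        private final long[] values;

        public ArrayValues(long[] values) {
            this.values = values;
        }

        @Override
        public long get(int doc) {
            return values[doc];
        }

        @Override
        public void getRange(int min, int max, long[] dst) {
            System.arraycopy(values, min, dst, 0, max - min);
        }
    }

    /** Per-doc path: one interface call per document in [min, max). */
    public static long sumPerDoc(DenseValues values, int min, int max) {
        long sum = 0;
        for (int doc = min; doc < max; doc++) {
            sum += values.get(doc);
        }
        return sum;
    }

    /** Bulk path: one interface call per batch, then a tight loop over a local buffer. */
    public static long sumBulk(DenseValues values, int min, int max, int batchSize) {
        long[] buffer = new long[batchSize];
        long sum = 0;
        for (int start = min; start < max; start += batchSize) {
            int end = Math.min(start + batchSize, max);
            values.getRange(start, end, buffer);
            for (int i = 0; i < end - start; i++) {
                sum += buffer[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[10];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        DenseValues values = new ArrayValues(data);
        System.out.println(sumPerDoc(values, 0, 10) == sumBulk(values, 0, 10, 4));
    }
}
```

The bulk path makes one interface call per batch instead of one per document. Whether that yields a measurable gain in a real aggregation is exactly what the benchmarks proposed above would need to show.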

Work In Progress

Related Lucene Changes

  • Enable collectors to take advantage of pre-aggregated data. #14401
  • Add bulk-retrieval API to NumericDocValues. #15149
  • Help collectors take advantage of bulk-retrieval of doc values. #15173

Labels: Roadmap:Search, lucene, v3.4.0
