Multi-pass aggregation support

Aggregations perform a single pass over the data today, reducing the shard-results at the coordinating node and then applying pipeline aggregations on the reduced results.  This is ideal for aggregation latency and scalability, but does limit the types of algorithms we can implement.

We would like to start investigating extending the agg framework to multiple passes over the data.  In particular, we want to be able to run multiple map-reduce cycles, not just multiple shard-local passes.  This is needed when global information (e.g. a global mean) is required for the second pass over the data.

Multi-pass should unblock a number of interesting aggs and enhancements:

- Spatial Stats https://github.com/elastic/elasticsearch/issues/14727
- Conflation https://github.com/elastic/elasticsearch/issues/11460
- Numeric and Geo Clustering (2D and low dimension `nD`, k-means), https://github.com/elastic/elasticsearch/issues/5512
- "other" bucket for Terms agg https://github.com/elastic/elasticsearch/issues/12411
- Accurate counts on terms agg

### Technicals and Open Questions

- We should probably reuse the existing `SearchPhase` mechanisms in place.  This will limit the amount of new code that we need to write, and should (hopefully) play nicer with mechanisms like CCS
  - One approach is adding a new phase that is executed after `AggregationPhase`, which can recursively keep calling itself for the next phase to perform multiple passes
  - Alternatively, we could implement a new phase after `AggregationPhase` which deals with the `n+1` passes
- Should we limit to two passes?  `n` passes, but limited to low number?
- How do we pass back state data to the nth passes?  Global map of state which all aggs can fetch from (based on known ordinal or something?)  Pass down specific state to relevant sub-agg?
- Are multi-pass aggs a new type of agg that encompasses logic for each stage?  Or is a multi-pass agg a regular agg for first pass, and a new type of agg for second/etc passes?
- Do multi-pass agg share the same syntax as existing aggs, in the same tree?

There are some implications to multi-pass and async search which need resolving.  Perhaps multi-pass is implemented on a per-"feature" basis (e.g. a dedicated "cluster endpoint" that does kmeans clustering, instead of trying to modify the agg framework more generically)


Probably a lot more points to consider, just wanted to get an initial braindump down :)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Multi-pass aggregation support #50863

Technicals and Open Questions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Multi-pass aggregation support #50863

Description

Technicals and Open Questions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions