Description
Aggregations perform a single pass over the data today, reducing the shard-results at the coordinating node and then applying pipeline aggregations on the reduced results. This is ideal for aggregation latency and scalability, but does limit the types of algorithms we can implement.
We would like to start investigating extending the agg framework to multiple passes over the data. In particular, we want to be able to run multiple map-reduce cycles, not just multiple shard-local passes. This is needed when global information (e.g. a global mean) is required for the second pass over the data.
Multi-pass should unblock a number of interesting aggs and enhancements:
- Spatial Stats spatial_stats aggregation #14727
- Conflation Conflation Aggregator #11460
- Numeric and Geo Clustering (2D and low dimension
nD
, k-means), Add K-means clustering feature #5512 - "other" bucket for Terms agg Terms agg: calculate aggs on 'other' bucket #12411
- Accurate counts on terms agg
Technicals and Open Questions
- We should probably reuse the existing
SearchPhase
mechanisms in place. This will limit the amount of new code that we need to write, and should (hopefully) play nicer with mechanisms like CCS- One approach is adding a new phase that is executed after
AggregationPhase
, which can recursively keep calling itself for the next phase to perform multiple passes - Alternatively, we could implement a new phase after
AggregationPhase
which deals with then+1
passes
- One approach is adding a new phase that is executed after
- Should we limit to two passes?
n
passes, but limited to low number? - How do we pass back state data to the nth passes? Global map of state which all aggs can fetch from (based on known ordinal or something?) Pass down specific state to relevant sub-agg?
- Are multi-pass aggs a new type of agg that encompasses logic for each stage? Or is a multi-pass agg a regular agg for first pass, and a new type of agg for second/etc passes?
- Do multi-pass agg share the same syntax as existing aggs, in the same tree?
There are some implications to multi-pass and async search which need resolving. Perhaps multi-pass is implemented on a per-"feature" basis (e.g. a dedicated "cluster endpoint" that does kmeans clustering, instead of trying to modify the agg framework more generically)
Probably a lot more points to consider, just wanted to get an initial braindump down :)