Skip to content

Multi-pass aggregation support #50863

Closed as not planned
Closed as not planned
@polyfractal

Description

@polyfractal

Aggregations perform a single pass over the data today, reducing the shard-results at the coordinating node and then applying pipeline aggregations on the reduced results. This is ideal for aggregation latency and scalability, but does limit the types of algorithms we can implement.

We would like to start investigating extending the agg framework to multiple passes over the data. In particular, we want to be able to run multiple map-reduce cycles, not just multiple shard-local passes. This is needed when global information (e.g. a global mean) is required for the second pass over the data.

Multi-pass should unblock a number of interesting aggs and enhancements:

Technicals and Open Questions

  • We should probably reuse the existing SearchPhase mechanisms in place. This will limit the amount of new code that we need to write, and should (hopefully) play nicer with mechanisms like CCS
    • One approach is adding a new phase that is executed after AggregationPhase, which can recursively keep calling itself for the next phase to perform multiple passes
    • Alternatively, we could implement a new phase after AggregationPhase which deals with the n+1 passes
  • Should we limit to two passes? n passes, but limited to low number?
  • How do we pass back state data to the nth passes? Global map of state which all aggs can fetch from (based on known ordinal or something?) Pass down specific state to relevant sub-agg?
  • Are multi-pass aggs a new type of agg that encompasses logic for each stage? Or is a multi-pass agg a regular agg for first pass, and a new type of agg for second/etc passes?
  • Do multi-pass agg share the same syntax as existing aggs, in the same tree?

There are some implications to multi-pass and async search which need resolving. Perhaps multi-pass is implemented on a per-"feature" basis (e.g. a dedicated "cluster endpoint" that does kmeans clustering, instead of trying to modify the agg framework more generically)

Probably a lot more points to consider, just wanted to get an initial braindump down :)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions