Closed
Description
This is a meta issue discussing a complete rewrite of the Elasticsearch rollup codebase with the aim of improving the following points:
- Drop rollup jobs in favour of integrating rollups with ILM. This means that rolling up an index will work similarly to shrinking an index. The rollup will be done when indexing is complete and the action will roll up the entire index at the same time.
- Make rollup functionality easier to set up and administer from an operational point of view. For example, allow limited editing of existing rollup configuration (add new metrics etc.)
- Make rollup indices behave much more like regular indices, simplifying querying and management.
- Improve reliability. Existing rollup jobs are not atomic and sometimes fail midway, leaving rollup indices that are not complete. We should make the rollup computation atomic.
- Improve performance of rollup jobs. Some large-scale use cases can run into bottlenecks where the search phase of rollup is not fast enough (due to limited thread involvement across cluster).
- Implement support for pre-aggregated data structures to enable cardinality and percentiles ([Rollup] Support for data-structure based metrics (Cardinality, Percentiles, etc) #33214)
Below we outline a high level plan of changes that will help us achieve the above goals:
- Remove the `_rollup_search` endpoint in favour of implementing search on rollup indexes within the `_search` endpoint ([RollupV2] Implement search resolution #67783) @csoulios
- Create an `aggregate_metric` field type that will store pre-aggregated metrics (`min`, `max`, `sum` etc) for a specific rollup group. (Implement aggregate_metric field mapper #49830)
- Modify existing metric aggregations so that they operate on the `aggregate_metric` fields. The `aggregate_metric` field type will provide the correct metric to the requesting aggregator, while using the same field name. This field type will not be indexed but stored as binary doc values. (Implement aggregations on aggregate metrics #53986)
- Modify aggregators to pull doc count rather than just incrementing. This allows the aggregator framework to treat a single "rollup" document as if it were multiple "raw" documents. (Add doc_count field mapper #58339)
- Create a `rollup_meta` cluster metadata to store information such as date_histogram interval and timezone etc. (Add RollupV2 cluster metadata behind feature-flag #64680)
- Create new Rollup endpoint to rollup indices based on the new model (Adds a new Rollup Action #64900)
- Create an ILM rollup action and remove the rollup job functionality. issue, PR @talevy
- Refactor out all the Rollup-V2 mentions in the code-base in favor of the Rollup-Action issue @talevy
- [Docs] Rollup Documentation (meta tracking issue: [DOCS] Rollup refactor docs #65515)
- Fix muted tests that are flaky (Various RollupActionSingleNodeTests failing with IndexNotFoundException #69799)
- Add ability for doing rollups of rollups. This means supporting `_doc_count` field and `aggregate_metric_double` fieldtypes (Add RollupAction support for AggregateDoubleMetric fields #70534)
- Rollup Action Improvements and Testing
- Action should ensure that relevant index-metadata/settings is copied to rollup index (hidden index?, tier?, ILM Execution state?)
- Resolving simplification strategies (Simplify ILM Policy solution for managing lifecycle of rollup indices #70334)
- (Maybe, depending on ILM Simplification) Add ability for Rollup ILM Action to optionally-delete original index upon rollup. This means atomically removing the original index and adding the rollup index into the datastream backing indices.
- Rollup Action must throw exception when rollup index already exists issue
- DateHistogramGroupConfig is re-using the original config with a `delay` option, need to make a new Action-specific config without a delay
- add more config verification before executing the action
- Add more thorough tests for various groupings in the indexer
- Add Rally benchmark for rolling up indices
- investigate Jim's new broadcast action gh compare link
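
To make the data model in the plan above more concrete, here is a minimal Python sketch of how pre-aggregated metrics and a per-document count could let the same aggregation logic run over raw and rolled-up documents, including rollups of rollups. It is an illustration only, not the Elasticsearch implementation: the helper names (`as_aggregate`, `merge`, `rollup`, `doc_count`, `metric`), the `value_count` sub-metric, the `cpu` field, and the use of plain dicts and epoch-millisecond timestamps are all assumptions made for the example.

```python
def as_aggregate(value):
    """Treat a raw numeric value as a pre-aggregate of exactly one sample."""
    if isinstance(value, dict):
        return value
    # Assumption: value_count is stored alongside min/max/sum so avg can be derived.
    return {"min": value, "max": value, "sum": value, "value_count": 1}


def merge(a, b):
    """Combine two pre-aggregated metric objects into one."""
    return {
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
        "sum": a["sum"] + b["sum"],
        "value_count": a["value_count"] + b["value_count"],
    }


def rollup(docs, interval_ms, field):
    """Produce one summary document per time bucket.

    Accepts raw documents and already rolled-up documents, so the output
    can be rolled up again at a coarser interval.
    """
    buckets = {}
    for doc in docs:
        key = doc["@timestamp"] // interval_ms * interval_ms
        agg = as_aggregate(doc[field])
        count = doc.get("_doc_count", 1)  # a raw document counts as one
        if key in buckets:
            buckets[key][field] = merge(buckets[key][field], agg)
            buckets[key]["_doc_count"] += count
        else:
            buckets[key] = {"@timestamp": key, "_doc_count": count, field: dict(agg)}
    return [buckets[k] for k in sorted(buckets)]


def doc_count(docs):
    """Pull the document count instead of incrementing by one per document."""
    return sum(d.get("_doc_count", 1) for d in docs)


def metric(docs, field, name):
    """Compute min, max, sum or avg over raw or rolled-up documents."""
    parts = [as_aggregate(d[field]) for d in docs]
    if name == "min":
        return min(p["min"] for p in parts)
    if name == "max":
        return max(p["max"] for p in parts)
    if name == "sum":
        return sum(p["sum"] for p in parts)
    if name == "avg":
        return sum(p["sum"] for p in parts) / sum(p["value_count"] for p in parts)
    raise ValueError(f"unsupported metric: {name}")


# Hypothetical sample data: four raw documents with a "cpu" metric.
raw = [
    {"@timestamp": 0, "cpu": 1.0},
    {"@timestamp": 30_000, "cpu": 3.0},
    {"@timestamp": 70_000, "cpu": 5.0},
    {"@timestamp": 3_700_000, "cpu": 7.0},
]
minutely = rollup(raw, 60_000, "cpu")          # rollup of raw documents
hourly = rollup(minutely, 3_600_000, "cpu")    # rollup of a rollup

for docs in (raw, minutely, hourly):
    print(len(docs), doc_count(docs), metric(docs, "cpu", "avg"))
```

In this sketch the three result sets contain 4, 3 and 2 documents respectively, yet each reports a document count of 4 and an average of 4.0, which is the property the `_doc_count` and `aggregate_metric` items above are aiming for: queries keep using the same field name and get the same answers regardless of whether the underlying documents are raw or rolled up.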