Description
The aggregations framework defines several similar but subtly different classes & enums to represent what type of data an aggregation operates on. Their correct usage and interaction is not obvious, especially to new developers. This refactoring aims to make input type specification for aggregations easier to understand.
Plan
To replace the assorted hard coded ValuesSource
references, we want to build a dynamic registry which will map Field types to Values Sources and specific aggregator implementations.
TODO - before Fast follow after merging to master
- Get hard coded
ValuesSources
out ofValuesSourceConfig
-
toValuesSource
(Extract Empty/Script/Missing ValuesSource behavior to an interface #48320) -
resolve
- This will get done with the registry prototype
-
- Registry Prototype (ValuesSourceRegistry Prototype #48758)
- Look up aggregators by
ValueSourceType
(working in prototype for simple case) - Resolve
ValuesSourceType
dynamically
- Look up aggregators by
- Formalize registration process
- Finalize where registration happens for core aggregations
- Finalize where registration happens for plugin aggregations & aggregators
- Replace prototype singleton with production-ready implemetation Plumb ValuesSourceRegistry through to QuerySearchContext #51710
- Demonstrate that we can correctly register a new type from a plugin. ValuesSource Refactor: move histo VSType into XPack module #53298
- Convert Aggregations
- Core
- Geo Centroid (@talevy) Wire-up geo_centroid agg to ValuesSourceRegistry #53040
- Geo Hash Grid (@talevy) Wire-up geotile_grid and geohash_grid to ValuesSourceRegistry #53037
- Geo Tile Grid (@talevy) Wire-up geotile_grid and geohash_grid to ValuesSourceRegistry #53037
- Auto Date Histogram (Tozzi) Wire up AutoDateHistogram to the ValuesSourceRegistry #55687
- Date Histogram (Tozzi) VSRefactor wire up DateHistogram #53484
- Histogram
- Missing ValuesSource refactoring: wire up Missing aggregation #53511
- Date Range Vs refactor wire up ranges and date ranges #52918
- Geo Distance (@talevy) (depends on RangeAggregator)
- IP Range Wire up IpRangeAggregation to the ValuesSourceRegistry #55831
- Range Vs refactor wire up ranges and date ranges #52918
- Diversified
- Significant Terms ValuesSource refactoring: Wire up SigTerms aggregation #52590
[ ] Significant Text (@polyfractal WIP)not VSAB- Rare Terms Convert RareTerms to new VS registry #52166
- Terms - Rewire terms agg to use new VS registry #51182
- Average - ValuesSource refactoring: Wire up Avg aggregation #52752
- Cardinality - Wire Cardinality aggregation to work with the ValuesSourceRegistry #51337
- Extended Stats - ValuesSource refactoring: Wire up ExtendedStats aggregation #53227
- Geo Bounds (@talevy) Add more tests in GeoBoundsAggregatorTests #53033 Wire up geo_bounds aggregation to ValuesSourceRegistry #53034
- Max - Wire up Max & Min aggregations #52219
- Median - VS refactoring: Wire up median (MAD) aggregation #52945
- Min - Wire up Max & Min aggregations #52219
- Percentile Rank - Wire PercentileRanks aggregator into new VS framework #51693
- Percentiles - Wire Percentiles aggregator into new VS framework #51639
- Histo field / percentiles @polyfractal ValuesSource Refactor: move histo VSType into XPack module #53298
- Stats - ValuesSource refactoring: Wire up stats aggregation #52891
- Sum - ValuesSource refactoring: Wire up Sum aggregation #52571
- Value Count - Wire up Value Count #52225
- Analytics Package
- Parent Join Package
- Children (Tozzi)
- Parent (Tozzi)
- Core
- Remove the EquivalenceType/isA logic in
ValueType
. This will involve changing how we do type checking of userValueTypeHints in the parser. Vs refactor parser cleanup #53198 - Additional Testing
- Numeric metric aggs (e.g. Max) formatting for date values
- Mock-script unit tests for
ValuesSource
based aggs - Unmapped & missing tests for Auto Date Histogram
- Unmapped & Missing tests for Geo aggs
- Add the supported type tests to the remaining aggs
- Plugin support for automated field test
- Remove (or at least dramatically decrease the surface area of)
ValueType
- Non-VSAB aggregations still use
ValueType
, but refactoring those is out of scope for the initial merge. - VSAB still uses
ValueType
for parsing user value hints. There's a post-merge PR for removing that.
- Non-VSAB aggregations still use
- Remove
ValuesSourceType
info fromValuesSourceAggregationBuilder
(Remove ValuesSourceType argument to ValuesSourceAggregationBuilder #48638) - Remove generic from
ValuesSourceConfig
(and probably other places) - Remove remaining special behavior around
ValuesSourceType.ANY
& and remove ANY Remove ValuesSourceType.ANY #51539 - Don't rely on serializing
CoreValuesSourceType
CoreValuesSourceType no longer implements Writable #51276 - Remove any remaining hard coded
ValuesSource
references - Make
ValuesSourceConfig
immutable Soft immutability for VSConfig #52729 - Package Docs for
o.e.s.aggregations.support
(Tozzi) Vs refactor javadoc and comment cleanup #53427 - Ask Kibana QA to run their tests as a real-world integration/smoke test for missed or accidentally altered semantics - We'll get this for "free" once we merge to master
- CCS Testing with 7.x
TODO - backport
- Remove
List.of()
usage - CCS Testing with 6.x
- Extra-careful review of exception types to make sure we don't have breaking changes
- PRs to cherry pick from master:
- Tests for agg missing values Tests for agg missing values #51068 (Tozzi)
- Refactor Percentiles/Ranks aggregation builders and factories Refactor Percentiles/Ranks aggregation builders and factories #51887 (PR: [7.x] Refactor Percentiles/Ranks aggregation builders and factories (#51887) #54537)
- Unit tests for Range and DateRange aggs Unit tests for Range and DateRange aggs #52380 (Tozzi)
- Comprehensively test supported/unsupported field type:agg combinations Comprehensively test supported/unsupported field type:agg combinations #52493 (PR: [7.x] Comprehensively test supported/unsupported field type:agg combinations #54451)
- Add supported type tests for Percentiles/Ranks Add supported type tests for Percentiles/Ranks #52597
- Use newIndexSearcher() to avoid incompatible readers in AggregatorTestCase#testSupportedFieldTypes Use newIndexSearcher() to avoid incompatible readers in AggregatorTestCase#testSupportedFieldTypes #52723 (PR: [7.x] Comprehensively test supported/unsupported field type:agg combinations #54451)
- Tests introduced in Agg test coverage for unmapped, scripting, supported-types #53431
TODO - Post merge
- After all VSAB aggregations are converted, remove the legacy shim from VSRegistry
- Stop returning null ValuesSource from VSConfig for unmapped case - see Improve handling of unmapped ValuesSources #53238
- Deprecation plan for
serializeTargetValueType
(doesn't strictly need to be part of this refactor). The following classes override this method:CardinalityAggregationBuilder
MissingAggregationBuilder
ValueCountAggregationBuilder
RareTermsAggregationBuilder
SignificantTermsAggregationBuilder
TermsAggregationBuilder
- Unify constructor signature (and thus
AggregatorSupplier
) between Min&Max and other metrics. - Clean up user value hint parsing - see Parse user type hints directly to ValuesSourceTypes #53424
- ValuesSourceRegistry for
MultiValuesSourceAggregationBuilder
subclasses - ValuesSourceRegistry for
ArrayValuesSourceAggregationBuilder
subclasses - Remove remaining uses of
ValueType
- Rename everything.
Out of Scope:
After considerable discussion, we've decided to deal with the multivalued aggregations at a later point. The current abstractions there aren't working as well as we'd like, so investing more in them right now doesn't seem like a good use of time.
Background
Involved Classes
ValuesSource & its subclasses
These are thin wrappers which provide a unified interface to different ways of getting input data (e.g. DocValues from Lucene, or script output). A class hierarchy defines the type of values returned by the source.
ValueSourceType
A small enum who's values roughly correspond to the first level subclasses of ValuesSource
. This gets passed as a constructor argument to ValuesSourceConifg
as a hint for what type of values source it should later construct.
ValueType
A more robust enum specifying a wider range of types than ValuesSourceType
. The instances of this enum provide a mapping back to a ValuesSourceType
. This is used when users supply type hints, e.g. as part of a missing value specification.
ValuesSourceConfig
This class resolves the actual ValuesSource
to use on a given shard for the aggregation. It considers script values, missing values, and mapped fields.
ValuesSourceAggregationBuilder
This is the base class for all aggregations using the value source model, and it ties together usage of all of the above classes. It is generic over a ValuesSource
class, and accepts a ValuesSourceType
and two ValueType
parameters. How those different value classes interact is not at all obvious; a good first step at this refactoring would be just documenting those relationships.
Next Steps
I'm open to discussion as to what the best way to clean this up should be. In the interest of seeding that discussion, the following are my first-pass suggestions.
Clarify ValuesSourceAggregationBuilder
This would consist of descriptive field names & comments around the intended use of the various type parameters ValuesSourceAggregationBuilder
accepts. Some javadoc on the intended roles of ValueSourceType
and ValueType
would help too.
Formalize relationship between ValuesSourceType
and ValuesSource
Currently, these are related via some cascading if
statements in ValuesSourceConfig
, but this misses out on many of the benefits of having an enum in the first place. Notably, nothing enforces that all enum values are accounted for, and developers wishing to understand the relationship between two classes must look to a third class for that information.
ValueSourceType.ANY
ValuesSourceType.ANY
creates a few edge cases. It's the only ValuesSourceType
that doesn't map cleanly back to a ValuesSource
subclass. ValuesSourceConfig
interprets it as a bytes source, except in the case of a script, which allows BYTES
but not ANY
. Aggregations that can operate on multiple input types (e.g. TermsAggregation, which can operate on strings or numbers) use ValuesSourceType.ANY
to indicate this, which is deceptive since later in the process they are restricted to more specific source types.
ValuesSourceType Ordinal Serialization Issue
In addition to any other refactoring work being done here, ValuesSourceType
is an enum being serialized by ordinal value, which is prone to error. If we keep this enum, we should adopt a pattern of serializing based on an ID we control, to allow for deletes & reorderings later. See ValueType
, which does this correctly.
ValueType.NUMBER
and ValueType.NUMERIC
ValueType
specifies both NUMBER
and NUMERIC
, which are apparently identical as far as the enum is concerned. There's a five year old TODO asking about the difference between these two. We should either merge them, or clarify why we need both.
Rename ValuesSource
ValuesSource
is very close to ValueSource
, which is an interface in org.elasticsearch.ingest
. It's easy to pull up the wrong class in Intellij, especially if you've only heard the name not seen it.