Skip to content

Refactor ValuesSource and related classes #42949

Closed
@not-napoleon

Description

@not-napoleon

The aggregations framework defines several similar but subtly different classes & enums to represent what type of data an aggregation operates on. Their correct usage and interaction is not obvious, especially to new developers. This refactoring aims to make input type specification for aggregations easier to understand.

Plan

To replace the assorted hard coded ValuesSource references, we want to build a dynamic registry which will map Field types to Values Sources and specific aggregator implementations.

TODO - before Fast follow after merging to master

TODO - backport

TODO - Post merge

  • After all VSAB aggregations are converted, remove the legacy shim from VSRegistry
  • Stop returning null ValuesSource from VSConfig for unmapped case - see Improve handling of unmapped ValuesSources #53238
  • Deprecation plan for serializeTargetValueType (doesn't strictly need to be part of this refactor). The following classes override this method:
    • CardinalityAggregationBuilder
    • MissingAggregationBuilder
    • ValueCountAggregationBuilder
    • RareTermsAggregationBuilder
    • SignificantTermsAggregationBuilder
    • TermsAggregationBuilder
  • Unify constructor signature (and thus AggregatorSupplier) between Min&Max and other metrics.
  • Clean up user value hint parsing - see Parse user type hints directly to ValuesSourceTypes #53424
  • ValuesSourceRegistry for MultiValuesSourceAggregationBuilder subclasses
  • ValuesSourceRegistry for ArrayValuesSourceAggregationBuilder subclasses
  • Remove remaining uses of ValueType
  • Rename everything.

Out of Scope:

After considerable discussion, we've decided to deal with the multivalued aggregations at a later point. The current abstractions there aren't working as well as we'd like, so investing more in them right now doesn't seem like a good use of time.

Background

Involved Classes

ValuesSource & its subclasses

These are thin wrappers which provide a unified interface to different ways of getting input data (e.g. DocValues from Lucene, or script output). A class hierarchy defines the type of values returned by the source.

ValueSourceType

A small enum who's values roughly correspond to the first level subclasses of ValuesSource. This gets passed as a constructor argument to ValuesSourceConifg as a hint for what type of values source it should later construct.

ValueType

A more robust enum specifying a wider range of types than ValuesSourceType. The instances of this enum provide a mapping back to a ValuesSourceType. This is used when users supply type hints, e.g. as part of a missing value specification.

ValuesSourceConfig

This class resolves the actual ValuesSource to use on a given shard for the aggregation. It considers script values, missing values, and mapped fields.

ValuesSourceAggregationBuilder

This is the base class for all aggregations using the value source model, and it ties together usage of all of the above classes. It is generic over a ValuesSource class, and accepts a ValuesSourceType and two ValueType parameters. How those different value classes interact is not at all obvious; a good first step at this refactoring would be just documenting those relationships.

Next Steps

I'm open to discussion as to what the best way to clean this up should be. In the interest of seeding that discussion, the following are my first-pass suggestions.

Clarify ValuesSourceAggregationBuilder

This would consist of descriptive field names & comments around the intended use of the various type parameters ValuesSourceAggregationBuilder accepts. Some javadoc on the intended roles of ValueSourceType and ValueType would help too.

Formalize relationship between ValuesSourceType and ValuesSource

Currently, these are related via some cascading if statements in ValuesSourceConfig, but this misses out on many of the benefits of having an enum in the first place. Notably, nothing enforces that all enum values are accounted for, and developers wishing to understand the relationship between two classes must look to a third class for that information.

ValueSourceType.ANY

ValuesSourceType.ANY creates a few edge cases. It's the only ValuesSourceType that doesn't map cleanly back to a ValuesSource subclass. ValuesSourceConfig interprets it as a bytes source, except in the case of a script, which allows BYTES but not ANY. Aggregations that can operate on multiple input types (e.g. TermsAggregation, which can operate on strings or numbers) use ValuesSourceType.ANY to indicate this, which is deceptive since later in the process they are restricted to more specific source types.

ValuesSourceType Ordinal Serialization Issue

In addition to any other refactoring work being done here, ValuesSourceType is an enum being serialized by ordinal value, which is prone to error. If we keep this enum, we should adopt a pattern of serializing based on an ID we control, to allow for deletes & reorderings later. See ValueType, which does this correctly.

ValueType.NUMBER and ValueType.NUMERIC

ValueType specifies both NUMBER and NUMERIC, which are apparently identical as far as the enum is concerned. There's a five year old TODO asking about the difference between these two. We should either merge them, or clarify why we need both.

Rename ValuesSource

ValuesSource is very close to ValueSource, which is an interface in org.elasticsearch.ingest. It's easy to pull up the wrong class in Intellij, especially if you've only heard the name not seen it.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions