Skip to content

Add multi_terms aggs #67597

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Feb 3, 2021
Merged

Add multi_terms aggs #67597

merged 15 commits into from
Feb 3, 2021

Conversation

imotov
Copy link
Contributor

@imotov imotov commented Jan 15, 2021

Adds a multi_terms aggregation support. The multi terms aggregation works
very similarly to the terms aggregation but supports multiple terms. The goal
of this PR is to add the basic functionality so it is not optimized at the
moment. It will be done in follow up PRs.

Closes #65623

Adds a multi_terms aggregation support. The multi terms aggregation works
very similarly to the terms aggregation but supports multiple terms. The goal
of this PR is to add the basic functionality so it is not optimized at the
moment. It will be done in follow up PRs.

Closes elastic#65623
@imotov imotov marked this pull request as ready for review January 15, 2021 21:59
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 15, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

Copy link
Member

@not-napoleon not-napoleon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions:

  • What happens if the user specifies both a top-level value type hint and a per-field value type hint?
  • Is there a reason to not always operate in heterogeneous mode, and just copy the top level type down to each field? That seems clearer, in that there's only one execution path.

I think this is good, but we need to really figure out what we're doing with multi values source aggs and the registry. I'll push pondering that up in my mental priority.

AbstractObjectParser<? extends MultiValuesSourceAggregationBuilder<?>, T> objectParser,
boolean scriptable, boolean timezoneAware, boolean filterable) {
boolean scriptable, boolean timezoneAware, boolean filterable, boolean heterogeneous) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to add some javadoc for what the new heterogeneous parameter is for. It makes sense in context, but I don't trust myself to remember the next time I'm looking at this code.

@imotov
Copy link
Contributor Author

imotov commented Jan 27, 2021

  • What happens if the user specifies both a top-level value type hint and a per-field value type hint?

It is not really possible at the moment since all aggregations that have top value hint don't use heterogenous hint.

  • Is there a reason to not always operate in heterogeneous mode, and just copy the top level type down to each field? That seems clearer, in that there's only one execution path.

Yes, it makes sense, but I can switch other aggs to heterogenous mode in a separate PR, this PR is already monstronous, I think it might be cleaner to do this as a follow up PRs to keep this PR from bloating even more :)

@imotov
Copy link
Contributor Author

imotov commented Jan 27, 2021

@elasticmachine run elasticsearch-ci/docs

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review is still WIP, but here are the comments so far. Figured I'd checkpoint myself. :)

Still have another dozen files to look through, but looking good so far!

@imotov
Copy link
Contributor Author

imotov commented Feb 1, 2021

@polyfractal thanks for the initial review. I think I have addressed your comments. Would you mind taking another look?

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few more comments/questions, but otherwise LGTM assuming these don't open up a big can of worms :)

} else {
this.collectMode = collectMode;
}
// TODO: Make a function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a real TODO or just c/p from terms agg? E.g. is it a todo for this PR or just generally :)

for (TermValues termValues : termValuesList) {
List<Object> values = termValues.collectValues(doc);
if (values == null) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I interpreting this correct that if one of the doc fields is missing/null, we basically skip the entire document?

If yes, presumably setting missing is the required action if you want the document to be aggregated? Do we make a note of that anywhere in the docs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the documentation of missing parameter we have:

By default if any of the key components are missing the entire document will be ignored
but it is also possible to treat them as if they had a value.

I think it might have been not very clear. So I rephrased it a bit.

InternalMultiTerms.Bucket[][] topBucketsPerOrd = new InternalMultiTerms.Bucket[owningBucketOrds.length][];
long[] otherDocCounts = new long[owningBucketOrds.length];
for (int ordIdx = 0; ordIdx < owningBucketOrds.length; ordIdx++) {
collectZeroDocEntriesIfNeeded(owningBucketOrds[ordIdx]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just remove this functionality for multi-terms? It's always been strange functionality in terms imo, and I could see it being very expensive for multi-terms if there are a lot of high cardinality fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. We should probably remove it.

if (values.advanceExact(doc)) {
List<Object> objects = new ArrayList<>();
int valuesCount = values.docValueCount();
long previous = Long.MAX_VALUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if the last value really was Long.MAX_VALUE? we accept that as an indexable value rigt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the first value (not the last value)? The first value is handled a 3 lines below in

if (previous != val || i == 0) {

The last value shouldn't be an issues unless I am missing something. I think the initial assignment here is not important (it's there to just make the compiler happy). This code was heavily inspired by the terms aggs and didn't see reason to deviate from the terms agg implementation here.

if (values.advanceExact(doc)) {
List<Object> objects = new ArrayList<>();
int valuesCount = values.docValueCount();
double previous = Double.MAX_VALUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto to question about longs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

import org.elasticsearch.search.lookup.LeafDocLookup;
import org.elasticsearch.xpack.analytics.AnalyticsPlugin;

public class MultiTermsAggregatorTests extends AggregatorTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a test (either here, or in the yaml tests) which uses a child metric agg somewhere, just to smoke test that multi-terms does the right thing when child aggs are used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Child aggs are in a module, so we can only add nested or add it into yaml. I will add it to yaml.

@mark-vieira
Copy link
Contributor

@elasticmachine update branch

@imotov imotov merged commit 9e3384e into elastic:master Feb 3, 2021
imotov added a commit to imotov/elasticsearch that referenced this pull request Feb 3, 2021
Adds a multi_terms aggregation support. The multi terms aggregation works
very similarly to the terms aggregation but supports multiple terms. The goal
of this PR is to add the basic functionality so it is not optimized at the
moment. It will be done in follow up PRs.

Closes elastic#65623
imotov added a commit that referenced this pull request Feb 4, 2021
Adds a multi_terms aggregation support. The multi terms aggregation works
very similarly to the terms aggregation but supports multiple terms. The goal
of this PR is to add the basic functionality so it is not optimized at the
moment. It will be done in follow up PRs.

Closes #65623
@imotov imotov deleted the issue-65623-multi-terms branch February 4, 2021 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.12.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for multi-field keys to terms aggs
7 participants