-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Add multi_terms aggs #67597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add multi_terms aggs #67597
Conversation
Adds a multi_terms aggregation support. The multi terms aggregation works very similarly to the terms aggregation but supports multiple terms. The goal of this PR is to add the basic functionality so it is not optimized at the moment. It will be done in follow up PRs. Closes elastic#65623
Pinging @elastic/es-analytics-geo (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Questions:
- What happens if the user specifies both a top-level value type hint and a per-field value type hint?
- Is there a reason to not always operate in heterogeneous mode, and just copy the top level type down to each field? That seems clearer, in that there's only one execution path.
I think this is good, but we need to really figure out what we're doing with multi values source aggs and the registry. I'll push pondering that up in my mental priority.
AbstractObjectParser<? extends MultiValuesSourceAggregationBuilder<?>, T> objectParser, | ||
boolean scriptable, boolean timezoneAware, boolean filterable) { | ||
boolean scriptable, boolean timezoneAware, boolean filterable, boolean heterogeneous) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add some javadoc for what the new heterogeneous
parameter is for. It makes sense in context, but I don't trust myself to remember the next time I'm looking at this code.
It is not really possible at the moment since all aggregations that have top value hint don't use heterogenous hint.
Yes, it makes sense, but I can switch other aggs to heterogenous mode in a separate PR, this PR is already monstronous, I think it might be cleaner to do this as a follow up PRs to keep this PR from bloating even more :) |
@elasticmachine run elasticsearch-ci/docs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review is still WIP, but here are the comments so far. Figured I'd checkpoint myself. :)
Still have another dozen files to look through, but looking good so far!
.../src/main/java/org/elasticsearch/search/aggregations/bucket/terms/AbstractInternalTerms.java
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/xpack/analytics/multiterms/MultiTermsAggregationBuilder.java
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/xpack/analytics/multiterms/MultiTermsAggregationBuilder.java
Outdated
Show resolved
Hide resolved
...analytics/src/main/java/org/elasticsearch/xpack/analytics/multiterms/InternalMultiTerms.java
Outdated
Show resolved
Hide resolved
...analytics/src/main/java/org/elasticsearch/xpack/analytics/multiterms/InternalMultiTerms.java
Show resolved
Hide resolved
...analytics/src/main/java/org/elasticsearch/xpack/analytics/multiterms/InternalMultiTerms.java
Show resolved
Hide resolved
...analytics/src/main/java/org/elasticsearch/xpack/analytics/multiterms/InternalMultiTerms.java
Show resolved
Hide resolved
@polyfractal thanks for the initial review. I think I have addressed your comments. Would you mind taking another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few more comments/questions, but otherwise LGTM assuming these don't open up a big can of worms :)
} else { | ||
this.collectMode = collectMode; | ||
} | ||
// TODO: Make a function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a real TODO or just c/p from terms agg? E.g. is it a todo for this PR or just generally :)
for (TermValues termValues : termValuesList) { | ||
List<Object> values = termValues.collectValues(doc); | ||
if (values == null) { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I interpreting this correct that if one of the doc fields is missing/null, we basically skip the entire document?
If yes, presumably setting missing
is the required action if you want the document to be aggregated? Do we make a note of that anywhere in the docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the documentation of missing parameter we have:
By default if any of the key components are missing the entire document will be ignored
but it is also possible to treat them as if they had a value.
I think it might have been not very clear. So I rephrased it a bit.
InternalMultiTerms.Bucket[][] topBucketsPerOrd = new InternalMultiTerms.Bucket[owningBucketOrds.length][]; | ||
long[] otherDocCounts = new long[owningBucketOrds.length]; | ||
for (int ordIdx = 0; ordIdx < owningBucketOrds.length; ordIdx++) { | ||
collectZeroDocEntriesIfNeeded(owningBucketOrds[ordIdx]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should just remove this functionality for multi-terms? It's always been strange functionality in terms imo, and I could see it being very expensive for multi-terms if there are a lot of high cardinality fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point. We should probably remove it.
if (values.advanceExact(doc)) { | ||
List<Object> objects = new ArrayList<>(); | ||
int valuesCount = values.docValueCount(); | ||
long previous = Long.MAX_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if the last value really was Long.MAX_VALUE? we accept that as an indexable value rigt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the first value (not the last value)? The first value is handled a 3 lines below in
if (previous != val || i == 0) {
The last value shouldn't be an issues unless I am missing something. I think the initial assignment here is not important (it's there to just make the compiler happy). This code was heavily inspired by the terms aggs and didn't see reason to deviate from the terms agg implementation here.
if (values.advanceExact(doc)) { | ||
List<Object> objects = new ArrayList<>(); | ||
int valuesCount = values.docValueCount(); | ||
double previous = Double.MAX_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto to question about longs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
import org.elasticsearch.search.lookup.LeafDocLookup; | ||
import org.elasticsearch.xpack.analytics.AnalyticsPlugin; | ||
|
||
public class MultiTermsAggregatorTests extends AggregatorTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a test (either here, or in the yaml tests) which uses a child metric agg somewhere, just to smoke test that multi-terms does the right thing when child aggs are used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Child aggs are in a module, so we can only add nested or add it into yaml. I will add it to yaml.
@elasticmachine update branch |
Adds a multi_terms aggregation support. The multi terms aggregation works very similarly to the terms aggregation but supports multiple terms. The goal of this PR is to add the basic functionality so it is not optimized at the moment. It will be done in follow up PRs. Closes elastic#65623
Adds a multi_terms aggregation support. The multi terms aggregation works very similarly to the terms aggregation but supports multiple terms. The goal of this PR is to add the basic functionality so it is not optimized at the moment. It will be done in follow up PRs. Closes #65623
Adds a multi_terms aggregation support. The multi terms aggregation works
very similarly to the terms aggregation but supports multiple terms. The goal
of this PR is to add the basic functionality so it is not optimized at the
moment. It will be done in follow up PRs.
Closes #65623