Description
Background
The Terms, Significant Terms and Rare Terms aggregations support the include and exclude options to filter the buckets via either:
- a single regexp, e.g. "this.*|that.*"
- a list of exact terms, e.g. ["thisTerm", "thatTerm"]
You can give both an include and an exclude at the same time, but they have to be of the same type: both regexp-based or both set-based. If you mix and match, you get an error (and this is not documented, BTW):
"reason": "[8:34] [terms] failed to parse field [include]",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Cannot mix a set-based include with a regex-based method"
}
The problem
The problem we faced is that we need both a regexp-based include and a set-based exclude. We thought about converting the set-based exclude into a regexp-based exclude so that both could be regexps, like so:
["term1", "term2", "term3"] -> "term1|term2|term3"
Both forms give exactly the same buckets in the aggregation, but the performance is far worse with the regexp.
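For reference, the set-to-regexp conversion described above could be sketched like this. `ExcludeRegexBuilder` is a hypothetical helper, not part of Elasticsearch, and the list of reserved characters is my assumption about the regexp syntax that needs escaping; terms containing none of those characters pass through unchanged.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of Elasticsearch.
class ExcludeRegexBuilder {
    // Assumed set of regexp metacharacters that must be escaped in a literal term.
    private static final String RESERVED = ".?+*|{}[]()\"\\#@&<>~";

    // Prefix every reserved character with a backslash so the term matches literally.
    static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (RESERVED.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Join the escaped terms into a single alternation: t1|t2|t3
    static String toAlternation(List<String> terms) {
        return terms.stream()
                .map(ExcludeRegexBuilder::escape)
                .collect(Collectors.joining("|"));
    }
}
```

With the three terms from the example, `toAlternation` returns "term1|term2|term3".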
To give an idea of how much worse the performance is, here are the timings of a typical aggregation on our index (to allow testing and comparing set and regexp on equal terms, I removed any include parameter):
- without exclude: ~300 ms
- with a set-based exclude containing 100 terms: ~500 ms
- with a regexp-based exclude containing the same 100 terms: ~10 s
So, while setting a set-based exclude is just a bit slower than no exclude, the regexp-based exclude is 20 times slower than the set-based one. We can't afford that kind of performance unfortunately.
The proposed solution
We would like to lift that limitation about having to use the same type of include and exclude. We want to be able to mix and match both kinds of include with both kinds of exclude. That way, we could use a regexp when we need the flexibility of one, and use a set when we can, to keep performance high.
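To make the intended semantics concrete, here is a minimal sketch of a filter that combines a regexp-based include with a set-based exclude. This is not the code from the proof of concept or from IncludeExclude.java: `MixedTermFilter` is a hypothetical class, and it uses `java.util.regex` as a stand-in for the regexp engine Elasticsearch actually uses. It assumes the regexp must match the whole term and that exclude wins over include.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration of mixing a regexp include with a set exclude.
class MixedTermFilter {
    private final Pattern include;      // null means "include everything"
    private final Set<String> exclude;  // exact terms to drop

    MixedTermFilter(String includeRegex, Set<String> excludeTerms) {
        this.include = includeRegex == null ? null : Pattern.compile(includeRegex);
        this.exclude = excludeTerms == null
                ? Collections.emptySet()
                : new HashSet<>(excludeTerms);
    }

    // A term is kept if it is not in the exclude set and fully matches the include regexp.
    boolean accept(String term) {
        if (exclude.contains(term)) {
            return false; // hash lookup, no regexp evaluation needed
        }
        return include == null || include.matcher(term).matches();
    }
}
```

For example, with include "this.*|that.*" and exclude ["thisTerm"], the filter rejects "thisTerm", accepts "thatTerm", and rejects "other".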
I've looked at the relevant code (server/src/main/java/org/elasticsearch/search/aggregations/bucket/terms/IncludeExclude.java) and, with just a bit of refactoring in this file, I've been able to make a proof of concept that achieves that goal. I've confirmed that mixing a regexp and a set gives the performance we expect, which is much faster than two regexps.
So, I'm opening this issue to gather feedback before hopefully getting the green light to implement this properly in a pull request.