Mixing regexp-based and set-based include and exclude in *Terms aggregations

# Background

The Terms, Significant Terms and Rare Terms aggregations support the `include` and `exclude` options to filter the buckets via either:

* a single regexp e.g. `"this.*|that.*"`
* a list of exact terms e.g. `["thisTerm", "thatTerm"]`
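
To make the two forms concrete, here is a minimal sketch of the request bodies, built as Python dicts. The aggregation name `my_terms` and the field name `tags` are placeholders for illustration, not names from our index.

```python
import json

# Hypothetical aggregation and field names, used only for illustration.
regexp_filtered = {
    "size": 0,
    "aggs": {
        "my_terms": {
            "terms": {
                "field": "tags",
                # regexp-based: a single pattern string
                "include": "this.*|that.*",
            }
        }
    },
}

set_filtered = {
    "size": 0,
    "aggs": {
        "my_terms": {
            "terms": {
                "field": "tags",
                # set-based: a list of exact terms
                "include": ["thisTerm", "thatTerm"],
            }
        }
    },
}

# Print the JSON that would be sent as the search request body.
print(json.dumps(regexp_filtered, indent=2))
print(json.dumps(set_filtered, indent=2))
```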

You can specify both an `include` and an `exclude` at the same time, but they must be of the same type: both regexp-based or both set-based. If you mix the two types, you get an error (and this restriction is not documented, by the way).

```
        "reason": "[8:34] [terms] failed to parse field [include]",
        "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Cannot mix a set-based include with a regex-based method"
        }
```
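
Until the restriction is lifted, a client can detect the mix before sending the request. The following is only a sketch of such a check. `filter_kind` and `is_mixed` are hypothetical helpers, not part of any Elasticsearch client.

```python
def filter_kind(spec):
    """Classify an include/exclude value as 'regexp', 'set' or None (absent)."""
    if spec is None:
        return None
    if isinstance(spec, str):
        return "regexp"
    if isinstance(spec, (list, tuple, set, frozenset)):
        return "set"
    raise TypeError(f"unsupported include/exclude value: {spec!r}")


def is_mixed(include=None, exclude=None):
    """Return True when both options are given and their kinds differ."""
    inc, exc = filter_kind(include), filter_kind(exclude)
    return inc is not None and exc is not None and inc != exc


# The combination we need: regexp-based include with a set-based exclude.
print(is_mixed(include="this.*|that.*", exclude=["term1", "term2"]))
# Same kind on both sides is accepted today.
print(is_mixed(include=["thisTerm"], exclude=["thatTerm"]))
```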

# The problem

The problem we faced is that we need both a regexp-based include and a set-based exclude. We thought about converting the set-based exclude into a regexp-based exclude so that both could be regexps, like so:

    ["term1", "term2", "term3"] -> "term1|term2|term3"

Both forms give exactly the same buckets in the aggregation, as long as the terms contain no regexp special characters. However, performance is much worse with the regexp.
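
The conversion above can be sketched as follows. The list of reserved characters is our reading of the Elasticsearch regexp syntax documentation and should be treated as an assumption. `escape_term` and `terms_to_regexp` are hypothetical helper names.

```python
# Characters we assume are reserved in the Lucene/Elasticsearch regexp syntax
# (standard operators plus the optional ones); check this for your version.
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_term(term):
    """Backslash-escape reserved characters so the term matches literally."""
    return "".join("\\" + ch if ch in LUCENE_RESERVED else ch for ch in term)


def terms_to_regexp(terms):
    """Join exact terms into a single alternation pattern."""
    return "|".join(escape_term(t) for t in terms)


print(terms_to_regexp(["term1", "term2", "term3"]))  # term1|term2|term3
```

Terms made only of letters and digits pass through unchanged, which is why the simple `|` join works for the example in this issue.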

To give an idea of how much worse, here are the timings of a typical aggregation on our index (I removed the `include` parameter so that the set-based and regexp-based excludes could be tested and compared on their own):

* without exclude: ~300 ms
* with a set-based exclude containing 100 terms: ~500 ms
* with a regexp-based exclude containing the same 100 terms: ~10 s

So, while a set-based exclude is only a bit slower than no exclude at all, the regexp-based exclude is *20 times slower* than the set-based one. Unfortunately, we can't afford that kind of performance.

# The proposed solution

We would like to lift the restriction that `include` and `exclude` must be of the same type, so that either kind of include can be combined with either kind of exclude. That way, we could use a regexp where we need its flexibility and a set wherever possible, to keep performance high.
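
The intended semantics can be sketched in a few lines. This is an illustration only, not the Elasticsearch implementation: Python's `re` stands in for the Lucene regexp engine (with `fullmatch` approximating its anchored matching), and `make_filter` is a hypothetical name.

```python
import re


def make_filter(include=None, exclude=None):
    """Build a term predicate from any mix of regexp (str) and set (list) specs."""

    def compile_one(spec):
        if spec is None:
            return None
        if isinstance(spec, str):
            # regexp-based: the whole term must match the pattern
            pattern = re.compile(spec)
            return lambda term: pattern.fullmatch(term) is not None
        # set-based: exact membership lookup
        members = frozenset(spec)
        return lambda term: term in members

    inc = compile_one(include)
    exc = compile_one(exclude)

    def accept(term):
        if inc is not None and not inc(term):
            return False
        if exc is not None and exc(term):
            return False
        return True

    return accept


# Regexp-based include combined with a set-based exclude.
accept = make_filter(include="this.*|that.*", exclude=["thisTerm"])
print([t for t in ["thisTerm", "thatTerm", "other"] if accept(t)])  # ['thatTerm']
```

Each side is compiled independently, so the type of `include` places no constraint on the type of `exclude`.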

I've looked at the relevant code (`server/src/main/java/org/elasticsearch/search/aggregations/bucket/terms/IncludeExclude.java`) and, with just a bit of refactoring in this file, I've been able to make a proof of concept that achieves that goal. I've confirmed that mixing a regexp and a set gives the performance we expect, which is much faster than two regexps.

So, I'm opening this issue to gather feedback before hopefully getting the green light to implement this properly in a pull request.
