[BUG] Poor performance of regex queries #5097

Jon-AtAWS · 2022-11-05T23:05:45Z

Describe the bug

The following query:

"simple_query_string": {
    "query": "/.*innerTerm.*/",
    "fields": [],
    "type": "best_fields",
    "default_operator": "or",
    "max_determinized_states": 10000,
    "enable_position_increments": true,
    "fuzziness": "AUTO",
    "fuzzy_prefix_length": 0,
    "fuzzy_max_expansions": 50,
    "phrase_slop": 0,
    "escape": false,
    "auto_generate_synonyms_phrase_query": true,
    "fuzzy_transpositions": true,
    "boost": 1
 }

Is inefficient and runs for many seconds, especially when there are many fields in the index. When there are also many indices in the query, this can take down the cluster.

Expected behavior

This is one clause of a larger query, but because of the way queries are processed, there's no way to force the filtering to occur first. In this case, we can't use a post-filter, because we need accurate values in the aggregations.

We'd like to reduce the overall cost of running the query.

The text was updated successfully, but these errors were encountered:

dblock · 2022-11-07T21:45:51Z

@Jon-AtAWS Do you have any suggestions of how we could express the order of filtering we'd like in this query to let it happen both ways?

Jon-AtAWS · 2022-11-08T14:14:54Z

So, let's say this is part of a larger query, like (p-code, excuse the lack of syntax)

bool
  filter
    bool
      must
        some time range
        some field filter
        simple-query-string: regex

And, let's say I know that the time range and field filter will reduce the match set to a few hundred docs, vs. evaluating the cost of the regex as part of the query planning.

Just brainstorming...

We could add an "execution_order": or similar to the clauses. This is likely too hard to guarantee
We could add something like "weight": that gets passed in to Lucene to either substitute for the cost function or modify it
We could add something time related like "stop_after":

We can also work on how Lucene approximates regex queries, to increase their weight an order of magnitude or 2.

msfroh · 2023-09-17T23:18:04Z

This may benefit from #7057.

Alternatively, it might have already improved in newer versions of Lucene, since (IIRC) it no longer precomputes a bitset of all matching docs across all fields.

Jon-AtAWS · 2023-09-18T14:38:31Z

The release of search pipelines fixes this problem in one way. We should evaluate the cost of splitting out filters like the above and managing pipeline flow vs. pushing them all into a single query and having Lucene optimize.

msfroh · 2024-05-21T18:10:25Z

We're also adding support for the wildcard field type to make these queries better (if you plan your mappings around wildcard matching): #5639

Jon-AtAWS added bug Something isn't working untriaged labels Nov 5, 2022

andrross added the Indexing & Search label Nov 7, 2022

anasalkouz added enhancement Enhancement or improvement to existing feature or request Performance This is for any performance related enhancements or bugs and removed untriaged bug Something isn't working labels Nov 8, 2022

anasalkouz added Migration:Backlog and removed Migration:Backlog labels Mar 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Poor performance of regex queries #5097

[BUG] Poor performance of regex queries #5097

Jon-AtAWS commented Nov 5, 2022

dblock commented Nov 7, 2022

Jon-AtAWS commented Nov 8, 2022

msfroh commented Sep 17, 2023

Jon-AtAWS commented Sep 18, 2023

msfroh commented May 21, 2024

[BUG] Poor performance of regex queries #5097

[BUG] Poor performance of regex queries #5097

Comments

Jon-AtAWS commented Nov 5, 2022

dblock commented Nov 7, 2022

Jon-AtAWS commented Nov 8, 2022

msfroh commented Sep 17, 2023

Jon-AtAWS commented Sep 18, 2023

msfroh commented May 21, 2024