Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Poor performance of regex queries #5097

Open
Jon-AtAWS opened this issue Nov 5, 2022 · 5 comments
Open

[BUG] Poor performance of regex queries #5097

Jon-AtAWS opened this issue Nov 5, 2022 · 5 comments
Labels
enhancement Enhancement or improvement to existing feature or request Indexing & Search Performance This is for any performance related enhancements or bugs

Comments

@Jon-AtAWS
Copy link
Member

Describe the bug

The following query:

"simple_query_string": {
    "query": "/.*innerTerm.*/",
    "fields": [],
    "type": "best_fields",
    "default_operator": "or",
    "max_determinized_states": 10000,
    "enable_position_increments": true,
    "fuzziness": "AUTO",
    "fuzzy_prefix_length": 0,
    "fuzzy_max_expansions": 50,
    "phrase_slop": 0,
    "escape": false,
    "auto_generate_synonyms_phrase_query": true,
    "fuzzy_transpositions": true,
    "boost": 1
 }

Is inefficient and runs for many seconds, especially when there are many fields in the index. When there are also many indices in the query, this can take down the cluster.

Expected behavior

This is one clause of a larger query, but because of the way queries are processed, there's no way to force the filtering to occur first. In this case, we can't use a post-filter, because we need accurate values in the aggregations.

We'd like to reduce the overall cost of running the query.

@Jon-AtAWS Jon-AtAWS added bug Something isn't working untriaged labels Nov 5, 2022
@dblock
Copy link
Member

dblock commented Nov 7, 2022

@Jon-AtAWS Do you have any suggestions of how we could express the order of filtering we'd like in this query to let it happen both ways?

@Jon-AtAWS
Copy link
Member Author

So, let's say this is part of a larger query, like (p-code, excuse the lack of syntax)

bool
  filter
    bool
      must
        some time range
        some field filter
        simple-query-string: regex

And, let's say I know that the time range and field filter will reduce the match set to a few hundred docs, vs. evaluating the cost of the regex as part of the query planning.

Just brainstorming...

  1. We could add an "execution_order": or similar to the clauses. This is likely too hard to guarantee
  2. We could add something like "weight": that gets passed in to Lucene to either substitute for the cost function or modify it
  3. We could add something time related like "stop_after":

We can also work on how Lucene approximates regex queries, to increase their weight an order of magnitude or 2.

@anasalkouz anasalkouz added enhancement Enhancement or improvement to existing feature or request Performance This is for any performance related enhancements or bugs and removed untriaged bug Something isn't working labels Nov 8, 2022
@msfroh
Copy link
Collaborator

msfroh commented Sep 17, 2023

This may benefit from #7057.

Alternatively, it might have already improved in newer versions of Lucene, since (IIRC) it no longer precomputes a bitset of all matching docs across all fields.

@Jon-AtAWS
Copy link
Member Author

The release of search pipelines fixes this problem in one way. We should evaluate the cost of splitting out filters like the above and managing pipeline flow vs. pushing them all into a single query and having Lucene optimize.

@msfroh
Copy link
Collaborator

msfroh commented May 21, 2024

We're also adding support for the wildcard field type to make these queries better (if you plan your mappings around wildcard matching): #5639

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Indexing & Search Performance This is for any performance related enhancements or bugs
Development

No branches or pull requests

5 participants