Skip to content

Faster wildcard search #48852

Closed
Closed
@jpountz

Description

@jpountz

While wildcard queries are often a symptom that content hasn't been properly analyzed, there are also cases when users really need the flexibility of wildcard queries. Could we leverage ngrams to make this kind of queries faster, similarly to how we made prefix queries faster on text fields by indexing edge ngrams?

I tried to think about how this could work and here is an idea:

  • Null bytes would be used as start/end markers in order to make suffix queries possible (*foobar) and prefix queries faster (foobar*).
  • Content would be indexed with a ngram tokenizer that has a fixed gram size, e.g. 5 (could be configurable). It would be used to return a good approximation of the matches of the wildcard query.
  • Doc values would store the original value and could be used for a two-phase verification.

Here are some example queries and how they would run internally with this setup:

Query Approximation Two-phase verification Note
*foobar* fooba AND oobar Check positions Like a phrase query
foobar* \0foob AND fooba AND oobar Check positions We could skip the middle term like Lucene's NGramPhraseQuery
*foobar fooba AND oobar AND obar\0 Check positions We could skip the middle term like Lucene's NGramPhraseQuery
*foo* *foo* Always returns true Need a wildcard query because the substring is shorter than the gram size
foo* \0foo* Always returns true Need a wildcard query because the substring is shorter than the gram size
*foo *foo\0 Always returns true Need a wildcard query because the substring is shorter than the gram size
foobar \0foob AND fooba AND oobar AND obar\0 Check positions We could skip the middle terms like Lucene's NGramPhraseQuery
*foobar*quux* fooba AND oobar Check doc values We don't want to run a wildcard query on *quux*, the approximation we have might be selective enough already

Notes:

  • This might be quite space-intensive on long fields.

Questions:

  • Are there other ways it could work?
  • What is a good default gram size?
  • How much more space does it require compared to keyword indexing?
  • Should it be an option on text/keyword fields or a new field? This way of indexing allows to run exact queries too so we wouldn't need to index as a keyword too in addition to these ngrams.
  • Should we index positions? Whenever we check positions above, we could also check doc values instead.
  • Should we use BINARY or SORTED doc values? BINARY feels more appropriate for the access pattern described above, but compression would be terrible.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions