While wildcard queries are often a symptom that content hasn't been properly analyzed, there are also cases when users really need the flexibility of wildcard queries. Could we leverage ngrams to make this kind of query faster, similarly to how we made prefix queries faster on `text` fields by indexing edge ngrams?
I tried to think about how this could work and here is an idea:
- Null bytes would be used as start/end markers in order to make suffix queries possible (`*foobar`) and prefix queries faster (`foobar*`).
- Content would be indexed with an ngram tokenizer that has a fixed gram size, e.g. 5 (could be configurable). It would be used to return a good approximation of the matches of the wildcard query.
- Doc values would store the original value and could be used for a two-phase verification.
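To make the tokenization a bit more concrete, here is a rough sketch in plain Java (no Lucene dependency, the class and method names are made up) of the grams that would be produced for a value with a gram size of 5:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: expand a field value into fixed-size grams with \0
// start/end markers, which is what the proposed tokenizer would emit.
public class MarkerNGrams {

    // Gram size is assumed to be 5 here; it could be made configurable.
    static final int GRAM_SIZE = 5;

    static List<String> grams(String value) {
        String marked = "\0" + value + "\0"; // \0 marks the start and the end of the value
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + GRAM_SIZE <= marked.length(); i++) {
            grams.add(marked.substring(i, i + GRAM_SIZE));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "foobar" produces [\0foob, fooba, oobar, obar\0]
        System.out.println(grams("foobar"));
    }
}
```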
Here are some example queries and how they would run internally with this setup:
Query | Approximation | Two-phase verification | Note |
---|---|---|---|
`*foobar*` | `fooba AND oobar` | Check positions | Like a phrase query |
`foobar*` | `\0foob AND fooba AND oobar` | Check positions | We could skip the middle term like Lucene's NGramPhraseQuery |
`*foobar` | `fooba AND oobar AND obar\0` | Check positions | We could skip the middle term like Lucene's NGramPhraseQuery |
`*foo*` | `*foo*` | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
`foo*` | `\0foo*` | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
`*foo` | `*foo\0` | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
`foobar` | `\0foob AND fooba AND oobar AND obar\0` | Check positions | We could skip the middle terms like Lucene's NGramPhraseQuery |
`*foobar*quux*` | `fooba AND oobar` | Check doc values | We don't want to run a wildcard query on `*quux*`, the approximation we have might be selective enough already |
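As a rough illustration of the rewrite, here is how the approximation for a `*foobar*`-style pattern could be built as a conjunction of term queries on the gram field (the helper below is made up, and the two-phase verification against positions or doc values would still run on top of it, possibly skipping middle grams like `NGramPhraseQuery` does):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Illustration only: build the approximation for a *infix* pattern as a
// conjunction of all the grams of the infix.
public class WildcardApproximation {

    static final int GRAM_SIZE = 5; // assumed gram size

    static Query approximate(String field, String infix) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (int i = 0; i + GRAM_SIZE <= infix.length(); i++) {
            // e.g. "foobar" -> fooba AND oobar, as in the table above
            builder.add(new TermQuery(new Term(field, infix.substring(i, i + GRAM_SIZE))), Occur.FILTER);
        }
        return builder.build();
    }
}
```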
Notes:
- This might be quite space-intensive on long fields.
Questions:
- Are there other ways it could work?
- What is a good default gram size?
- How much more space does it require compared to `keyword` indexing?
- Should it be an option on `text`/`keyword` fields or a new field? This way of indexing allows running exact queries too, so we wouldn't need to index as a `keyword` in addition to these ngrams.
- Should we index positions? Whenever we check positions above, we could also check doc values instead.
- Should we use BINARY or SORTED doc values? BINARY feels more appropriate for the access pattern described above, but compression would be terrible.
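On the doc values question, here is roughly what the verification step could look like with BINARY doc values (the wildcard-to-regex conversion and the class names below are just for illustration; a real implementation would reuse the doc values iterator per segment and visit docs in order):

```java
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;

// Illustration only: re-check the original value stored in BINARY doc values
// against the full wildcard pattern for docs that survived the approximation.
public class DocValuesVerification {

    // Convert a wildcard pattern such as *foobar*quux* into a regex; only '*'
    // is handled here to keep the sketch short.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (String literal : wildcard.split("\\*", -1)) {
            regex.append(Pattern.quote(literal)).append(".*");
        }
        regex.setLength(regex.length() - 2); // drop the trailing .*
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    static boolean matches(LeafReader reader, String field, int doc, Pattern pattern) throws IOException {
        BinaryDocValues values = DocValues.getBinary(reader, field);
        return values.advanceExact(doc) && pattern.matcher(values.binaryValue().utf8ToString()).matches();
    }
}
```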