This repository was archived by the owner on Apr 4, 2023. It is now read-only.
Avoid a prefix-related worst-case scenario in the proximity criterion #733
Pull Request
Related issue
Somewhat fixes (until merged into meilisearch) meilisearch/meilisearch#3118
What does this PR do?
When a query ends with a word followed by a prefix, such as `word pr`, we first determine whether `pr` could possibly be in the proximity prefix database before querying it. There are then three possibilities:

1. `pr` is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of `pr` through the FST and query the regular proximity databases.
2. `pr` is in the prefix cache but cannot be found in the proximity prefix databases. In this case, we partially disable the proximity ranking rule for the pair `word pr`. This is done as follows: we only look for documents where `word` is in proximity to `pr` exactly (no derivations).
3. `pr` is in the prefix cache and can be found in the proximity prefix databases. In this case we simply query the proximity prefix databases.

Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is:
Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query `heard pr` (IF `pr` is the prefix of more than 200 words in the dataset):

[
  { "text": "I heard there is a faster proximity criterion" },
  { "text": "I heard there is a faster but less relevant proximity criterion" }
]

Regarding (2), it means that these two documents would be considered equally relevant according to the proximity rule for the query `faster pro`:

[
  { "text": "I heard there is a faster but less relevant proximity criterion" },
  { "text": "I heard there is a faster proximity criterion" }
]

But the following document would be considered more relevant than the two documents above:

{ "text": "I heard there is a faster swimmer who is competing in the pro section of the competition " }

Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything.
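The three possibilities described earlier, together with the two limits on the proximity prefix databases (no prefixes longer than 2 bytes, no proximities larger than 4), can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code in this PR: the enum, the constants, and the function names are hypothetical, and membership in the prefix cache is passed in as a plain boolean instead of being looked up in a real index.

```rust
/// Hypothetical names: how the pair `word prefix` could be resolved.
#[derive(Debug, PartialEq, Eq)]
enum PrefixStrategy {
    /// Case 1: the prefix is not in the prefix cache, so its word
    /// derivations are listed and the regular proximity databases are queried.
    ExpandDerivations,
    /// Case 2: the prefix is cached but cannot be in the proximity prefix
    /// databases, so only the exact word is looked up (no derivations).
    ExactWordOnly,
    /// Case 3: the prefix is cached and can be in the proximity prefix
    /// databases, which are queried directly.
    ProximityPrefixDb,
}

// Limits taken from the description above.
const MAX_PREFIX_BYTES: usize = 2;
const MAX_STORED_PROXIMITY: u8 = 4;

/// Returns true if the (prefix, proximity) pair could possibly be stored
/// in the proximity prefix databases. `str::len` counts bytes, not chars.
fn could_be_in_proximity_prefix_db(prefix: &str, proximity: u8) -> bool {
    prefix.len() <= MAX_PREFIX_BYTES && proximity <= MAX_STORED_PROXIMITY
}

/// Picks one of the three cases. `in_prefix_cache` is an assumed input
/// standing in for a lookup in the real prefix cache.
fn choose_strategy(prefix: &str, proximity: u8, in_prefix_cache: bool) -> PrefixStrategy {
    if !in_prefix_cache {
        PrefixStrategy::ExpandDerivations
    } else if could_be_in_proximity_prefix_db(prefix, proximity) {
        PrefixStrategy::ProximityPrefixDb
    } else {
        PrefixStrategy::ExactWordOnly
    }
}
```

Under these assumptions, a cached two-byte prefix such as `pr` at proximity 3 goes to the proximity prefix databases, while a cached three-byte prefix such as `pro`, or `pr` at proximity 5, falls back to the exact-word lookup.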
Performance
I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the `songs` benchmark dataset.

Performance is often significantly better, but there is also one regression in the set-based implementation with the query `b b b b b b b b b b`.