Use Levenshtein distance to score documents in fuzzy term queries #998

Closed · wants to merge 2 commits

Conversation

lerouxrgd

Fix for #563

let const_scorer = ConstScorer::new(doc_bitset, boost);
Ok(Box::new(const_scorer))

let scorer = Union::<_, SumCombiner>::from(scorers);

Will this sum up the score of all ConstScorers and report the sum as score for each document?

Author

As I understand it, yes. So if a doc contains japan and japin and we search for japon, then its final score will be 0.5 + 0.5 = 1.0


Maybe I did not understand this method correctly then. I thought it would return a scorer for multiple different documents, and in that case it would sum the scores of different documents.

@lerouxrgd (Author)

In this comment @maufl suggested an alternative specialization-based implementation to address the A::State type issue. I can update this PR with it if needed.

let score = automaton_score(self.automaton.as_ref(), state);
let segment_postings =
    inverted_index.read_postings_from_terminfo(term_info, IndexRecordOption::Basic)?;
let scorer = ConstScorer::new(segment_postings, boost * score);
Collaborator

This is impractical.
It may end up being a list of way too many scorers.

You should probably use a TAAT (term-at-a-time) strategy here.
Let's say you know how to compute the score associated with one term match, as a function of its doc_freq, term_freq and, most importantly, the distance. (Note that the doc_freq alone is not necessarily the best indication: if I am at Levenshtein distance 1 of two words, the one with the highest document frequency is probably the one I was shooting for.)

You probably want the score for the FuzzyTermQuery to be the max of the term scores of all the terms that were found in the document, not the sum.

So you could keep the bitset that was there before.
In addition to it, you could allocate a Vec<Score> with a len equal to max_doc and initialize it to 0.

As you go through the terms, you then update the score of the docs that match:

score[doc] = max(score[doc], termscore(lev_distance, term_frequency, doc_frequency, ...))

Once the bitset and the vector of scores have been populated, you can return a Scorer that iterates through the bitset and returns the precomputed scores. You can for instance implement a wrapper around BitSetDocSet:

struct BitSetDocSetWithPrecomputedScore {
    underlying_bitset: BitSetDocSet,
    scores: Vec<Score>,
}

impl DocSet for BitSetDocSetWithPrecomputedScore {
    // delegate doc(), advance(), size_hint(), ... to `underlying_bitset`
}

impl Scorer for BitSetDocSetWithPrecomputedScore {
    fn score(&mut self) -> Score {
        let doc = self.doc();
        self.scores[doc as usize]
    }
}
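
A minimal sketch of the fill step described above. The `matched_terms` iterator and `build_scorer_inputs` name are hypothetical (the real code would walk the term dictionary and read postings); the bitset calls are tantivy's `BitSet::with_max_value` / `insert`:

fn build_scorer_inputs(
    max_doc: u32,
    matched_terms: impl Iterator<Item = (u8, Vec<u32>)>, // (lev distance, matching doc ids)
) -> (BitSet, Vec<Score>) {
    let mut doc_bitset = BitSet::with_max_value(max_doc);
    let mut scores = vec![0.0f32; max_doc as usize];
    for (lev_distance, docs) in matched_terms {
        // per-term score as used elsewhere in this PR: 1 / (1 + dist)
        let term_score = 1.0 / (1.0 + f32::from(lev_distance));
        for doc in docs {
            doc_bitset.insert(doc);
            // keep the max over matching terms, not the sum
            scores[doc as usize] = scores[doc as usize].max(term_score);
        }
    }
    (doc_bitset, scores)
}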

Collaborator

The AutomatonWeight is used by more than the fuzzy term query.

For the regex query for instance, this extra vec with scores is too big an overhead. Please find a solution to avoid the extra cost.

Author

My understanding was that the doc_ids were not necessarily between 0 and max_doc, so I didn't consider using a score vector like scores[doc as usize]. Was I wrong, or should it be a HashMap instead of a Vec?

As for the scoring formula, I see how to get doc_freq from term_info for the current segment, but I still need to sum those across all segments, right? And as we only have access to a specific SegmentReader here, how do we get the others without a Searcher?

Moreover, to get the term_freq I still need the SegmentPostings, so I don't see how to reconcile that with the previous BitSet/BitSetDocSet approach. All this made me use the simple 1 / (1 + dist) scoring instead.

Finally, to avoid impacting regex queries I think we can go back to the previous implementation for the generic case, and just have a specialized impl Weight for Automaton<DFAWrapper> as @maufl suggested.

Collaborator

> My understanding was that the doc_ids were not necessarily between 0 and max_doc

You are guaranteed that doc_id < max_doc.

Collaborator

The scoring part seems very difficult. I commented on the issue.

I'd like us to take a step back and think longer about what spec we want.

Contributor

Here are my two stabs at the scoring problem.

I think it might be useful to approach the question of "how do you adjust scores in fuzzy searching" by asking "how likely was it that the initial query term was really an incorrect version of another term". If you knew the probability of an error, you could weight the score of each fuzzy term as score_adj = score * P(T|Q), where P(T|Q) is "the probability that term T is being searched for given query Q". Just multiplying by P(T|Q) has issues (floating point error, a few high-probability events tend to dominate, etc.), so -log(P(T_i)), i.e. I(T_i) (the Shannon information), should be used instead. This leaves us with score_adj = score * (1 / I(T_i)) (I'm not entirely sure this is correct).

That leaves the question of how you get P(T_i) or I(T_i). Long story short, I(T_i) can be approximated by the Levenshtein distance. This leaves us at score_adj = score * (1 / (α + dist)), where α is I(Q) (the information given that an initial query with Levenshtein distance = 0 is correct). It just so happens that score_adj = score * 1 / (1 + dist) is valid here! α = 1 implies that the probability of a Lev dist of 0 is 10%; α = .04575749056 would correspond to a 90% probability.
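
A tiny sketch of that adjustment as a standalone function (the names are illustrative, not part of this PR); α = 1.0 reproduces the 1 / (1 + dist) factor used in the diff:

fn adjust_score(score: f32, lev_dist: u8, alpha: f32) -> f32 {
    // score_adj = score * 1 / (alpha + dist)
    score / (alpha + f32::from(lev_dist))
}

// e.g. adjust_score(1.0, 1, 1.0) == 0.5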

Another approach would be to try and fix the scoring function itself. BM25 uses inverse document frequency to compensate for words that rarely show up in documents. One interpretation is that the IDF function is a stand-in for the amount of information a document containing a term conveys. As explained above, Lev dist can represent the information that a search being incorrect conveys. To incorporate Lev dist into BM25, it might be useful to investigate the effect of subtracting Lev dist from the IDF:

score(D, Q) = Sum_i_n( (IDF(q_i) - LevDist(q_i)) * (f(q_i, D) * (k + 1)) / (f(q_i, D) + k * (1 - b + b * |D| / avgdl)) )

This would leave the score of dist=0 terms unchanged and would lower scores of results with higher Lev dist.
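
A self-contained sketch of that modified per-term contribution (parameter names are mine; this is not tantivy's actual BM25 code):

fn bm25_term_with_lev(
    idf: f32,         // IDF(q_i)
    lev_dist: f32,    // LevDist(q_i)
    tf: f32,          // f(q_i, D): term frequency in the document
    doc_len: f32,     // |D|
    avg_doc_len: f32, // avgdl
    k: f32,
    b: f32,
) -> f32 {
    // the IDF component is reduced by the Levenshtein distance of the match
    let norm = tf + k * (1.0 - b + b * doc_len / avg_doc_len);
    (idf - lev_dist) * (tf * (k + 1.0)) / norm
}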


/// `TermWithStateStreamerBuilder` is a helper object used to define
/// a range of terms that should be streamed.
pub struct TermWithStateStreamerBuilder<'a, A = AlwaysMatch>
@fulmicoton (Collaborator) Mar 25, 2021

I think it is ok to just have a TermWithStateStreamer and avoid the code duplication.

Most of the time the state is small and "Copy".

Author

So is it OK if I remove the extra TermWithState* structs I added, and instead add the state to all the TermStreamer* structs, as @maufl also suggested?

Collaborator

Yes

let id = unsafe { *id };

let dist = dfa.0.distance(id).to_u8() as f32;
1.0 / (1.0 + dist)
Collaborator

Can we document this score somewhere in the doc of the FuzzyTermScorer?

Ideally I think this should be implemented as an array [f32; 5] that is passed by the user. (5 is sufficient: we do not handle any Levenshtein distance above 4.)

I'm OK with [1f32, 0.5f32, 0.3333f32, ...] as a default.
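
A hedged sketch of what such a user-provided table might look like (the names are illustrative, not part of this PR; the last two default values simply extend the 1 / (1 + dist) pattern):

/// Per-distance score table: index = Levenshtein distance, value = score.
const DEFAULT_DISTANCE_SCORES: [f32; 5] = [1.0, 0.5, 0.3333, 0.25, 0.2];

fn distance_score(table: &[f32; 5], dist: u8) -> f32 {
    // distances above 4 are not handled, hence the fixed length of 5
    table[dist as usize]
}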

Author

Just to be sure, isn't the max Levenshtein distance 2 in this range?

Author

Or does the transposition below multiply it by 2 again?

Collaborator

Ah yes, you are right!
The limit of 4 is within the levenshtein-automata library. I reduced it in tantivy.

@LifeIsStrange commented Nov 19, 2021

This is not a topic I am well versed in, but is this PR going beyond a regular Levenshtein distance ranking and using a more elaborate Levenshtein automaton? If the latter, then this is the strategy used by Lucene.

@fulmicoton (Collaborator)

It is using a Levenshtein automaton.

@LifeIsStrange commented Nov 19, 2021

Nice to hear!
Also, this might be a useful analysis?
See also: a faster alternative to Levenshtein automata?
