Use Levenshtein distance to score documents in fuzzy term queries #998

Closed · wants to merge 2 commits

Conversation

lerouxrgd

Fix for #563

let const_scorer = ConstScorer::new(doc_bitset, boost);
Ok(Box::new(const_scorer))

let scorer = Union::<_, SumCombiner>::from(scorers);

Will this sum up the score of all ConstScorers and report the sum as score for each document?

Author

As I understand it, yes. So if a doc contains japan and japin and we search for japon, then its final score will be 0.5 + 0.5 = 1.0


Maybe I did not understand this method correctly then. I thought it would return a scorer for multiple different documents, and in that case it would sum the scores of different documents.

@lerouxrgd (Author)

In this comment @maufl suggested an alternative specialization-based implementation to address the A::State type issue. I can update this PR with it if needed.

let score = automaton_score(self.automaton.as_ref(), state);
let segment_postings =
    inverted_index.read_postings_from_terminfo(term_info, IndexRecordOption::Basic)?;
let scorer = ConstScorer::new(segment_postings, boost * score);
Collaborator

This is impractical.
It may end up being a list of way too many scorers.

You should probably use a TAAT (term-at-a-time) strategy here.
Let's say you know how to compute the score associated with one term match, as a function of its doc_freq, term_freq and, most importantly, the distance. (Note that the doc_freq alone is not necessarily the best indication: if I am at Levenshtein distance 1 of two words, the one with the highest document frequency is probably the one I was shooting for.)

You probably want the score for the FuzzyTermQuery to be the max of the term scores of all the terms that were found in the document, not the sum.

So you could keep the bitset that was there before.
In addition to it, you could allocate a Vec<Score> with a len equal to max_doc and initialize it to 0.

As you go through the terms, you then update the score of the docs that match:

score[doc] = max(score[doc], termscore(lev_distance, term_frequency, doc_frequency, ...))

Once the bitset and the vector of scores have been populated, you can return a Scorer that iterates through the bitset and returns the precomputed scores. You can for instance implement a wrapper around BitSetDocSet:

struct BitSetDocSetWithPrecomputedScore {
    underlying_bitset: BitSetDocSet,
    scores: Vec<Score>,
}

impl DocSet for BitSetDocSetWithPrecomputedScore {
    // delegate doc(), advance(), size_hint(), ... to `underlying_bitset`
}

impl Scorer for BitSetDocSetWithPrecomputedScore {
    fn score(&mut self) -> Score {
        let doc = self.doc();
        self.scores[doc as usize]
    }
}
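
A minimal sketch of the fill step described above. The `matched_terms` iterator and `build_scorer_inputs` name are hypothetical (the real code would walk the term dictionary and read postings); the bitset calls are tantivy's `BitSet::with_max_value` / `insert`:

fn build_scorer_inputs(
    max_doc: u32,
    matched_terms: impl Iterator<Item = (u8, Vec<u32>)>, // (lev distance, matching doc ids)
) -> (BitSet, Vec<Score>) {
    let mut doc_bitset = BitSet::with_max_value(max_doc);
    let mut scores = vec![0.0f32; max_doc as usize];
    for (lev_distance, docs) in matched_terms {
        // per-term score as used elsewhere in this PR: 1 / (1 + dist)
        let term_score = 1.0 / (1.0 + f32::from(lev_distance));
        for doc in docs {
            doc_bitset.insert(doc);
            // keep the max over matching terms, not the sum
            scores[doc as usize] = scores[doc as usize].max(term_score);
        }
    }
    (doc_bitset, scores)
}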

Collaborator

The AutomatonWeight is used by more than the fuzzy term query.

For the regex query for instance, this extra vec with scores is too big an overhead. Please find a solution to avoid the extra cost.

Author

My understanding was that the doc_ids were not necessarily between 0 and max_doc, so I didn't consider using a score vector like scores[doc as usize]. Was I wrong, or should it be a HashMap instead of a Vec?

As for the scoring formula, I see how to get doc_freq from term_info for the current segment, but I still need to sum those across all segments, right? And as we only have access to a specific SegmentReader here, how do we get the others without a Searcher?

Moreover, to get the term_freq I still need the SegmentPostings, so I don't see how to reconcile that with the previous BitSet/BitSetDocSet approach. All this made me use the simple 1 / (1 + dist) scoring instead.

Finally, to avoid impacting regex queries I think we can go back to the previous implementation for the generic case, and just have a specialized impl Weight for Automaton<DFAWrapper> as @maufl suggested.

Collaborator

> My understanding was that the doc_ids were not necessarily between 0 and max_doc

You are guaranteed that doc_id < max_doc.

Collaborator

The scoring part seems very difficult. I commented on the issue.

I'd like us to take a step back and think longer about what spec we want.

Contributor

Here are my two stabs at the scoring problem.

I think it might be useful to approach the question of "how do you adjust scores in fuzzy searching" by asking "how likely was it that the initial query term was really an incorrect version of another term". If you knew the probability of an error, you could weight the score of each fuzzy term as score_adj = score * P(T|Q), where P(T|Q) is "the probability that term T is being searched for given query Q". Just multiplying by P(T|Q) has issues (floating point error, a few high-probability events tend to dominate, etc.), so -log(P(T_i)), i.e. I(T_i) (the Shannon information), should be used instead. This leaves us with score_adj = score * (1 / I(T_i)) (I'm not entirely sure this is correct).

That leaves the question of how you get P(T_i) or I(T_i). Long story short, I(T_i) can be approximated by the Levenshtein distance. This leaves us at score_adj = score * (1 / (α + dist)), where α is I(Q) (the information given that an initial query with Levenshtein distance = 0 is correct). It just so happens that score_adj = score * 1 / (1 + dist) is valid here! α = 1 implies that the probability of a Lev dist of 0 is 10%; α = .04575749056 would correspond to a 90% probability.
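
A tiny sketch of that adjustment as a standalone function (the names are illustrative, not part of this PR); α = 1.0 reproduces the 1 / (1 + dist) factor used in the diff:

fn adjust_score(score: f32, lev_dist: u8, alpha: f32) -> f32 {
    // score_adj = score * 1 / (alpha + dist)
    score / (alpha + f32::from(lev_dist))
}

// e.g. adjust_score(1.0, 1, 1.0) == 0.5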

Another approach would be to try and fix the scoring function itself. BM25 uses inverse document frequency to compensate for words that rarely show up in documents. One interpretation is that the IDF function is a stand-in for the amount of information a document containing a term conveys. As explained above, Lev dist can represent the information that a search being incorrect conveys. To incorporate Lev dist into BM25, it might be useful to investigate the effect of subtracting Lev dist from the IDF:

score(D, Q) = Sum_i_n( (IDF(q_i) - LevDist(q_i)) * (f(q_i, D) * (k + 1)) / (f(q_i, D) + k * (1 - b + b * |D| / avgdl)) )

This would leave the score of dist=0 terms unchanged and would lower scores of results with higher Lev dist.
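
A self-contained sketch of that modified per-term contribution (parameter names are mine; this is not tantivy's actual BM25 code):

fn bm25_term_with_lev(
    idf: f32,         // IDF(q_i)
    lev_dist: f32,    // LevDist(q_i)
    tf: f32,          // f(q_i, D): term frequency in the document
    doc_len: f32,     // |D|
    avg_doc_len: f32, // avgdl
    k: f32,
    b: f32,
) -> f32 {
    // the IDF component is reduced by the Levenshtein distance of the match
    let norm = tf + k * (1.0 - b + b * doc_len / avg_doc_len);
    (idf - lev_dist) * (tf * (k + 1.0)) / norm
}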


/// `TermWithStateStreamerBuilder` is a helper object used to define
/// a range of terms that should be streamed.
pub struct TermWithStateStreamerBuilder<'a, A = AlwaysMatch>
@fulmicoton (Collaborator) Mar 25, 2021

I think it is ok to just have a TermWithStateStreamer and avoid the code duplication.

Most of the time the state is small and "Copy".

Author

So is it OK if I remove the extra TermWithState* structs I added, and instead add the state to all the TermStreamer* structs, as @maufl also suggested?

Collaborator

Yes

let id = unsafe { *id };

let dist = dfa.0.distance(id).to_u8() as f32;
1.0 / (1.0 + dist)
Collaborator

Can we document this score somewhere in the doc of the FuzzyTermScorer?

Ideally I think this should be implemented as an array [f32; 5] that is passed by the user. (5 is sufficient: we do not handle any Levenshtein distance above 4.)

I'm OK with [1f32, 0.5f32, 0.3333f32, ...] as a default.
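
A hedged sketch of what such a user-provided table might look like (the names are illustrative, not part of this PR; the last two default values simply extend the 1 / (1 + dist) pattern):

/// Per-distance score table: index = Levenshtein distance, value = score.
const DEFAULT_DISTANCE_SCORES: [f32; 5] = [1.0, 0.5, 0.3333, 0.25, 0.2];

fn distance_score(table: &[f32; 5], dist: u8) -> f32 {
    // distances above 4 are not handled, hence the fixed length of 5
    table[dist as usize]
}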

Author

Just to be sure, isn't the max Levenshtein distance 2 in this range?

Author

Or does the transposition below multiply it by 2 again?

Collaborator

Ah yes, you are right!
The limit of 4 is within the levenshtein-automata library. I reduced it in tantivy.

@LifeIsStrange commented Nov 19, 2021

This is not a topic I am well versed in, but is this PR going beyond a regular Levenshtein distance ranking and using a more elaborate Levenshtein automaton? If the latter, then this is the strategy used by Lucene.

@fulmicoton (Collaborator)

It is using a Levenshtein automaton.

@LifeIsStrange commented Nov 19, 2021

Nice to hear!
Also, this might be a useful analysis?
See also: a faster alternative to Levenshtein automata?
