Use Levenshtein distance to score documents in fuzzy term queries #998
Conversation
let const_scorer = ConstScorer::new(doc_bitset, boost);
Ok(Box::new(const_scorer))

let scorer = Union::<_, SumCombiner>::from(scorers);
Will this sum up the score of all ConstScorers and report the sum as the score for each document?
As I understand it, yes. So if a doc contains japan and japin and we search for japon, then its final score will be 0.5 + 0.5 = 1.0.
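For the record, a minimal standalone sketch of that summing behaviour; ConstPostings and summed_score are hypothetical helpers, not tantivy's actual Union/SumCombiner API:

// Toy illustration: both fuzzy variants of "japon" match doc 7 with a constant
// score of 0.5, so the combined score reported for doc 7 is 1.0.
struct ConstPostings {
    docs: Vec<u32>,
    score: f32,
}

fn summed_score(scorers: &[ConstPostings], doc: u32) -> f32 {
    scorers
        .iter()
        .filter(|s| s.docs.contains(&doc))
        .map(|s| s.score)
        .sum()
}

fn main() {
    let japan = ConstPostings { docs: vec![3, 7], score: 0.5 };
    let japin = ConstPostings { docs: vec![7], score: 0.5 };
    assert_eq!(summed_score(&[japan, japin], 7), 1.0); // 0.5 + 0.5
}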
Maybe I did not understand this method correctly then. I thought it would return a scorer for multiple different documents, and in this case sum the scores of different documents.
let score = automaton_score(self.automaton.as_ref(), state);
let segment_postings =
    inverted_index.read_postings_from_terminfo(term_info, IndexRecordOption::Basic)?;
let scorer = ConstScorer::new(segment_postings, boost * score);
This is impractical.
It may end up being a list of way too many scorers.
You should probably use a TAAT (term-at-a-time) strategy here.
Let's say you know how to compute the score associated with one term match, as a function of its doc_freq, term_freq and, most importantly, the distance (note that the doc_freq alone is not necessarily the best indication... If I am at Levenshtein distance 1 of two words, the one with the highest document frequency is probably the one that I was shooting for).
You probably want the score for the fuzzy term query to be the max of the term scores of all the terms that were found in the document, not the sum.
So you could keep the bitset that was there before.
In addition to it, you could open a Vec<Score> with a len equal to max_doc and initialize it to 0.
As you go through the terms, you then update the score of the docs that match:
score[doc] = max(score[doc], termscore(lev_distance, term_frequency, doc_frequency, ...))
Once the bitset and the vector of scores have been populated, you can return a Scorer that iterates through the bitset and returns the computed scores. You can for instance implement a wrapper around the BitSetDocSet:
struct BitSetDocSetWithPrecomputedScore {
    underlying_bitset: BitSetDocSet,
    scores: Vec<Score>,
}

impl DocSet for BitSetDocSetWithPrecomputedScore {
    // delegate advance(), doc(), etc. to `underlying_bitset`
    // ...
}

impl Scorer for BitSetDocSetWithPrecomputedScore {
    fn score(&mut self) -> Score {
        let doc = self.doc();
        self.scores[doc as usize]
    }
}
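A rough sketch of the term-at-a-time population pass described above, assuming doc ids are below max_doc; the helper names and the (distance, matching docs) input shape are hypothetical, not tantivy API:

// Walk every matching term once, mark its docs in a bitset stand-in, and keep
// the best per-document score in a Vec<f32> indexed by doc id.
fn populate_scores(
    terms: &[(u8, Vec<u32>)], // (levenshtein distance, matching doc ids) per term
    max_doc: u32,
) -> (Vec<bool>, Vec<f32>) {
    let mut matched = vec![false; max_doc as usize]; // stand-in for the BitSet
    let mut scores = vec![0.0f32; max_doc as usize];
    for (distance, docs) in terms {
        let term_score = 1.0 / (1.0 + *distance as f32); // placeholder term score
        for &doc in docs {
            matched[doc as usize] = true;
            // score[doc] = max(score[doc], termscore(...)) as described above
            scores[doc as usize] = scores[doc as usize].max(term_score);
        }
    }
    (matched, scores)
}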
The AutomatonWeight is used by more than the fuzzy term query.
For the regex query, for instance, this extra Vec with scores is too big an overhead. Please find a solution to avoid the extra cost.
My understanding was that doc_ids were not necessarily between 0 and max_doc, so I didn't consider using a score vector like scores[doc as usize]. Was I wrong, or should it be a HashMap instead of a Vec?
As for the scoring formula, I see how to get doc_freq from term_info for the current segment, but I still need to sum those across all segments, right? And as we only have access to a specific SegmentReader here, how do we get the others without a Searcher?
Moreover, to get the term_freq I still need the SegmentPostings, so I don't see how to reconcile that with the previous Bitset/BitSetDocSet approach. All this made me use the simple 1 / (1 + dist) scoring instead.
Finally, to avoid impacting regex queries, I think we can get back to the previous implementation for the generic case and just have a specialized impl Weight for Automaton<DFAWrapper>, as @maufl suggested (rough sketch below).
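For what it's worth, the rough shape of that split could look something like this; all names are hypothetical, and the existing automaton path would keep its current bitset-only scorer:

// The generic automaton path stays score-free, while a fuzzy-specific weight
// owns whatever per-distance scoring data it needs.
struct GenericAutomatonWeight; // regex & co: bitset only, no extra Vec<Score>

struct FuzzyTermWeight {
    distance_boosts: [f32; 5], // e.g. [1.0, 0.5, 0.3333, 0.25, 0.2]
}

impl FuzzyTermWeight {
    fn term_score(&self, lev_distance: u8) -> f32 {
        self.distance_boosts[lev_distance as usize]
    }
}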
> My understanding was that doc_ids were not necessarily between 0 and max_doc

You are guaranteed that doc_id < max_doc.
The scoring part seems very difficult. I commented on the issue.
I'd like us to take a step back and think longer about the spec we want.
Here are my two stabs at the score problem.
I think it might be useful to approach the question of "how do you adjust scores in fuzzy searching" by asking "how likely was it that the initial query term was really an incorrect version of another term". If you knew the probability of an error, you could weight the score of each fuzzy term as score_adj = score * P(T|Q), where P(T|Q) is "the probability that term T is being searched for given query Q". Just multiplying by P(T|Q) has issues (floating point error, a few high-probability events tend to dominate, etc.), so -log(P(T_i)) or I(T_i) (the Shannon information) should be used instead. This leaves us with score_adjusted = score * (1 / I(T_i)) (I'm not entirely sure this is correct). This leaves the question of how you get P(T_i) or I(T_i). Long story short, I(T_i) can be approximated by the Levenshtein distance. This leaves us at score_adj = score * (1 / (α + dist)), where α is I(Q) (the information given that an initial query with Levenshtein distance = 0 is correct). It just so happens that score_adj = score * 1 / (1 + dist) is valid here! α = 1 implies that the probability of a Lev dist of 0 is 10%; α = 0.04575749056 would correspond to a 90% probability.
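A tiny sketch just to make that adjustment concrete; the function name is illustrative only:

// score_adj = score / (alpha + dist); alpha = 1.0 reproduces the 1 / (1 + dist)
// default used in this PR, and an exact match keeps its full score.
fn adjusted_score(base_score: f32, lev_distance: u8, alpha: f32) -> f32 {
    base_score / (alpha + lev_distance as f32)
}

fn main() {
    assert_eq!(adjusted_score(2.0, 0, 1.0), 2.0); // exact match: unchanged
    assert_eq!(adjusted_score(2.0, 1, 1.0), 1.0); // distance 1: halved
}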
Another approach would be to try to fix the scoring function itself. BM25 uses inverse document frequency to compensate for words that rarely show up in documents. One interpretation is that the IDF function is a stand-in for the amount of information a document containing a term conveys. As explained above, Lev dist can represent the information that a search being incorrect conveys. To incorporate Lev dist into BM25, it might be useful to investigate the effect of subtracting Lev dist from the IDF:
score(D, Q) = Sum_{i=1}^{n} (IDF(q_i) - LevDist(q_i)) * (f(q_i, D) * (k + 1)) / (f(q_i, D) + k * (1 - b + b * |D| / avgdl))
This would leave the score of dist = 0 terms unchanged and would lower the scores of results with higher Lev dist.
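A literal transcription of that modified term contribution, with the usual BM25 parameters passed in explicitly; nothing here is existing tantivy code:

// Modified BM25 term contribution with the Levenshtein distance subtracted
// from the IDF component, as proposed above (illustrative only).
fn bm25_term_with_lev(
    idf: f32,
    lev_dist: f32,
    term_freq: f32,   // f(q_i, D)
    doc_len: f32,     // |D|
    avg_doc_len: f32, // avgdl
    k: f32,
    b: f32,
) -> f32 {
    let norm = 1.0 - b + b * (doc_len / avg_doc_len);
    (idf - lev_dist) * (term_freq * (k + 1.0)) / (term_freq + k * norm)
}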
/// `TermWithStateStreamerBuilder` is a helper object used to define
/// a range of terms that should be streamed.
pub struct TermWithStateStreamerBuilder<'a, A = AlwaysMatch>
I think it is ok to just have a TermWithStateStreamer and avoid the code duplication.
Most of the time the state is small and "Copy".
So is it OK if I remove the extra TermWithState* structs I added, and instead add the state to all the TermStreamer* structs, as @maufl also suggested?
Yes
let id = unsafe { *id };

let dist = dfa.0.distance(id).to_u8() as f32;
1.0 / (1.0 + dist)
Can we document this score somewhere in the doc of the FuzzyTermScorer?
Ideally I think this should be implemented as an array [f32; 5] that is passed by the user. (5 is sufficient. We do not handle any Levenshtein distance above 4.)
I'm ok with [1f32, 0.5f32, 0.3333f32, ...] as a default.
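Something along these lines, perhaps; the constant and function names are only illustrative:

// Per-distance boost table as suggested above; index = Levenshtein distance.
// Defaults follow 1 / (1 + dist); a user-supplied [f32; 5] would override it.
const DEFAULT_DISTANCE_BOOSTS: [f32; 5] = [1.0, 0.5, 0.3333, 0.25, 0.2];

fn fuzzy_term_boost(boosts: &[f32; 5], lev_distance: u8) -> f32 {
    boosts[lev_distance as usize]
}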
Just to be sure, isn't the max Levenshtein distance 2 in this range?
Or does the transposition below multiply it by 2 again?
Ah yes, you are right!
The limit of 4 is within the levenshtein-automata library. I reduced it in tantivy.
Force-pushed from 0c60a44 to 9f32b22
This is not a topic I am well versed in, but is this PR using a plain Levenshtein distance ranking, or a more elaborate Levenshtein automaton? If the latter, then this is the strategy used by Lucene.
It is using a Levenshtein automaton.
Nice to hear!
Force-pushed from e550a98 to 84e0c75
Fix for #563