-
-
Notifications
You must be signed in to change notification settings - Fork 672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
score fuzzy term by distance #563
Comments
You are absolutely correct. So a simple solution would be to change the score of the fuzzytermquery depending on the distance it hit. Bad news it was never implemented because the fst automaton did not expose the automaton state when streaming over matching terms. We simply need to merge it in tantivy-fst and tweak the score in the fuzzytermquery to solve your issue. Are you interested in the ticket? |
Yes, i'm interested in this :) |
Rephrasing. Are you interesting in addressing this ticket? Do you need pointers to the stuff I was talking about above? |
I didn't understand that,but if you can point out the relevant pieces, I'd give it a try. |
no problem I can search for volunteers on the chat :) |
Just wondering: How is this going? Is there a branch for this feature? Also, where is the fst crate fork and the PR? |
The PR for fst was since deleted, I can't find the commit anywhere, so it'll have to ported over by hand. This is gonna take some time. I probably won't have much time to look at this in the next few days, but I'll take to the grind a bit later in the week. |
@fulmicoton how do we want to tackle this PR for fst? I don't see much way other than implementing it by hand from that PR, which I had to dig up, but the commit itself is gone. |
@hntd187 Can you ask his author? He is nice and usually replies really fast. Also if you do not rewrite this code yourself, make sure that there is no license conflict and he gets the credit somewhere and his happy with the visibility of this credit. |
@hntd187 is it pr 61? If so, you can get the commit back by running |
So @fulmicoton we got this in fst now, we're all ready to hook it up in tantivy right? I'll get that started if you have any more thoughts from out gitter convo |
Could this be solved by leveraging |
If my understanding is correct, here are the things to do:
That last step isn't completely clear for me, however do these steps look good ? |
@lerouxrgd This is a perfect summary of what needs to done. The last step is not too difficult. The state can be converted into a levenshtein distance. Scoring given |
@lerouxrgd Are you planning to work on it? I was thinking on giving it a try when I understand everything. What I'm currently not understanding is, the |
@maufl Generally the scorer is kind of an iterator, on the sorted document set. Due to the strange nature of queries based on the automatonweight, the docset is actually entirely precalculated and stored into a bitset, and this is the iterator that is returned. The iterator is not a rust Iterator... |
@fulmicoton Understood, then I think my intuition was right. @lerouxrgd I think instead of implementing |
@maufl Yes good point, it might be better to use @fulmicoton Thank you for confirming this is correct. Besides, what would prevent regular BM25 scoring in |
@maufl FYI I implemented the steps I listed that bubble up |
@lerouxrgd That looks very good. I was thinking about the scoring and scorer. What about calculating the score as |
@maufl Yes, I was thinking of the same score. Actually the tricky part is still to get the automaton state because all we have in the API is |
@lerouxrgd Alternatively you can implement |
Oh yes good idea, let's see what the maintainers think is best. |
I'm also unsure whether having an extra |
Overall the PR looks good except one very important detail, I commented on the PR. Good job! The hardest part is done... |
@lerouxrgd @maufl Following on the PR. As we are discovering scoring really is tricky... The more I think about it, the more I feel incorporating the lev distance with a score is hard. The best I can think for a given query term score, is to pick the minimum distance observe and take the max BM25 score for the term at this distance. In many ways, a more satisfying to implement fuzzy search would be to use a algolia like approach. First: Do all of the terms match perfectly? If not enough doc are matching, or there is a tie, then try with It makes sense both as great trade off between search quality and usability, and for performance reasons. The routine would just be in charge of computing a TopK.
The routine would work actually work like the policy above. Instead of trying to score all of the docs with a levenshtein 2 and pay a massive price when it is most likely not useful, try distance 0 first... and then increment from there. In algolia, there are more criteria than just lev distance and bm25, and their order can be configured with a Policy. Do you have projects who would benefit from this PR ? @lerouxrgd @maufl In its current shape, I suspect the approach in the PR will be not be useful to anyone and add to tantivy's tech debt. (Apologies for not seeing this coming earlier) |
I have a project that is a product search for local shops. I want to build a database with the offers of the shops in my city and have a full text search for it. The search should allow fuzzy searching, i.e. return results even if there is a small typo, but I want exact matches to appear first. |
@maufl sweet. Do you have an idea of what is the correct order for these results? Do you want to see a result count and facetting, and how do you define that with a fuzzy query? |
@fulmicoton (Disclaimer, I don't know much about full text search in general) I'd like to sort the results based on how good the match was, i.e. whether it was an exact match or not, and maybe on the number or matches in one document. Furthermore, I might like to give matches in a certain field a higher weight, i.e. a match in "name" is better than in "description". However, I did not try to actually implement this yet. Right now I'm building a query like this: fn make_query(&self, query_string: &str) -> Box<dyn Query> {
let mut parts: Vec<(Occur, Box<dyn Query>)> = Vec::new();
let fields = vec![
self.name_field,
self.description_field,
self.category_field,
self.color_field,
];
for search_term in query_string.to_ascii_lowercase().split(" ") {
if search_term.is_empty() {
continue;
}
for field in fields.iter() {
parts.push((
Occur::Should,
Box::new(FuzzyTermQuery::new_prefix(
Term::from_field_text(*field, search_term),
1,
true,
)),
))
}
}
Box::new(BooleanQuery::from(parts))
} Then I use the |
@fulmicoton I am not using it directly now, so as far as I am concerned it is fine to discard this PR. I see that correct scoring is indeed tricky, however I still feel that just ordering results with |
Is there a path forward for this issue? I have a use case that really needs fuzzy search on multiple terms to handle user typos and this issue is a blocker. |
When playing around with tantivy, i came up with a query including multiple fuzzy term queries.
Unvortunately the results don't make much sense (e.g. imagine a text field "the quick brown fox" and some other completely different text fields and then do a boolean query with multiple fuzzy terms. The results are completely unusable, because even an exact search term results in some completely different term scored the same).
I think it comes down to this: The single fuzzy term query doesn't make a difference between exact matches or some levenshtein distance.
The test should find the second added document first (with a higher score), but doesn't.
Somehow the distance needs to be considered in the scoring, else the fuzzy query doesn't make much sense.
Or is there some other way to achieve this, i didn't see?
The text was updated successfully, but these errors were encountered: