score fuzzy term by distance #563

Open
buster opened this issue Jun 10, 2019 · 31 comments

@buster

buster commented Jun 10, 2019

While playing around with tantivy, I came up with a query that includes multiple fuzzy term queries.
Unfortunately the results don't make much sense (e.g. imagine a text field "the quick brown fox" next to some completely different text fields, then run a boolean query with multiple fuzzy terms: the results are unusable, because even an exact search term ends up scored the same as some completely different term).
I think it comes down to this: a single fuzzy term query doesn't differentiate between an exact match and a match at some Levenshtein distance.
The test below should find the second added document first (with a higher score), but doesn't.

#[cfg(test)]
mod test {
    use tantivy::query::FuzzyTermQuery;
    use tantivy::collector::TopDocs;
    use tantivy::schema::Schema;
    use tantivy::schema::{TEXT, STORED};
    use tantivy::Index;
    use tantivy::Term;

    #[test]
    pub fn test_fuzzy_term() {
        let mut schema_builder = Schema::builder();
        let country_field = schema_builder.add_text_field("country", TEXT | STORED);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);
        {
            let mut index_writer = index.writer_with_num_threads(1, 10_000_000).unwrap();
            // Matches the query term "robin" at Levenshtein distance 1...
            index_writer.add_document(doc!(
                country_field => "WENN ROT WIE RUBIN",
            ));
            // ...and an exact match, which should come back with a higher score.
            index_writer.add_document(doc!(
                country_field => "WENN ROT WIE ROBIN",
            ));
            index_writer.commit().unwrap();
        }
        let reader = index.reader().unwrap();
        let searcher = reader.searcher();
        {
            let term = Term::from_field_text(country_field, "robin");

            // Allow up to 2 edits, counting transpositions as a single edit.
            let fuzzy_query = FuzzyTermQuery::new(term, 2, true);
            let top_docs = searcher
                .search(&fuzzy_query, &TopDocs::with_limit(100))
                .unwrap();
            assert_eq!(top_docs.len(), 2, "Expected 2 documents");
            let (score, adr) = top_docs[0];
            let document = index.schema().to_named_doc(&searcher.doc(adr).expect("document"));
            let json = index.schema().to_json(&searcher.doc(adr).expect("document"));
            println!("{}", json);
            println!("{:?}", top_docs);
            assert_eq!(document.0.get("country").expect("second document")[0].text().unwrap(), "WENN ROT WIE ROBIN");
        }
    }
}

Somehow the distance needs to be taken into account in the scoring, otherwise the fuzzy query doesn't make much sense.

Or is there some other way to achieve this that I didn't see?

@fulmicoton
Collaborator

You are absolutely correct.

So a simple solution would be to change the score of the FuzzyTermQuery depending on the distance at which it hit.

Bad news: this was never implemented, because the fst automaton did not expose the automaton state when streaming over matching terms.
Good news: partly for this reason I forked the fst crate, and someone already wrote a PR for this that was never merged into the fst crate.

We simply need to merge it into tantivy-fst and tweak the score in the FuzzyTermQuery to solve your issue. Are you interested in the ticket?

@buster
Author

buster commented Jun 10, 2019

Yes, I'm interested in this :)

@fulmicoton
Collaborator

Rephrasing: are you interested in addressing this ticket? Do you need pointers to the things I was talking about above?

@buster
Author

buster commented Jun 14, 2019

I didn't understand that, but if you can point out the relevant pieces, I'd give it a try.

@fulmicoton
Collaborator

No problem, I can search for volunteers on the chat :)

@hntd187 hntd187 self-assigned this Jun 16, 2019
@buster
Author

buster commented Jun 23, 2019

Just wondering: How is this going? Is there a branch for this feature? Also, where is the fst crate fork and the PR?

@hntd187
Contributor

hntd187 commented Jun 23, 2019

The PR for fst has since been deleted and I can't find the commit anywhere, so it'll have to be ported over by hand. This is going to take some time. I probably won't have much time to look at this in the next few days, but I'll get to the grind a bit later in the week.

@hntd187
Contributor

hntd187 commented Jun 26, 2019

@fulmicoton how do we want to tackle this PR for fst? I don't see any way other than reimplementing it by hand from that PR, which I had to dig up, but the commit itself is gone.

@fulmicoton
Collaborator

@hntd187 Can you ask its author? He is nice and usually replies really fast.

Also, if you do not rewrite this code yourself, make sure that there is no license conflict, that he gets the credit somewhere, and that he is happy with the visibility of this credit.

@trinity-1686a
Contributor

@hntd187 is it PR 61? If so, you can get the commit back by running git fetch origin +refs/pull/61/head && git checkout FETCH_HEAD

@hntd187
Contributor

hntd187 commented Aug 12, 2019

So @fulmicoton, we have this in fst now; we're all ready to hook it up in tantivy, right? I'll get that started if you have any more thoughts from our gitter convo.

@lerouxrgd

lerouxrgd commented Mar 17, 2021

Could this be solved by leveraging StreamWithState from the fst crate, which is now available (details in this commit)?

@lerouxrgd

If my understanding is correct, here are the things to do:

That last step isn't completely clear to me; however, do these steps look good?

@fulmicoton
Collaborator

@lerouxrgd This is a perfect summary of what needs to be done.

The last step is not too difficult: the state can be converted into a Levenshtein distance.
https://github.com/tantivy-search/levenshtein-automata/blob/master/src/dfa.rs#L70
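
For illustration, a rough sketch of that conversion using the levenshtein_automata crate linked above (API details may differ slightly between versions):

use levenshtein_automata::{Distance, LevenshteinAutomatonBuilder};

fn main() {
    // Build a DFA accepting terms within distance 2 of "robin",
    // counting transpositions as a single edit.
    let lev_builder = LevenshteinAutomatonBuilder::new(2, true);
    let dfa = lev_builder.build_dfa("robin");

    // Feed the candidate term byte by byte, then ask the final state for its distance.
    let mut state = dfa.initial_state();
    for &b in "rubin".as_bytes() {
        state = dfa.transition(state, b);
    }
    match dfa.distance(state) {
        Distance::Exact(d) => println!("matched at distance {}", d), // prints 1 here
        Distance::AtLeast(_) => println!("beyond the maximum distance"),
    }
}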

Scoring given (overall number of matching docs, doc frequency of the matching term, Levenshtein distance) is then a non-trivial problem.

@maufl

maufl commented Mar 18, 2021

@lerouxrgd Are you planning to work on it? I was thinking of giving it a try once I understand everything.

What I'm currently not understanding is this: the Scorer trait represents a set of documents (since it can be downcast to a DocSet? This part was difficult to follow in the code and documentation), but the score method returns only a single Score. It seems to me that documents matched by the same Query -> Weight -> Scorer are supposed to have the same score? You could of course return a different score depending on the current doc() of the DocSet, but that is probably not very intuitive.

@fulmicoton
Collaborator

@maufl Generally the scorer is a kind of iterator over the sorted document set. Due to the peculiar nature of queries based on the AutomatonWeight, the docset is actually entirely precalculated and stored in a bitset, and that is the iterator that gets returned.

The iterator is not a Rust Iterator...
You are meant to call advance to advance it, and then .doc() and .score() respectively give you the doc id and the score of the current document.
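
For reference, a minimal sketch of that consumption pattern (assuming a tantivy version where DocSet::advance returns the next DocId and TERMINATED marks exhaustion; exact signatures have changed between versions):

use tantivy::query::Scorer;
use tantivy::{DocId, Score, TERMINATED};

// Collect (doc id, score) pairs from a scorer. A freshly built scorer is
// already positioned on its first document (or on TERMINATED if empty).
fn drain_scorer(scorer: &mut dyn Scorer) -> Vec<(DocId, Score)> {
    let mut hits = Vec::new();
    let mut doc = scorer.doc();
    while doc != TERMINATED {
        hits.push((doc, scorer.score()));
        doc = scorer.advance();
    }
    hits
}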

@maufl

maufl commented Mar 18, 2021

@fulmicoton Understood, then I think my intuition was right.

@lerouxrgd I think instead of implementing search_with_state in fst we can use StreamBuilder::with_state as implemented by @hntd187. We could extend the TermStreamer to take a StreamWithState and hold a current_state: A::State. This would require far fewer changes. The only downside is that the TermStreamer will then always make a clone of the state.
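
For illustration, a rough shape of what that could look like (all type and module names below are assumptions, not the current tantivy / tantivy-fst API):

// Rough shape only; names and paths are assumptions.
struct TermStreamerWithState<'a, A>
where
    A: tantivy_fst::Automaton,
    A::State: Clone,
{
    // Stream obtained through StreamBuilder::with_state on the term dictionary's fst.
    stream: tantivy_fst::map::StreamWithState<'a, A>,
    // Automaton state reached for the current term; cloned on every advance,
    // which is the downside mentioned above.
    current_state: Option<A::State>,
}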

@lerouxrgd

@maufl Yes, good point, it might be better to use StreamBuilder::with_state instead. I guess I can give all this a try, at least up to the last step I listed, where the real calculation happens. I am also a bit confused by the DocSet and Scorer interactions; I would need to dig into TermWeight::specialized_scorer and TermScorer to understand how to score the documents of a result set differently.

@fulmicoton Thank you for confirming this is correct. Besides, what would prevent regular BM25 scoring in AutomatonWeight::scorer instead of a ConstScorer with 1.0? Do we have enough info at that point to calculate it through BM25Weight::for_one_term, for instance?

@lerouxrgd

@maufl FYI, I implemented the steps I listed that bubble up Automaton::State into AutomatonWeight::scorer in this branch, following your advice. Now I guess the question is how to score documents.

@maufl

maufl commented Mar 23, 2021

@lerouxrgd That looks very good. I was thinking about the scoring and the scorer. What about calculating the score as 1 / (1 + distance)? For a perfect match we would have a score of 1, and for a greater distance a smaller score. Given that we will probably only ever have very few distinct values for the distance/score, I thought about writing a new scorer that wraps multiple const scorers, one for each distance/score. The new scorer could also directly contain a vector of (BitSetDocSet, Score).
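
Concretely, just as a sketch of that formula:

// Minimal sketch of the proposed formula: an exact match scores 1.0,
// and the score shrinks as the edit distance grows.
fn fuzzy_score(distance: u8) -> f32 {
    1.0 / (1.0 + f32::from(distance))
}
// fuzzy_score(0) == 1.0, fuzzy_score(1) == 0.5, fuzzy_score(2) ≈ 0.33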

@lerouxrgd

@maufl Yes, I was thinking of the same score. Actually the tricky part is still to get the automaton state, because all we have in the API is A: Automaton and it is impossible to simply get the Levenshtein distance from A::State (we don't know what this type is at compile time). The only workaround I found is to use reflection at runtime. I am preparing a PR with all this.
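
Roughly, such a runtime workaround could look like the sketch below (hedged: it assumes the Levenshtein DFA's concrete state type is the u32 used by the levenshtein_automata crate):

use std::any::Any;
use levenshtein_automata::{Distance, DFA};

// Sketch of the runtime-reflection workaround: the generic A::State is opaque
// at this level, so try to downcast it to the DFA's concrete state type
// before asking for the distance. Returns None for non-Levenshtein automata.
fn distance_for_state(dfa: &DFA, state: &dyn Any) -> Option<u8> {
    let state_id = state.downcast_ref::<u32>()?;
    match dfa.distance(*state_id) {
        Distance::Exact(d) => Some(d),
        Distance::AtLeast(_) => None,
    }
}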

@maufl

maufl commented Mar 23, 2021

@lerouxrgd Alternatively you can implement Weight for both specializations of AutomatonWeight separately, i.e. impl Weight for AutomatonWeight<Regex> and impl Weight for AutomatonWeight<DFAWrapper>. That way you will have a concrete automaton type and can use the distance method directly.
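
An outline of that suggestion (hedged: the exact Weight signatures and AutomatonWeight internals depend on the tantivy version, so this is only a skeleton; imports omitted):

// Outline only, not a working implementation.
impl Weight for AutomatonWeight<DFAWrapper> {
    fn scorer(&self, reader: &SegmentReader, boost: Score) -> tantivy::Result<Box<dyn Scorer>> {
        // The automaton type is concrete here, so while streaming matching terms
        // the ending state can be turned into an edit distance via the wrapped
        // DFA's distance() method and baked into the returned scorer, instead of
        // emitting a constant 1.0.
        todo!()
    }

    fn explain(&self, reader: &SegmentReader, doc: DocId) -> tantivy::Result<Explanation> {
        todo!()
    }
}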

@lerouxrgd

Oh yes good idea, let's see what the maintainers think is best.

@maufl

maufl commented Mar 23, 2021

I'm also unsure whether having an extra TermWithStateStreamerBuilder and TermWithStateStreamer is better than just merging the current_state field into TermStreamer. It seems like a lot of duplicated code to me. Let's wait and see what @fulmicoton says.

@fulmicoton
Collaborator

Overall the PR looks good except for one very important detail; I commented on the PR. Good job! The hardest part is done...

@fulmicoton
Collaborator

@lerouxrgd @maufl Following up on the PR. As we are discovering, scoring really is tricky...

The more I think about it, the more I feel that incorporating the Levenshtein distance into a score is hard. The best I can think of for a given query term's score is to pick the minimum distance observed and take the max BM25 score for the term at this distance.

In many ways, a more satisfying way to implement fuzzy search would be to use an Algolia-like approach.
The scoring would work as a lexicographical order of different scoring methods.

First: do all of the terms match perfectly? If not enough docs are matching, or there is a tie, then try with
an overall Levenshtein distance of 1 (the sum of the Levenshtein distances of all terms).
Then try with an overall Levenshtein distance of 2.
Etc.
In case of a tie, break the tie with BM25.

It makes sense both as a great trade-off between search quality and usability, and for performance reasons.
The idea would be to have a search routine independent of the usual tantivy paradigm of the query pushing docs to the collector.

The routine would just be in charge of computing a TopK.

fn fuzzy_search(query: &dyn Query, k: usize /* possibly one extra argument called ranking_policy */) -> Vec<DocAddress> {
    // ...
}

The routine would actually work like the policy above: instead of trying to score all of the docs at a Levenshtein distance of 2 and paying a massive price when it is most likely not useful, try distance 0 first... and then increment from there.

In Algolia, there are more criteria than just Levenshtein distance and BM25, and their order can be configured with a Policy.
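
To make the loop explicit, here is a hedged sketch of that escalating policy; search_at_distance is a hypothetical helper (not an existing tantivy API) that runs the query with the given overall edit-distance budget and returns BM25-scored hits:

use tantivy::query::Query;
use tantivy::{DocAddress, Score};

// Hypothetical helper: run `query` allowing a total Levenshtein budget of
// `distance` across all terms and return up to `k` (BM25 score, address) pairs.
fn search_at_distance(query: &dyn Query, distance: u8, k: usize) -> Vec<(Score, DocAddress)> {
    todo!()
}

// Escalating policy: start strict and only loosen the budget while the
// top-K is not filled; ties within a tier are broken by BM25.
fn fuzzy_search(query: &dyn Query, k: usize, max_distance: u8) -> Vec<DocAddress> {
    let mut hits: Vec<(Score, DocAddress)> = Vec::new();
    for distance in 0..=max_distance {
        hits = search_at_distance(query, distance, k);
        if hits.len() >= k {
            break; // the cheaper tier already fills the top-K
        }
    }
    hits.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    hits.truncate(k);
    hits.into_iter().map(|(_score, addr)| addr).collect()
}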


Do you have projects that would benefit from this PR? @lerouxrgd @maufl In its current shape, I suspect the approach in the PR will not be useful to anyone and will add to tantivy's tech debt.
If you do not have any use for it, I think we should drop this PR or try to extract something from it that could be useful for users.

(Apologies for not seeing this coming earlier)

@maufl

maufl commented Mar 29, 2021

I have a project that is a product search for local shops. I want to build a database with the offers of the shops in my city and have a full text search for it. The search should allow fuzzy searching, i.e. return results even if there is a small typo, but I want exact matches to appear first.

@fulmicoton
Collaborator

@maufl sweet. Do you have an idea of what the correct order for these results should be? Do you want to see a result count and faceting, and how do you define that with a fuzzy query?

@maufl

maufl commented Mar 29, 2021

@fulmicoton (Disclaimer: I don't know much about full text search in general.) I'd like to sort the results based on how good the match was, i.e. whether it was an exact match or not, and maybe on the number of matches in one document. Furthermore, I might like to give matches in a certain field a higher weight, i.e. a match in "name" is better than one in "description". However, I did not try to actually implement this yet.
I also want to see the result count and maybe do pagination. I don't plan to do faceting, as I don't think there will be enough data for it. (I think I will have a name and hopefully a description for each product, but likely not much more.)

Right now I'm building a query like this:

    fn make_query(&self, query_string: &str) -> Box<dyn Query> {
        let mut parts: Vec<(Occur, Box<dyn Query>)> = Vec::new();
        let fields = vec![
            self.name_field,
            self.description_field,
            self.category_field,
            self.color_field,
        ];
        // One fuzzy prefix query (max distance 1) per search term and per field,
        // all combined with Occur::Should.
        for search_term in query_string.to_ascii_lowercase().split(" ") {
            if search_term.is_empty() {
                continue;
            }
            for field in fields.iter() {
                parts.push((
                    Occur::Should,
                    Box::new(FuzzyTermQuery::new_prefix(
                        Term::from_field_text(*field, search_term),
                        1,
                        true,
                    )),
                ))
            }
        }
        Box::new(BooleanQuery::from(parts))
    }

Then I use the Count and TopDocs collectors with limit and offset.
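For reference, that combination looks roughly like this (assuming a tantivy version that has TopDocs::and_offset; `query` comes from make_query above and `searcher` from the index reader):

    use tantivy::collector::{Count, TopDocs};

    // Fetch the total hit count and the third page of 20 results at once.
    let (total, page) = searcher
        .search(&query, &(Count, TopDocs::with_limit(20).and_offset(40)))
        .unwrap();
    // `total` is the number of matching documents; `page` holds up to 20
    // (Score, DocAddress) pairs starting at result 40.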
I hope I did not (ab-)use tantivy in a completely wrong way here :D

@lerouxrgd

@fulmicoton I am not using it directly right now, so as far as I am concerned it is fine to discard this PR. I see that correct scoring is indeed tricky; however, I still feel that just ordering results by 1 / (1 + dist) should be doable (and would fix the original issue) even if it is not the best possible score.

@afbarbaro

Is there a path forward for this issue? I have a use case that really needs fuzzy search on multiple terms to handle user typos and this issue is a blocker.
Thanks!
