score fuzzy term by distance #563

Open
buster opened this issue Jun 10, 2019 · 31 comments

@buster

buster commented Jun 10, 2019

While playing around with tantivy, I came up with a query that includes multiple fuzzy term queries.
Unfortunately the results don't make much sense (e.g. imagine a text field "the quick brown fox" next to some completely different text fields, then run a boolean query with multiple fuzzy terms: the results are unusable, because even an exact search term ends up scored the same as some completely different term).
I think it comes down to this: a single fuzzy term query doesn't differentiate between an exact match and a match at some Levenshtein distance.
The test below should find the second added document first (with a higher score), but doesn't.

#[cfg(test)]
mod test {
    use tantivy::query::FuzzyTermQuery;
    use tantivy::collector::TopDocs;
    use tantivy::schema::Schema;
    use tantivy::schema::{TEXT, STORED};
    use tantivy::Index;
    use tantivy::Term;

    #[test]
    pub fn test_fuzzy_term() {
        let mut schema_builder = Schema::builder();
        let country_field = schema_builder.add_text_field("country", TEXT | STORED);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);
        {
            let mut index_writer = index.writer_with_num_threads(1, 10_000_000).unwrap();
            // Matches the query term "robin" at Levenshtein distance 1...
            index_writer.add_document(doc!(
                country_field => "WENN ROT WIE RUBIN",
            ));
            // ...and an exact match, which should come back with a higher score.
            index_writer.add_document(doc!(
                country_field => "WENN ROT WIE ROBIN",
            ));
            index_writer.commit().unwrap();
        }
        let reader = index.reader().unwrap();
        let searcher = reader.searcher();
        {
            let term = Term::from_field_text(country_field, "robin");

            // Allow up to 2 edits, counting transpositions as a single edit.
            let fuzzy_query = FuzzyTermQuery::new(term, 2, true);
            let top_docs = searcher
                .search(&fuzzy_query, &TopDocs::with_limit(100))
                .unwrap();
            assert_eq!(top_docs.len(), 2, "Expected 2 documents");
            let (score, adr) = top_docs[0];
            let document = index.schema().to_named_doc(&searcher.doc(adr).expect("document"));
            let json = index.schema().to_json(&searcher.doc(adr).expect("document"));
            println!("{}", json);
            println!("{:?}", top_docs);
            assert_eq!(document.0.get("country").expect("second document")[0].text().unwrap(), "WENN ROT WIE ROBIN");
        }
    }
}

Somehow the distance needs to be taken into account in the scoring, otherwise the fuzzy query doesn't make much sense.

Or is there some other way to achieve this that I didn't see?

@fulmicoton
Collaborator

You are absolutely correct.

So a simple solution would be to change the score of the FuzzyTermQuery depending on the distance at which it hit.

Bad news: this was never implemented, because the fst automaton did not expose the automaton state when streaming over matching terms.
Good news: partly for this reason I forked the fst crate, and someone already wrote a PR for this that was never merged into the fst crate.

We simply need to merge it into tantivy-fst and tweak the score in the FuzzyTermQuery to solve your issue. Are you interested in the ticket?

@buster
Author

buster commented Jun 10, 2019

Yes, I'm interested in this :)

@fulmicoton
Collaborator

Rephrasing: are you interested in addressing this ticket? Do you need pointers to the things I was talking about above?

@buster
Author

buster commented Jun 14, 2019

I didn't understand that, but if you can point out the relevant pieces, I'd give it a try.

@fulmicoton
Collaborator

No problem, I can search for volunteers on the chat :)

@hntd187 hntd187 self-assigned this Jun 16, 2019
@buster
Author

buster commented Jun 23, 2019

Just wondering: How is this going? Is there a branch for this feature? Also, where is the fst crate fork and the PR?

@hntd187
Contributor

hntd187 commented Jun 23, 2019

The PR for fst has since been deleted and I can't find the commit anywhere, so it'll have to be ported over by hand. This is going to take some time. I probably won't have much time to look at this in the next few days, but I'll get to the grind a bit later in the week.

@hntd187
Contributor

hntd187 commented Jun 26, 2019

@fulmicoton how do we want to tackle this PR for fst? I don't see any way other than reimplementing it by hand from that PR, which I had to dig up, but the commit itself is gone.

@fulmicoton
Collaborator

@hntd187 Can you ask its author? He is nice and usually replies really fast.

Also, if you do not rewrite this code yourself, make sure that there is no license conflict, that he gets the credit somewhere, and that he is happy with the visibility of this credit.

@trinity-1686a
Contributor

@hntd187 is it PR 61? If so, you can get the commit back by running git fetch origin +refs/pull/61/head && git checkout FETCH_HEAD

@hntd187
Contributor

hntd187 commented Aug 12, 2019

So @fulmicoton, we have this in fst now; we're all ready to hook it up in tantivy, right? I'll get that started if you have any more thoughts from our gitter convo.

@lerouxrgd

lerouxrgd commented Mar 17, 2021

Could this be solved by leveraging StreamWithState from the fst crate, which is now available (details in this commit)?

@lerouxrgd

If my understanding is correct, here are the things to do:

That last step isn't completely clear to me; however, do these steps look good?

@fulmicoton
Collaborator

@lerouxrgd This is a perfect summary of what needs to be done.

The last step is not too difficult: the state can be converted into a Levenshtein distance.
https://github.com/tantivy-search/levenshtein-automata/blob/master/src/dfa.rs#L70
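
For illustration, a rough sketch of that conversion using the levenshtein_automata crate linked above (API details may differ slightly between versions):

use levenshtein_automata::{Distance, LevenshteinAutomatonBuilder};

fn main() {
    // Build a DFA accepting terms within distance 2 of "robin",
    // counting transpositions as a single edit.
    let lev_builder = LevenshteinAutomatonBuilder::new(2, true);
    let dfa = lev_builder.build_dfa("robin");

    // Feed the candidate term byte by byte, then ask the final state for its distance.
    let mut state = dfa.initial_state();
    for &b in "rubin".as_bytes() {
        state = dfa.transition(state, b);
    }
    match dfa.distance(state) {
        Distance::Exact(d) => println!("matched at distance {}", d), // prints 1 here
        Distance::AtLeast(_) => println!("beyond the maximum distance"),
    }
}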

Scoring given (overall number of matching docs, doc frequency of the matching term, Levenshtein distance) is then a non-trivial problem.

@maufl

maufl commented Mar 18, 2021

@lerouxrgd Are you planning to work on it? I was thinking of giving it a try once I understand everything.

What I'm currently not understanding is this: the Scorer trait represents a set of documents (since it can be downcast to a DocSet? This part was difficult to follow in the code and documentation), but the score method returns only a single Score. It seems to me that documents matched by the same Query -> Weight -> Scorer are supposed to have the same score? You could of course return a different score depending on the current doc() of the DocSet, but that is probably not very intuitive.

@fulmicoton
Collaborator

@maufl Generally the scorer is a kind of iterator over the sorted document set. Due to the peculiar nature of queries based on the AutomatonWeight, the docset is actually entirely precalculated and stored in a bitset, and that is the iterator that gets returned.

The iterator is not a Rust Iterator...
You are meant to call advance to advance it, and then .doc() and .score() respectively give you the doc id and the score of the current document.
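
For reference, a minimal sketch of that consumption pattern (assuming a tantivy version where DocSet::advance returns the next DocId and TERMINATED marks exhaustion; exact signatures have changed between versions):

use tantivy::query::Scorer;
use tantivy::{DocId, Score, TERMINATED};

// Collect (doc id, score) pairs from a scorer. A freshly built scorer is
// already positioned on its first document (or on TERMINATED if empty).
fn drain_scorer(scorer: &mut dyn Scorer) -> Vec<(DocId, Score)> {
    let mut hits = Vec::new();
    let mut doc = scorer.doc();
    while doc != TERMINATED {
        hits.push((doc, scorer.score()));
        doc = scorer.advance();
    }
    hits
}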

@maufl

maufl commented Mar 18, 2021

@fulmicoton Understood, then I think my intuition was right.

@lerouxrgd I think instead of implementing search_with_state in fst we can use StreamBuilder::with_state as implemented by @hntd187. We could extend the TermStreamer to take a StreamWithState and hold a current_state: A::State. This would require far fewer changes. The only downside is that the TermStreamer will then always make a clone of the state.
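
For illustration, a rough shape of what that could look like (all type and module names below are assumptions, not the current tantivy / tantivy-fst API):

// Rough shape only; names and paths are assumptions.
struct TermStreamerWithState<'a, A>
where
    A: tantivy_fst::Automaton,
    A::State: Clone,
{
    // Stream obtained through StreamBuilder::with_state on the term dictionary's fst.
    stream: tantivy_fst::map::StreamWithState<'a, A>,
    // Automaton state reached for the current term; cloned on every advance,
    // which is the downside mentioned above.
    current_state: Option<A::State>,
}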

@lerouxrgd

@maufl Yes, good point, it might be better to use StreamBuilder::with_state instead. I guess I can give all this a try, at least up to the last step I listed, where the real calculation happens. I am also a bit confused by the DocSet and Scorer interactions; I would need to dig into TermWeight::specialized_scorer and TermScorer to understand how to score the documents of a result set differently.

@fulmicoton Thank you for confirming this is correct. Besides, what would prevent regular BM25 scoring in AutomatonWeight::scorer instead of a ConstScorer with 1.0? Do we have enough info at that point to calculate it through BM25Weight::for_one_term, for instance?

@lerouxrgd

@maufl FYI, I implemented the steps I listed that bubble up Automaton::State into AutomatonWeight::scorer in this branch, following your advice. Now I guess the question is how to score documents.

@maufl

maufl commented Mar 23, 2021

@lerouxrgd That looks very good. I was thinking about the scoring and the scorer. What about calculating the score as 1 / (1 + distance)? For a perfect match we would have a score of 1, and for a greater distance a smaller score. Given that we will probably only ever have very few distinct values for the distance/score, I thought about writing a new scorer that wraps multiple const scorers, one for each distance/score. The new scorer could also directly contain a vector of (BitSetDocSet, Score).
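
Concretely, just as a sketch of that formula:

// Minimal sketch of the proposed formula: an exact match scores 1.0,
// and the score shrinks as the edit distance grows.
fn fuzzy_score(distance: u8) -> f32 {
    1.0 / (1.0 + f32::from(distance))
}
// fuzzy_score(0) == 1.0, fuzzy_score(1) == 0.5, fuzzy_score(2) ≈ 0.33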

@lerouxrgd

@maufl Yes, I was thinking of the same score. Actually the tricky part is still to get the automaton state, because all we have in the API is A: Automaton and it is impossible to simply get the Levenshtein distance from A::State (we don't know what this type is at compile time). The only workaround I found is to use reflection at runtime. I am preparing a PR with all this.
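
Roughly, such a runtime workaround could look like the sketch below (hedged: it assumes the Levenshtein DFA's concrete state type is the u32 used by the levenshtein_automata crate):

use std::any::Any;
use levenshtein_automata::{Distance, DFA};

// Sketch of the runtime-reflection workaround: the generic A::State is opaque
// at this level, so try to downcast it to the DFA's concrete state type
// before asking for the distance. Returns None for non-Levenshtein automata.
fn distance_for_state(dfa: &DFA, state: &dyn Any) -> Option<u8> {
    let state_id = state.downcast_ref::<u32>()?;
    match dfa.distance(*state_id) {
        Distance::Exact(d) => Some(d),
        Distance::AtLeast(_) => None,
    }
}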

@maufl

maufl commented Mar 23, 2021

@lerouxrgd Alternatively you can implement Weight for both specializations of AutomatonWeight separately, i.e. impl Weight for AutomatonWeight<Regex> and impl Weight for AutomatonWeight<DFAWrapper>. That way you will have a concrete automaton type and can use the distance method directly.
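
An outline of that suggestion (hedged: the exact Weight signatures and AutomatonWeight internals depend on the tantivy version, so this is only a skeleton; imports omitted):

// Outline only, not a working implementation.
impl Weight for AutomatonWeight<DFAWrapper> {
    fn scorer(&self, reader: &SegmentReader, boost: Score) -> tantivy::Result<Box<dyn Scorer>> {
        // The automaton type is concrete here, so while streaming matching terms
        // the ending state can be turned into an edit distance via the wrapped
        // DFA's distance() method and baked into the returned scorer, instead of
        // emitting a constant 1.0.
        todo!()
    }

    fn explain(&self, reader: &SegmentReader, doc: DocId) -> tantivy::Result<Explanation> {
        todo!()
    }
}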

@lerouxrgd

Oh yes good idea, let's see what the maintainers think is best.

@maufl

maufl commented Mar 23, 2021

I'm also unsure whether having an extra TermWithStateStreamerBuilder and TermWithStateStreamer is better than just merging the current_state field into TermStreamer. It seems like a lot of duplicated code to me. Let's wait and see what @fulmicoton says.

@fulmicoton
Collaborator

Overall the PR looks good except for one very important detail; I commented on the PR. Good job! The hardest part is done...

@fulmicoton
Collaborator

@lerouxrgd @maufl Following up on the PR. As we are discovering, scoring really is tricky...

The more I think about it, the more I feel that incorporating the Levenshtein distance into a score is hard. The best I can think of for a given query term's score is to pick the minimum distance observed and take the max BM25 score for the term at this distance.

In many ways, a more satisfying way to implement fuzzy search would be to use an Algolia-like approach.
The scoring would work as a lexicographical order of different scoring methods.

First: do all of the terms match perfectly? If not enough docs are matching, or there is a tie, then try with
an overall Levenshtein distance of 1 (the sum of the Levenshtein distances of all terms).
Then try with an overall Levenshtein distance of 2.
Etc.
In case of a tie, break the tie with BM25.

It makes sense both as a great trade-off between search quality and usability, and for performance reasons.
The idea would be to have a search routine independent of the usual tantivy paradigm of the query pushing docs to the collector.

The routine would just be in charge of computing a TopK.

fn fuzzy_search(query: &dyn Query, k: usize /* possibly one extra argument called ranking_policy */) -> Vec<DocAddress> {
    // ...
}

The routine would actually work like the policy above: instead of trying to score all of the docs at a Levenshtein distance of 2 and paying a massive price when it is most likely not useful, try distance 0 first... and then increment from there.

In Algolia, there are more criteria than just Levenshtein distance and BM25, and their order can be configured with a Policy.
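
To make the loop explicit, here is a hedged sketch of that escalating policy; search_at_distance is a hypothetical helper (not an existing tantivy API) that runs the query with the given overall edit-distance budget and returns BM25-scored hits:

use tantivy::query::Query;
use tantivy::{DocAddress, Score};

// Hypothetical helper: run `query` allowing a total Levenshtein budget of
// `distance` across all terms and return up to `k` (BM25 score, address) pairs.
fn search_at_distance(query: &dyn Query, distance: u8, k: usize) -> Vec<(Score, DocAddress)> {
    todo!()
}

// Escalating policy: start strict and only loosen the budget while the
// top-K is not filled; ties within a tier are broken by BM25.
fn fuzzy_search(query: &dyn Query, k: usize, max_distance: u8) -> Vec<DocAddress> {
    let mut hits: Vec<(Score, DocAddress)> = Vec::new();
    for distance in 0..=max_distance {
        hits = search_at_distance(query, distance, k);
        if hits.len() >= k {
            break; // the cheaper tier already fills the top-K
        }
    }
    hits.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    hits.truncate(k);
    hits.into_iter().map(|(_score, addr)| addr).collect()
}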


Do you have projects that would benefit from this PR? @lerouxrgd @maufl In its current shape, I suspect the approach in the PR will not be useful to anyone and will add to tantivy's tech debt.
If you do not have any use for it, I think we should drop this PR or try to extract something from it that could be useful for users.

(Apologies for not seeing this coming earlier)

@maufl

maufl commented Mar 29, 2021

I have a project that is a product search for local shops. I want to build a database with the offers of the shops in my city and have a full text search for it. The search should allow fuzzy searching, i.e. return results even if there is a small typo, but I want exact matches to appear first.

@fulmicoton
Collaborator

@maufl sweet. Do you have an idea of what the correct order for these results should be? Do you want to see a result count and faceting, and how do you define that with a fuzzy query?

@maufl

maufl commented Mar 29, 2021

@fulmicoton (Disclaimer: I don't know much about full text search in general.) I'd like to sort the results based on how good the match was, i.e. whether it was an exact match or not, and maybe on the number of matches in one document. Furthermore, I might like to give matches in a certain field a higher weight, i.e. a match in "name" is better than one in "description". However, I did not try to actually implement this yet.
I also want to see the result count and maybe do pagination. I don't plan to do faceting, as I don't think there will be enough data for it. (I think I will have a name and hopefully a description for each product, but likely not much more.)

Right now I'm building a query like this:

    fn make_query(&self, query_string: &str) -> Box<dyn Query> {
        let mut parts: Vec<(Occur, Box<dyn Query>)> = Vec::new();
        let fields = vec![
            self.name_field,
            self.description_field,
            self.category_field,
            self.color_field,
        ];
        // One fuzzy prefix query (max distance 1) per search term and per field,
        // all combined with Occur::Should.
        for search_term in query_string.to_ascii_lowercase().split(" ") {
            if search_term.is_empty() {
                continue;
            }
            for field in fields.iter() {
                parts.push((
                    Occur::Should,
                    Box::new(FuzzyTermQuery::new_prefix(
                        Term::from_field_text(*field, search_term),
                        1,
                        true,
                    )),
                ))
            }
        }
        Box::new(BooleanQuery::from(parts))
    }

Then I use the Count and TopDocs collectors with limit and offset.
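For reference, that combination looks roughly like this (assuming a tantivy version that has TopDocs::and_offset; `query` comes from make_query above and `searcher` from the index reader):

    use tantivy::collector::{Count, TopDocs};

    // Fetch the total hit count and the third page of 20 results at once.
    let (total, page) = searcher
        .search(&query, &(Count, TopDocs::with_limit(20).and_offset(40)))
        .unwrap();
    // `total` is the number of matching documents; `page` holds up to 20
    // (Score, DocAddress) pairs starting at result 40.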
I hope I did not (ab-)use tantivy in a completely wrong way here :D

@lerouxrgd

@fulmicoton I am not using it directly right now, so as far as I am concerned it is fine to discard this PR. I see that correct scoring is indeed tricky; however, I still feel that just ordering results by 1 / (1 + dist) should be doable (and would fix the original issue) even if it is not the best possible score.

@afbarbaro

Is there a path forward for this issue? I have a use case that really needs fuzzy search on multiple terms to handle user typos and this issue is a blocker.
Thanks!
