Add possibility to boost per analyzer #4

Open
slorber opened this issue Jan 1, 2013 · 2 comments

slorber commented Jan 1, 2013

Hello,

It would be nice to be able to give a boost per analyzer.

I mean, if I index the word "description" with edgengrams(3,7) + stemming + default,

I would like to be able to say:

  • If a match is found thanks to edgengrams, then boost of 0.2
  • If a match is found thanks to stemming, then boost of 0.7
  • If a match is found thanks to the default analyzer, then boost of 1

Because matches with "des" may be less relevant than matches with "descript", which in turn are less relevant than matches with "description", so matches with "description" should come first.

I don't know if it is possible to do, just a suggestion :)

Also, it would be nice to have some information about the effects of a combo analyzer on scoring. The first question that came to mind was, for example, "is the order of the sub-analyzers important?". I think it isn't, since you mention some stuff about duplicate tokens.


ofavre commented Jan 3, 2013

An analyzer in Lucene cannot boost its tokens. There is a BoostAttribute, but it is deprecated in Lucene 3.6.2 and for internal purposes only in Lucene 4.0.0.

What you can do is split the query into multiple queries, each using one of the sub-analyzers and setting the corresponding boost.
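
A minimal sketch of what I mean, using the plain query DSL and the Python client (the index, field, and analyzer names are made up; this is not something the plugin does for you):

```python
# One clause per sub-analyzer, each carrying its own boost.
# Analyzer, field, and index names below are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch()

query = {
    "query": {
        "bool": {
            "should": [
                # exact (default analysis) matches weigh the most
                {"match": {"description": {"query": "description",
                                           "analyzer": "standard",
                                           "boost": 1.0}}},
                # stemmed matches weigh a bit less
                {"match": {"description": {"query": "description",
                                           "analyzer": "my_stemming_analyzer",
                                           "boost": 0.7}}},
                # edge-ngram (prefix-like) matches weigh the least
                {"match": {"description": {"query": "description",
                                           "analyzer": "my_edgengram_analyzer",
                                           "boost": 0.2}}},
            ]
        }
    }
}

results = es.search(index="my_index", body=query)
```

Each clause analyzes the same query text with a different search-time analyzer, so an exact match ends up scoring higher than a stemmed or edge-ngram match.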

Note that if you do not deduplicate and the same token term is generated N times, it will therefore be boosted by a factor of N.

The order of the sub-analyzers does change the order of the generated tokens, but this should have no impact at all, except if you use a filter that limits the number of tokens to the first N.


slorber commented Jan 3, 2013

Sorry, I'm not a Lucene expert :) Thanks for the explanation.

Before using your plugin, I had a multi-field on which I could boost per field, a bit like you suggested.
But the big problem was that it forced me to merge the highlights, which is not convenient (it seems that even with a multi-field, it outputs many highlights for the same "global" field 👎).
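
For reference, a rough sketch of that multi-field setup (all names are illustrative, and I'm using the sub-field mapping syntax rather than the older multi_field type): the same text is indexed into sub-fields with different analyzers, and the boosts are applied per field at query time:

```python
# Hypothetical mapping: one source field copied into sub-fields, each
# analyzed differently (analyzer names are placeholders).
mapping = {
    "properties": {
        "description": {
            "type": "string",
            "analyzer": "standard",
            "fields": {
                "stemmed": {"type": "string", "analyzer": "my_stemming_analyzer"},
                "ngram":   {"type": "string", "analyzer": "my_edgengram_analyzer"},
            },
        }
    }
}

# Per-field boosts at query time via multi_match field boosts (the ^ syntax).
query = {
    "query": {
        "multi_match": {
            "query": "description",
            "fields": ["description^1", "description.stemmed^0.7", "description.ngram^0.2"],
        }
    }
}
```

The drawback is exactly the one above: each sub-field produces its own highlight fragments, which then have to be merged by hand.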

By the way, shouldn't deduplication be the default?
Because when a document matches on a field that uses a combo analyzer, it may be unexpectedly boosted compared to a field that does not use a combo analyzer.

And when using deduplication, does this mean the first duplicate token is kept while the others are dropped?
Are you sure the order doesn't have any impact, particularly on highlighting?
With or without term_vector=with_positions_offsets?

Take a look at this problem I ran into, which I haven't solved yet:
http://stackoverflow.com/questions/11303660/elasticsearch-edgengram-highlight-term-vector-bad-highlights
I wouldn't like to get the smallest token's highlight on a combo-analyzer field that matches on both the ngram and the stemmed tokens.
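
For reference, the term-vector option I mention is set per field in the mapping; a minimal sketch with an illustrative field name:

```python
# Store term vectors with positions and offsets for a field, which the
# highlighter can then use (field name is illustrative).
mapping = {
    "properties": {
        "description": {
            "type": "string",
            "term_vector": "with_positions_offsets",
        }
    }
}
```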
