Add possibility to boost per analyzer #4

Open
slorber opened this issue Jan 1, 2013 · 2 comments

slorber commented Jan 1, 2013

Hello,

It would be nice to be able to give a boost per analyzer.

I mean, if I index the word "description" with edgengrams(3,7) + stemming + default,

I would like to be able to say:

  • If a match is found thanks to edgengrams, then boost of 0.2
  • If a match is found thanks to stemming, then boost of 0.7
  • If a match is found thanks to the default analyzer, then boost of 1

Because matches with "des" may be less relevant than matches with "descript", which in turn are less relevant than matches with "description", so matches with "description" should come first.

I don't know if it is possible to do, just a suggestion :)

Also, it would be nice to have some information about the effects of a combo analyzer on scoring. The first question that came to mind was, for example, "is the order of the sub-analyzers important?". I think it isn't, since you mention some stuff about duplicate tokens.


ofavre commented Jan 3, 2013

An analyzer in Lucene cannot boost its tokens. There is a BoostAttribute, but it is deprecated in Lucene 3.6.2 and for internal purposes only in Lucene 4.0.0.

What you can do is split the query into multiple queries, each using one of the sub-analyzers and setting the corresponding boost.
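
A minimal sketch of what I mean, using the plain query DSL and the Python client (the index, field, and analyzer names are made up; this is not something the plugin does for you):

```python
# One clause per sub-analyzer, each carrying its own boost.
# Analyzer, field, and index names below are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch()

query = {
    "query": {
        "bool": {
            "should": [
                # exact (default analysis) matches weigh the most
                {"match": {"description": {"query": "description",
                                           "analyzer": "standard",
                                           "boost": 1.0}}},
                # stemmed matches weigh a bit less
                {"match": {"description": {"query": "description",
                                           "analyzer": "my_stemming_analyzer",
                                           "boost": 0.7}}},
                # edge-ngram (prefix-like) matches weigh the least
                {"match": {"description": {"query": "description",
                                           "analyzer": "my_edgengram_analyzer",
                                           "boost": 0.2}}},
            ]
        }
    }
}

results = es.search(index="my_index", body=query)
```

Each clause analyzes the same query text with a different search-time analyzer, so an exact match ends up scoring higher than a stemmed or edge-ngram match.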

Note that if you do not deduplicate and the same token term is generated N times, it will therefore be boosted by a factor of N.

The order of the sub-analyzers does change the order of the generated tokens, but this should have no impact at all, except if you use a filter that limits the number of tokens to the first N.


slorber commented Jan 3, 2013

Sorry, I'm not a Lucene expert :) Thanks for the explanation.

Before using your plugin, I had a multi-field on which I could boost per field, a bit like you suggested.
But the big problem was that it forced me to merge the highlights, which is not convenient (it seems that even with a multi-field, it outputs many highlights for the same "global" field 👎).
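
For reference, a rough sketch of that multi-field setup (all names are illustrative, and I'm using the sub-field mapping syntax rather than the older multi_field type): the same text is indexed into sub-fields with different analyzers, and the boosts are applied per field at query time:

```python
# Hypothetical mapping: one source field copied into sub-fields, each
# analyzed differently (analyzer names are placeholders).
mapping = {
    "properties": {
        "description": {
            "type": "string",
            "analyzer": "standard",
            "fields": {
                "stemmed": {"type": "string", "analyzer": "my_stemming_analyzer"},
                "ngram":   {"type": "string", "analyzer": "my_edgengram_analyzer"},
            },
        }
    }
}

# Per-field boosts at query time via multi_match field boosts (the ^ syntax).
query = {
    "query": {
        "multi_match": {
            "query": "description",
            "fields": ["description^1", "description.stemmed^0.7", "description.ngram^0.2"],
        }
    }
}
```

The drawback is exactly the one above: each sub-field produces its own highlight fragments, which then have to be merged by hand.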

By the way, shouldn't deduplication be the default?
Because when a document matches on a field that uses a combo analyzer, it may be unexpectedly boosted compared to a field that does not use a combo analyzer.

And when using deduplication, does this mean the first duplicate token is kept while the others are dropped?
Are you sure the order doesn't have any impact, particularly on highlighting?
With or without term_vector=with_positions_offsets?

Take a look at this problem I ran into, which I haven't solved yet:
http://stackoverflow.com/questions/11303660/elasticsearch-edgengram-highlight-term-vector-bad-highlights
I wouldn't like to get the smallest token's highlight on a combo-analyzer field that matches on both the ngram and the stemmed tokens.
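
For reference, the term-vector option I mention is set per field in the mapping; a minimal sketch with an illustrative field name:

```python
# Store term vectors with positions and offsets for a field, which the
# highlighter can then use (field name is illustrative).
mapping = {
    "properties": {
        "description": {
            "type": "string",
            "term_vector": "with_positions_offsets",
        }
    }
}
```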
