QueryStringQuery doesn't properly account for analysers splitting up strings #248

bcampbell · 2015-10-13T00:42:47Z

The QueryStringQuery behaviour doesn't always give the results you'd expect.
(this is a follow-up of: https://groups.google.com/forum/#!topic/bleve/cxVfZ7VQh3o )

Observed behaviour:

Using the default "en" analyser, you'd expect an unquoted query like:
mother-in-law
to be treated as a single 'thing' and to match as a phrase.
Instead, the query is treated as mother OR law (the "in" is discarded as a stopword).

Expected behaviour:

mother-in-law in the above example should be treated as a phrase.
The "en" analyzer splits mother-in-law up into [mother (pos 0), law (pos 2)], and the query should be a MatchPhraseQuery instead of the current MatchQuery.
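
To make the positions concrete, here is a rough sketch of the kind of output described above. It is a toy stand-in, not bleve's actual "en" analyzer (the real one does more, stemming for one): `toyAnalyze`, `Token` and the tiny `stopWords` list are all made up for illustration, and positions are counted from 0 to match the numbering used above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Token is a term plus the position it held before stop-word removal.
type Token struct {
	Term string
	Pos  int
}

// stopWords is a tiny illustrative list, not bleve's real English stop list.
var stopWords = map[string]bool{"in": true, "the": true, "a": true, "of": true}

// toyAnalyze is a hypothetical stand-in for an "en"-style analyzer: it
// lowercases, splits on anything that is not a letter or digit, and drops
// stop words while keeping the original positions (counted from 0 here).
func toyAnalyze(text string) []Token {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var out []Token
	for pos, w := range words {
		if stopWords[w] {
			continue // the stop word's position is skipped, leaving a gap
		}
		out = append(out, Token{Term: w, Pos: pos})
	}
	return out
}

func main() {
	for _, tok := range toyAnalyze("mother-in-law") {
		fmt.Printf("%s (pos %d)\n", tok.Term, tok.Pos)
	}
}
```

A plain match over these terms only needs one of them to be present; a phrase match would also have to respect the positions, including the gap left by the dropped "in".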


mschoch · 2016-06-26T12:32:24Z

So, looking at this issue again with fresh eyes, it seems to me like it's functioning as desired. Our whole search works by applying the same analyzer to search terms and documents. The "en" analyzer turns "mother-in-law" into "mother" and "law". It's not clear to me how we would treat this as some sort of obvious exception.

I will try it with Elasticsearch and see what it does.

bcampbell · 2016-06-26T22:28:44Z

I'd agree that it's a little obscure (although my users have definitely been really confused by this in the past).

There are two levels of string-splitting going on: the first is where the query parser breaks the input using whitespace. The second is in the Analyser, which could potentially break a string up further, into multiple terms. I think the user intuitively understands that whitespace breaks things up. But I don't think they realise the Analyser might break things up further.

I think the easy and intuitive thing to do is for QueryStringQuery to just use MatchPhraseQuery instead of MatchQuery.

So the query:

ill-gotten gains

Would be treated as:

"ill-gotten" "gains"

thus preserving the ordering of "ill" and "gotten" (using the default 'en' analyser), but not really making any difference to "gains".
I can't think of any cases where using MatchPhraseQuery by default would screw things up, but obviously I could be really wrong about that ;- )
My main concern would be a possible performance implication, but I'd hope that single-term MatchPhraseQuerys were equivalent to MatchQuery anyway...
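
A minimal sketch of the two levels of splitting and the proposed per-chunk phrase treatment, assuming a toy analyzer. None of this is bleve code: `toyTerms`, `Clause` and `phraseClauses` are hypothetical names, and the toy analyzer only lowercases and splits on non-alphanumerics (no stop words, no stemming).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// toyTerms is a hypothetical stand-in for an analyzer: lowercase, then split
// on anything that is not a letter or digit.
func toyTerms(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Clause is one whitespace-separated piece of the query string together
// with the ordered terms the analyzer produced for it.
type Clause struct {
	Input string
	Terms []string
}

// phraseClauses applies both levels of splitting: first on whitespace (the
// query parser's job), then through the analyzer. Each clause keeps its
// terms in order, which is what a phrase match would have to enforce.
func phraseClauses(query string) []Clause {
	var clauses []Clause
	for _, chunk := range strings.Fields(query) {
		clauses = append(clauses, Clause{Input: chunk, Terms: toyTerms(chunk)})
	}
	return clauses
}

func main() {
	for _, c := range phraseClauses("ill-gotten gains") {
		fmt.Printf("%q -> phrase %v\n", c.Input, c.Terms)
	}
}
```

The "gains" clause ends up with a single term, so treating it as a phrase imposes no ordering constraint; only "ill-gotten" is affected.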

bcampbell · 2016-06-26T22:31:37Z

Just for background - as mentioned in the original mailing list thread, my motivation is for matching URLs in fields. My users really expect that:

url:/sport-section/

Would match /articles/sport-section/latest-cricket-scandal but definitely not /articles/politics-section/terrorists-are-just-bad-sports/

I could try and train them that whitespace isn't the only place where strings are broken up and that they need to quote stuff like this, but it seems better to make the default behaviour "feel" right, if possible.

(My alternate query parser already uses MatchPhraseQuery by default. I've not noticed any problems, but then I doubt it's been given the workout that QueryStringQuery has had.)
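
To illustrate the URL case, a rough sketch comparing "any term" matching with in-order phrase matching. This is a toy model rather than how bleve evaluates queries: `toyTerms`, `matchAny` and `matchPhrase` are made-up helpers, and the toy analyzer does no stemming or stop-word removal.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// toyTerms is a hypothetical stand-in for an analyzer: lowercase, then split
// on anything that is not a letter or digit.
func toyTerms(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// matchAny reports whether at least one query term occurs in the document,
// roughly how an OR over the analyzed terms behaves.
func matchAny(doc, query []string) bool {
	seen := make(map[string]bool, len(doc))
	for _, t := range doc {
		seen[t] = true
	}
	for _, q := range query {
		if seen[q] {
			return true
		}
	}
	return false
}

// matchPhrase reports whether the query terms occur in the document
// consecutively and in the same order.
func matchPhrase(doc, query []string) bool {
	if len(query) == 0 {
		return false
	}
	for i := 0; i+len(query) <= len(doc); i++ {
		ok := true
		for j, q := range query {
			if doc[i+j] != q {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	query := toyTerms("/sport-section/")
	urls := []string{
		"/articles/sport-section/latest-cricket-scandal",
		"/articles/politics-section/terrorists-are-just-bad-sports/",
	}
	for _, u := range urls {
		doc := toyTerms(u)
		fmt.Printf("%s\n  any-term: %v  phrase: %v\n", u, matchAny(doc, query), matchPhrase(doc, query))
	}
}
```

In this toy model both URLs satisfy the any-term match (the second one via "section"), while only the first satisfies the phrase match.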

bcampbell changed the title ~~QueryStringQuery doesn't account for analysers splitting up strings~~ QueryStringQuery doesn't properly account for analysers splitting up strings Oct 13, 2015

mschoch added the bug label Jun 26, 2016

mschoch added this to the 1.0 milestone Jun 26, 2016
