QueryStringQuery doesn't properly account for analysers splitting up strings #248

Open
bcampbell opened this issue Oct 13, 2015 · 3 comments

@bcampbell
Contributor

The QueryStringQuery behaviour doesn't always give the results you'd expect.
(this is a follow-up of: https://groups.google.com/forum/#!topic/bleve/cxVfZ7VQh3o )

Observed behaviour:

Using the default "en" analyser, you'd expect an unquoted query like:
mother-in-law
to be treated as a single 'thing' and to match as a phrase.
Instead, the query is treated as mother OR law (the "in" is discarded as a stopword).

Expected behaviour:

mother-in-law in the above example should be treated as a phrase.
The "en" analyzer splits mother-in-law up into [mother (pos 0), law (pos 2)], and the query should be a MatchPhraseQuery instead of the current MatchQuery.
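To make the split concrete, here is a minimal sketch of a toy analyser that reproduces the [mother (pos 0), law (pos 2)] result. It is not bleve's "en" analyzer: `toyAnalyze`, `Token` and the tiny stopword list are all hypothetical names made up for illustration, stemming is left out, and positions are 0-based only to match the notation above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Token is a term plus the position it held in the original token stream.
type Token struct {
	Term string
	Pos  int
}

// stopwords is a tiny illustrative list, NOT bleve's real "en" stopword list.
var stopwords = map[string]bool{"in": true, "the": true, "a": true, "of": true}

// toyAnalyze lowercases the input, splits on anything that is not a letter
// or digit, and drops stopwords while keeping each survivor's position.
func toyAnalyze(s string) []Token {
	words := strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var out []Token
	for i, w := range words {
		if stopwords[w] {
			continue // position i is skipped, which leaves a gap
		}
		out = append(out, Token{Term: w, Pos: i})
	}
	return out
}

func main() {
	for _, t := range toyAnalyze("mother-in-law") {
		fmt.Printf("%s (pos %d)\n", t.Term, t.Pos)
	}
	// prints:
	// mother (pos 0)
	// law (pos 2)
}
```

The gap at position 1 is what a phrase match could use to keep "mother" and "law" in their original relative order, whereas a plain match only sees two independent terms.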

@bcampbell bcampbell changed the title QueryStringQuery doesn't account for analysers splitting up strings QueryStringQuery doesn't properly account for analysers splitting up strings Oct 13, 2015
@mschoch mschoch added the bug label Jun 26, 2016
@mschoch mschoch added this to the 1.0 milestone Jun 26, 2016
@mschoch
Contributor

mschoch commented Jun 26, 2016

So, looking at this issue again with fresh eyes it seems to me like it's functioning as desired. Our whole search works by applying the same analyzer to search terms and documents. The "en" analyzer turns "mother-in-law" into "mother" and "law". It's not clear to me how we would treat this as some sort of obvious exception.

I will try it with Elasticsearch and see what it does.

@bcampbell
Contributor Author

I'd agree that it's a little obscure (although my users have definitely been really confused by this in the past).

There are two levels of string-splitting going on: the first is where the query parser breaks the input using whitespace. The second is in the Analyser, which could potentially break a string up further, into multiple terms. I think the user intuitively understands that whitespace breaks things up. But I don't think they realise the Analyser might break things up further.

I think the easy and intuitive thing to do is for QueryStringQuery to just use MatchPhraseQuery instead of MatchQuery.

So the query:

ill-gotten gains

Would be treated as:

"ill-gotten" "gains"

thus preserving the ordering of "ill" and "gotten" (using the default 'en' analyser), but not really making any difference to "gains".
I can't think of any cases where using MatchPhraseQuery by default would screw things up, but obviously I could be really wrong about that ;- )
My main concern would be a possible performance implication, but I'd hope that single-term MatchPhraseQuerys were equivalent to MatchQuery anyway...
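A rough sketch of the two levels of splitting described above, and of what "phrase by default" would mean for each whitespace-delimited chunk. This is only an illustration under stated assumptions: `analyze`, `Clause` and `parseQuery` are hypothetical names, not bleve's parser or API, and the toy analyser does no stemming.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// analyze is a toy stand-in for an analyser (hypothetical, not bleve's):
// it lowercases, splits on non-alphanumeric runes and drops a few stopwords.
func analyze(s string) []string {
	stop := map[string]bool{"in": true, "the": true, "a": true, "of": true}
	words := strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var terms []string
	for _, w := range words {
		if !stop[w] {
			terms = append(terms, w)
		}
	}
	return terms
}

// Clause is one whitespace-delimited chunk of the query after analysis.
type Clause struct {
	Terms  []string
	Phrase bool // true when term order within the chunk has to be preserved
}

// parseQuery applies both levels of splitting: whitespace first, then the
// analyser. Every chunk is treated as a phrase; for a single-term chunk
// there is no ordering to enforce, so it behaves like a plain term match.
func parseQuery(q string) []Clause {
	var clauses []Clause
	for _, chunk := range strings.Fields(q) {
		terms := analyze(chunk)
		if len(terms) == 0 {
			continue // the chunk consisted only of stopwords or punctuation
		}
		clauses = append(clauses, Clause{Terms: terms, Phrase: len(terms) > 1})
	}
	return clauses
}

func main() {
	for _, c := range parseQuery("ill-gotten gains") {
		fmt.Printf("%v phrase=%t\n", c.Terms, c.Phrase)
	}
}
```

With this toy analyser, `ill-gotten gains` yields one two-term clause where order matters and one single-term clause where it does not, which is the behaviour the quoted form `"ill-gotten" "gains"` is meant to express.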

@bcampbell
Contributor Author

bcampbell commented Jun 26, 2016

Just for background - as mentioned in the original mailing list thread, my motivation is for matching URLs in fields. My users really expect that:

url:/sport-section/

Would match /articles/sport-section/latest-cricket-scandal but definitely not /articles/politics-section/terrorists-are-just-bad-sports/

I could try and train them that whitespace isn't the only place where strings are broken up and that they need to quote stuff like this, but it seems better to make the default behaviour "feel" right, if possible.
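The URL case above can be sketched with a toy analyser and two hand-rolled matchers, one with "any term" (OR) semantics and one with phrase semantics. Everything here is an assumption for illustration: `analyze`, `anyTermMatch` and `phraseMatch` are hypothetical helpers, not bleve functions, and the analyser does no stemming or stopword removal.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Token is a term plus its position in the analysed field.
type Token struct {
	Term string
	Pos  int
}

// analyze is a toy analyser (hypothetical): lowercase, split on
// non-alphanumeric runes, number the terms from 0.
func analyze(s string) []Token {
	words := strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	toks := make([]Token, 0, len(words))
	for i, w := range words {
		toks = append(toks, Token{Term: w, Pos: i})
	}
	return toks
}

// anyTermMatch reports whether the document contains at least one query term.
func anyTermMatch(query, doc []Token) bool {
	seen := map[string]bool{}
	for _, d := range doc {
		seen[d.Term] = true
	}
	for _, q := range query {
		if seen[q.Term] {
			return true
		}
	}
	return false
}

// phraseMatch reports whether the query terms occur in the document with
// the same relative positions they have in the query.
func phraseMatch(query, doc []Token) bool {
	if len(query) == 0 {
		return false
	}
	at := map[int]string{}
	for _, d := range doc {
		at[d.Pos] = d.Term
	}
	for _, d := range doc {
		if d.Term != query[0].Term {
			continue
		}
		ok := true
		for _, q := range query[1:] {
			if at[d.Pos+q.Pos-query[0].Pos] != q.Term {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	q := analyze("/sport-section/")
	for _, u := range []string{
		"/articles/sport-section/latest-cricket-scandal",
		"/articles/politics-section/terrorists-are-just-bad-sports/",
	} {
		doc := analyze(u)
		fmt.Printf("%s any=%t phrase=%t\n", u, anyTermMatch(q, doc), phraseMatch(q, doc))
	}
}
```

In this sketch the second URL matches under OR semantics purely because it contains "section", while the phrase matcher accepts only the first URL. A stemming analyser would presumably also reduce "sports" to "sport", but that would not create a phrase match either, since the two terms are not adjacent there.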

(My alternate query parser already uses MatchPhraseQuery by default. I've not noticed any problems, but then I doubt it's been given the workout that QueryStringQuery has had.)
