Construct Phrases may result in double quotes when original search is quoted #41

rpialum · 2014-06-16T16:18:25Z

There are a few issues with the current constructPhrases logic when expanding synonyms. One of them can result in doubled quotes when the original search is already quoted.

For example, search: "Internal Revenue Service" takes money
Synonyms: (IRS, tax service, internal revenue service)

Current results: "IRS" takes money; ""tax service"" takes money; ""internal revenue service"" takes money.

My proposed solution involves a few fixes within generateSynonymQueries(), all of which apply when SynonymDismaxParams.SYNONYMS_CONSTRUCT_PHRASES has been set to true:

Only apply quotes when the synonym term is a phrase (more than one term).
Only apply quotes when the synonym phrase is not already surrounded by quotes.

Changes:

Add to top of generateSynonymQueries():
String origQuery = getQueryStringFromParser();
int queryLen = origQuery.length();

// TODO: make the token stream reusable?
TokenStream tokenStream = synonymAnalyzer.tokenStream(SynonymDismaxConst.IMPOSSIBLE_FIELD_NAME,
        new StringReader(origQuery));

Replace the current phraseQuery if logic with:
if (constructPhraseQueries && typeAttribute.type().equals("SYNONYM") &&
        termToAdd.contains(" "))
{
    // Don't quote when the original is already surrounded by quotes.
    // The first two checks also guard the charAt() calls against going out of bounds.
    if (offsetAttribute.startOffset() == 0 ||
            offsetAttribute.endOffset() == queryLen ||
            origQuery.charAt(offsetAttribute.startOffset() - 1) != '"' ||
            origQuery.charAt(offsetAttribute.endOffset()) != '"')
    {
        // make a phrase out of the synonym
        termToAdd = new StringBuilder(termToAdd).insert(0, '"').append('"').toString();
    }
}
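
The rule above can be tried outside Solr with plain strings. The following is only a minimal sketch, not the plugin's code: the class SynonymQuoting and the methods quoteIfNeeded() and expand() are hypothetical names, the offsets are passed in by hand instead of coming from an OffsetAttribute, and it assumes the synonym is spliced into the original query in place of the matched span.

```java
final class SynonymQuoting {

    /**
     * Returns the synonym as it would be spliced into the query.
     * start/end are character offsets of the matched span in origQuery
     * (end is exclusive). Hypothetical helper, for illustration only.
     */
    static String quoteIfNeeded(String origQuery, int start, int end, String synonym) {
        // Single-term synonyms are never turned into phrases.
        if (!synonym.contains(" ")) {
            return synonym;
        }
        // The span counts as already quoted only if a '"' sits directly on both sides.
        boolean alreadyQuoted = start > 0
                && end < origQuery.length()
                && origQuery.charAt(start - 1) == '"'
                && origQuery.charAt(end) == '"';
        return alreadyQuoted ? synonym : "\"" + synonym + "\"";
    }

    /** Replaces the span [start, end) of origQuery with the (possibly quoted) synonym. */
    static String expand(String origQuery, int start, int end, String synonym) {
        return origQuery.substring(0, start)
                + quoteIfNeeded(origQuery, start, end, synonym)
                + origQuery.substring(end);
    }

    public static void main(String[] args) {
        String quoted = "\"Internal Revenue Service\" takes money";
        String unquoted = "Internal Revenue Service takes money";

        // Quoted original: the existing quotes are reused, none are added.
        System.out.println(expand(quoted, 1, 25, "tax service"));
        // Unquoted original: a multi-term synonym becomes a phrase.
        System.out.println(expand(unquoted, 0, 24, "tax service"));
        // Unquoted original: a single-term synonym stays bare.
        System.out.println(expand(unquoted, 0, 24, "IRS"));
    }
}
```

With the quoted search from the example, the synonym is inserted between the quotes that are already there, so no doubled quotes appear.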

nolanlawson · 2014-06-16T17:06:56Z

Thanks for raising the issue. I agree it's a problem and will look into applying the patch. If you'd like the process to go faster, though, please submit a formal PR and also a unit test showing that your fix works. The unit tests are all done in Python and should be fairly easy to understand; there are instructions in the readme.

rpialum · 2014-06-17T14:31:25Z

Thanks for the fast response and for all the hard work you've done on this project. I assume PR refers to Pull Request (I've only ever used GitHub for pulling code, not for contributing). I'm also a novice when it comes to Solr internals and configuration (Filters/Tokenizers), though looking through your and Tien's multi-term synonym logic this past week has given me a bit of a crash course.

One quick question: how do the various Tokenizers and Filters interact when a query-time Tokenizer and Filter are specified on the field being queried and the synonym_edismax parser is used? Is there a specific order in which they are applied, does one supersede the other, or are they completely independent of each other?

Our schema file specifies the following for the field we're expanding synonyms on:

<fieldType name="our_text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"  />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

nolanlawson · 2014-06-17T16:01:35Z

@rpialum No problem, and yes PR is a pull request. :)

I'm not sure how to answer your question, but if you use the debug UI in the Solr admin, you should be able to see how the filter factories and tokenizers are successively applied to the input.

Also, as for your schema, you can add it to the sample schema, which is also what's used in the unit test. So that way, you should be able to get the unit tests running. I.e. these files are what's used in the unit tests. Ping me if anything else is unclear; hope that helps!

Issue healthonnet#41: when the original search term is quoted and the synonym being expanded is itself quoted, don't re-quote when doing constructPhraseQueries. Removes the issue where double quotes appear in the result. Also, when doing constructPhraseQueries, only quote phrases, not single-term synonyms.

Stubbed out a test case for issue healthonnet#41. Additional logic still needs to be implemented for testing the results; see the TODO. I've been unable to get Python (v3.4.1) to run the top-level file, but I'm assuming that debugQuery needs to be turned on and the contents of 'expandedSynonyms' in the response need to be analyzed. For now I'm assuming that a count of the number of quotes would work, though the exact expected string could also be passed in.
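
The quote-counting check described in this note could look roughly like the sketch below. It is written in Java only to match the patch; the project's actual tests are in Python, and the class QuoteCheck and its methods are hypothetical names. It assumes the expanded query is available as a plain string, for instance one entry taken from 'expandedSynonyms'.

```java
final class QuoteCheck {

    /** True if two double-quote characters appear back to back, as in ""tax service"". */
    static boolean hasDoubledQuotes(String expandedQuery) {
        return expandedQuery.contains("\"\"");
    }

    /** Counts the double-quote characters in the expanded query. */
    static long countQuotes(String expandedQuery) {
        return expandedQuery.chars().filter(c -> c == '"').count();
    }

    public static void main(String[] args) {
        String broken = "\"\"tax service\"\" takes money";
        String fixed = "\"tax service\" takes money";

        System.out.println(hasDoubledQuotes(broken) + " " + countQuotes(broken));
        System.out.println(hasDoubledQuotes(fixed) + " " + countQuotes(fixed));
    }
}
```

Comparing against the exact expected string is the stricter option. A bare quote count cannot tell two separate phrases from one doubly quoted phrase, and the adjacency check would also flag a legitimately empty phrase.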

nolanlawson · 2014-10-04T18:53:20Z

fixed in 242d330

rpialum mentioned this issue Jun 18, 2014

Patch 1 Issue 41 #42

Closed

nolanlawson closed this as completed Oct 4, 2014

joekiller mentioned this issue Dec 9, 2015

Bug in #41 fix for Solr 5.3.1 #51

Closed
