Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Construct Phrases may result in double quotes when original search is quoted #41

Closed
rpialum opened this issue Jun 16, 2014 · 4 comments
Closed

Comments

@rpialum
Copy link
Contributor

rpialum commented Jun 16, 2014

There are a few issues that exist with the current constructPhrases logic when expanding synonyms. One of which can result in multiple quotes being applied when the original search is quoted.

ie: Search: "Internal Revenue Service" takes money
Synonyms: (IRS, tax service, internal revenue service)

Current results: "IRS" takes money; ""tax service"" takes money; ""internal revenue service"" takes money.

My proposed solution involves making a few fixes with in generateSynonymQueries(), all when SynonymDismaxParams.SYNONYMS_CONSTRUCT_PHRASES has been set to true.

  1. Only apply quotes when the synonym term is a phrase (more than one term).
  2. Only apply quotes when the synonym phrase is not already surrounded by quotes.

Changes:

Add to top of generateSynonymQueries():
String origQuery = getQueryStringFromParser();
int queryLen = origQuery.length();

// TODO: make the token stream reusable?
TokenStream tokenStream = synonymAnalyzer.tokenStream(SynonymDismaxConst.IMPOSSIBLE_FIELD_NAME,
new StringReader(origQuery));

Replace current phraseQuery if logic with:
if (constructPhraseQueries && typeAttribute.type().equals("SYNONYM") &&
termToAdd.contains(" "))
{
//Dont' Quote when original is already surrounded by quotes
if( offsetAttribute.startOffset()==0 ||
offsetAttribute.endOffset() == queryLen ||
origQuery.charAt(offsetAttribute.startOffset()-1)!='"' ||
origQuery.charAt(offsetAttribute.endOffset())!='"')
{
// make a phrase out of the synonym
termToAdd = new StringBuilder(termToAdd).insert(0,'"').append('"').toString();
}
}

@nolanlawson
Copy link
Member

Thanks for raising the issue. I agree it's a problem and will look into applying the patch. If you'd like the process to go faster, though, please submit a formal PR and also a unit test showing that your fix works. The unit tests are all done in Python and should be fairly easy to understand; there are instructions in the readme.

@rpialum
Copy link
Contributor Author

rpialum commented Jun 17, 2014

Thanks for the fast response and for all the hard work you've done with this project. I assume PR refers to Pull Request (I've only ever used github for pulling code, rather than contributing). I'm also a novice when it comes to SOLR internals and configuration functionality (Filters/Tokenizers), though looking through yours and Tiens Multi-term synonym logic this past week has given me a bit of a crash course.

One quick question: How do the various Tokenizers and Filters interact when there are query time Tokenizer and Filter specified on the field being queried when using the synonym_edismax parser? Is there a specific order in which they are applied or does one super-cede the other orare they completely independent of the other?

Our schema file specifies the following in our schema file for the field we're expanding synonyms on:

<fieldType name="our_text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"  />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>

@nolanlawson
Copy link
Member

@rpialum No problem, and yes PR is a pull request. :)

I'm not sure how to answer your question, but if you use the debug UI in the solr admin, you should be able to see how the filter factories and tokenizers are being successively applied to the input.

Also, as for your schema, you can add it to the sample schema, which is also what's used in the unit test. So that way, you should be able to get the unit tests running. I.e. these files are what's used in the unit tests. Ping me if anything else is unclear; hope that helps!

rpialum added a commit to rpialum/hon-lucene-synonyms that referenced this issue Jun 18, 2014
Issue healthonnet#41, when original search term is quoted and synonym being expanded is item quoted don't re-quote when doing constructPhraseQueries.  Removes issue where double quotes appear in result.  Also when doing constructPhraseQueries, only quote phrases not single term synonyms.
rpialum added a commit to rpialum/hon-lucene-synonyms that referenced this issue Jun 18, 2014
Stubbed out test case for issue healthonnet#41.  Additional logic needs to be implemented for testing the results, see TODO.  I've been unable to get Python(v3.4.1) to run the top level file, but am assuming that debugQuery needs to be set on and the contents of the 'expandedSynonyms' in the response needs to be analyized.  Currently I'm assuming a count of the number of quotes would work, though the exact expected string could also be passed in.
rpialum added a commit to rpialum/hon-lucene-synonyms that referenced this issue Jun 18, 2014
Stubbed out test case for issue healthonnet#41. Additional logic needs to be implemented for testing the results, see TODO. I've been unable to get Python(v3.4.1) to run the top level file, but am assuming that debugQuery needs to be set on and the contents of the 'expandedSynonyms' in the response needs to be analyized. Currently I'm assuming a count of the number of quotes would work, though the exact expected string could also be passed in.
nolanlawson pushed a commit that referenced this issue Oct 4, 2014
Issue #41, when original search term is quoted and synonym being expanded is item quoted don't re-quote when doing constructPhraseQueries.  Removes issue where double quotes appear in result.  Also when doing constructPhraseQueries, only quote phrases not single term synonyms.
@nolanlawson
Copy link
Member

fixed in 242d330

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants