Deprecate max_token_score field of neural_sparse query #478

Merged
merged 7 commits into from
Nov 22, 2023

Conversation

zhichao-aws
Member

@zhichao-aws zhichao-aws commented Oct 30, 2023

Description

In the neural_sparse query, the max_token_score field was used for sub-clause pruning by the WAND scorer (Lucene 9.7). Since we'll upgrade to Lucene 9.8 in the next release, where Lucene's internal logic has changed, we no longer need this field and should deprecate it. More specifically: in 2.x, users can still set this field, but we'll ignore it and log a warning. In 3.x, we won't parse this field at all.
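To illustrate the intended 2.x behavior, here is a minimal sketch, assuming a simplified map-based parser. The class `NeuralSparseQuerySketch`, its fields, and the warning text are hypothetical and are not the plugin's actual code, which parses with OpenSearch's `XContentParser` and `ParseField`. The sketch only shows the contract: the deprecated field is accepted, its value is dropped, and a warning is recorded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the query builder's parsing logic.
// It does not use the real OpenSearch XContentParser or ParseField classes.
class NeuralSparseQuerySketch {
    static final String QUERY_TEXT = "query_text";
    static final String MODEL_ID = "model_id";
    static final String MAX_TOKEN_SCORE = "max_token_score"; // deprecated in 2.x

    final String queryText;
    final String modelId;
    final List<String> warnings; // stands in for the deprecation log

    NeuralSparseQuerySketch(String queryText, String modelId, List<String> warnings) {
        this.queryText = queryText;
        this.modelId = modelId;
        this.warnings = warnings;
    }

    // Parses the body of a neural_sparse clause. The deprecated field is
    // still accepted, but its value is ignored and a warning is recorded.
    static NeuralSparseQuerySketch parse(Map<String, Object> body) {
        String queryText = null;
        String modelId = null;
        List<String> warnings = new ArrayList<>();
        for (Map.Entry<String, Object> entry : body.entrySet()) {
            switch (entry.getKey()) {
                case QUERY_TEXT:
                    queryText = String.valueOf(entry.getValue());
                    break;
                case MODEL_ID:
                    modelId = String.valueOf(entry.getValue());
                    break;
                case MAX_TOKEN_SCORE:
                    // Value intentionally dropped: it no longer affects scoring.
                    warnings.add("[" + MAX_TOKEN_SCORE + "] is deprecated and will be ignored");
                    break;
                default:
                    throw new IllegalArgumentException("unknown field [" + entry.getKey() + "]");
            }
        }
        return new NeuralSparseQuerySketch(queryText, modelId, warnings);
    }
}
```

Under these assumptions, parsing a body with and without `max_token_score` yields the same `queryText` and `modelId`; the only difference is the recorded warning.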

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov

codecov bot commented Oct 30, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (ef19ffa) 80.95% compared to head (617f154) 85.53%.

Additional details and impacted files
@@             Coverage Diff              @@
##                2.x     #478      +/-   ##
============================================
+ Coverage     80.95%   85.53%   +4.58%     
- Complexity      512      516       +4     
============================================
  Files            41       40       -1     
  Lines          1591     1521      -70     
  Branches        247      238       -9     
============================================
+ Hits           1288     1301      +13     
+ Misses          197      112      -85     
- Partials        106      108       +2     


@zhichao-aws zhichao-aws marked this pull request as ready for review October 30, 2023 03:41
Collaborator

@model-collapse model-collapse left a comment

Please check the coverage report.

@navneet1v
Collaborator

Since we'll upgrade to lucene 9.8 in the next release, the inner logic in lucene changed and we don't need this field any more.

@zhichao-aws can you shed some light on how Lucene takes care of this? Has Lucene added this capability? How does this affect queries and search relevance for customers who use this field?

@navneet1v
Collaborator

@zhichao-aws can you also add the results of the tests that were run, to confirm that queries are not affected when a customer doesn't provide this deprecated field?

@navneet1v
Collaborator

@zhichao-aws can you also add details on how we tested the upgrades?

  @VisibleForTesting
- static final ParseField MAX_TOKEN_SCORE_FIELD = new ParseField("max_token_score");
+ static final ParseField MAX_TOKEN_SCORE_FIELD = new ParseField("max_token_score").withAllDeprecated();
Collaborator

Can we add the @Deprecated annotation on top of this?

Member Author

Sure.

@@ -88,7 +93,6 @@ public void testFromXContent_whenBuiltWithOptionals_thenBuildSuccessfully() {
  "VECTOR_FIELD": {
  "query_text": "string",
  "model_id": "string",
- "max_token_score": 123.0,
Collaborator

Can we keep a unit test that provides the deprecated field and verifies that it has no impact on queries?

Member Author

I think we already have this.

@zhichao-aws
Member Author

zhichao-aws commented Nov 3, 2023

Since we'll upgrade to lucene 9.8 in the next release, the inner logic in lucene changed and we don't need this field any more.

@zhichao-aws can you shed some light on how Lucene takes care of this? Has Lucene added this capability? How does this affect queries and search relevance for customers who use this field?

Please refer to this Lucene PR for more details. In short, Lucene used ImpactsDISI to skip hits based on the minimum competitive score. But Lucene contributors found that ImpactsDISI hurts more than it helps on average, because it adds considerable overhead and the per-clause minimum scores are usually so low that they don't actually enable skipping hits. Now only the top-level scoring clause uses ImpactsDISI for pruning. FeatureQuery is not the top-level scoring clause, so it won't use ImpactsDISI. https://github.com/apache/lucene/pull/12490/files#diff-6ca6d673f9d09efdca430f2d5a381fbd862fd74385b778e45ade26fe112bca85

So in our plugin, we no longer need to provide the score upper bound for FeatureQuery as of Lucene 9.8. We ran some tests, and this reduces neural_sparse latency by a large margin: for the 8.8 million docs case, latency dropped to about 1/3 of the Lucene 9.7 figure. This optimization doesn't affect the search results; it only speeds up searching within shards, so it won't hurt search relevance. Since we keep API compatibility in 2.x, users can still run existing queries; we'll ignore the max_token_score field and log a warning.
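As a rough illustration of why a low per-clause minimum score enables almost no skipping, here is a toy sketch, assuming each block of documents carries a known score upper bound. The class `BlockSkippingSketch` and the method `blocksToScore` are hypothetical names; this is not Lucene's ImpactsDISI implementation, only the basic idea of pruning by a score upper bound.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of score-upper-bound pruning. The names here are hypothetical
// and this is not how Lucene's ImpactsDISI is actually implemented.
class BlockSkippingSketch {
    // Returns the indices of the blocks that still have to be scored:
    // a block can be skipped only when its score upper bound is below
    // the minimum competitive score.
    static List<Integer> blocksToScore(float[] blockMaxScores, float minCompetitiveScore) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < blockMaxScores.length; i++) {
            if (blockMaxScores[i] >= minCompetitiveScore) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

With block upper bounds of 0.4, 1.2, 0.9 and 2.5, a minimum competitive score of 2.0 leaves only the last block to score, while a minimum of 0.1 leaves all four. In the second case the bookkeeping costs time and skips nothing.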

@zhichao-aws
Member Author

@zhichao-aws can you also add the results of the tests that were run, to confirm that queries are not affected when a customer doesn't provide this deprecated field?

The deprecated field was optional in the 2.11 release, so customers can always omit it.

@zhichao-aws
Member Author

Hi @navneet1v , I've added some comments and commits, could you please help review again?

@zhichao-aws
Member Author

BTW, this should be merged before we bump the version to 2.12.0. The old code will hit compilation errors because of the changes to Lucene's internal interfaces.

Signed-off-by: zhichao-aws <zhichaog@amazon.com>
@zhichao-aws
Member Author

Force-pushed to rebase onto 2.x.

Collaborator

@model-collapse model-collapse left a comment

LGTM

@zane-neo zane-neo merged commit 04bf2a4 into opensearch-project:2.x Nov 22, 2023
15 checks passed