Deprecate max_token_score field of neural_sparse query #478

zhichao-aws · 2023-10-30T02:43:24Z

Description

In the neural_sparse query, the max_token_score field was used for sub-clause pruning by the WAND scorer (Lucene 9.7). We will upgrade to Lucene 9.8 in the next release, where Lucene's internal logic has changed and this field is no longer needed, so we need to deprecate it. Specifically, in 2.x users can still set the field, but we ignore it and log a warning. In 3.x, we will not parse the field at all.
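The 2.x behavior described above (accept the field, ignore its value, emit a warning) can be sketched roughly as follows. This is a minimal illustration using only the JDK, not the plugin's actual parser: the class `DeprecatedFieldSketch`, the `ParsedQuery` holder, and the warning text are all hypothetical, and warnings are collected in a list rather than sent to a real logger.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a query parser; NOT the plugin's NeuralSparseQueryBuilder. */
class DeprecatedFieldSketch {
    static final String QUERY_TEXT = "query_text";
    static final String MODEL_ID = "model_id";
    static final String MAX_TOKEN_SCORE = "max_token_score"; // deprecated field

    /** Holds only the values that still influence the query, plus any warnings. */
    static final class ParsedQuery {
        String queryText;
        String modelId;
        final List<String> warnings = new ArrayList<>();
    }

    static ParsedQuery parse(Map<String, Object> body) {
        ParsedQuery parsed = new ParsedQuery();
        for (Map.Entry<String, Object> entry : body.entrySet()) {
            switch (entry.getKey()) {
                case QUERY_TEXT:
                    parsed.queryText = String.valueOf(entry.getValue());
                    break;
                case MODEL_ID:
                    parsed.modelId = String.valueOf(entry.getValue());
                    break;
                case MAX_TOKEN_SCORE:
                    // Accepted for compatibility, but the value is discarded.
                    parsed.warnings.add("Deprecated field [" + MAX_TOKEN_SCORE + "] is ignored");
                    break;
                default:
                    throw new IllegalArgumentException("Unknown field [" + entry.getKey() + "]");
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put(QUERY_TEXT, "string");
        body.put(MODEL_ID, "string");
        body.put(MAX_TOKEN_SCORE, 123.0); // still accepted in this sketch
        ParsedQuery parsed = parse(body);
        System.out.println(parsed.queryText + " / " + parsed.modelId + " / " + parsed.warnings);
    }
}
```

In a 3.x-style variant of this sketch, the `MAX_TOKEN_SCORE` case would simply be removed, so the field would fall through to the unknown-field branch.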

Check List

- New functionality includes testing.
  - All tests pass
- New functionality has been documented.
  - New functionality has javadoc added
- Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov · 2023-10-30T03:39:01Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (ef19ffa) 80.95% compared to head (617f154) 85.53%.

Additional details and impacted files

@@             Coverage Diff              @@
##                2.x     #478      +/-   ##
============================================
+ Coverage     80.95%   85.53%   +4.58%     
- Complexity      512      516       +4     
============================================
  Files            41       40       -1     
  Lines          1591     1521      -70     
  Branches        247      238       -9     
============================================
+ Hits           1288     1301      +13     
+ Misses          197      112      -85     
- Partials        106      108       +2


model-collapse

Check the coverage thing.

src/main/java/org/opensearch/neuralsearch/query/NeuralSparseQueryBuilder.java

navneet1v · 2023-11-03T07:17:39Z

Since we'll upgrade to lucene 9.8 in the next release, the inner logic in lucene changed and we don't need this field any more.

@zhichao-aws can you shed some light on how Lucene takes care of this? Has Lucene added this capability? How does this affect the queries and search relevance of customers who use this field?

navneet1v · 2023-11-03T07:18:22Z

@zhichao-aws can you also add the results of the tests that have been done, to ensure that queries are not impacted if a customer doesn't provide this deprecated field?

navneet1v · 2023-11-03T07:21:41Z

@zhichao-aws can you also add details on how we tested the upgrades?

navneet1v · 2023-11-03T07:19:11Z

src/main/java/org/opensearch/neuralsearch/query/NeuralSparseQueryBuilder.java

    @VisibleForTesting
-    static final ParseField MAX_TOKEN_SCORE_FIELD = new ParseField("max_token_score");
+    static final ParseField MAX_TOKEN_SCORE_FIELD = new ParseField("max_token_score").withAllDeprecated();


Can we add the @Deprecated annotation on top of this?

navneet1v · 2023-11-03T07:22:56Z

src/test/java/org/opensearch/neuralsearch/query/NeuralSparseQueryBuilderTests.java

@@ -88,7 +93,6 @@ public void testFromXContent_whenBuiltWithOptionals_thenBuildSuccessfully() {
              "VECTOR_FIELD": {
                "query_text": "string",
                "model_id": "string",
-                "max_token_score": 123.0,


Can we keep a unit test where we provide the deprecated field and verify that queries are not affected?

I think we already have this.

zhichao-aws · 2023-11-03T09:36:10Z

Since we'll upgrade to lucene 9.8 in the next release, the inner logic in lucene changed and we don't need this field any more.

@zhichao-aws can you shed some light on how Lucene takes care of this? Has Lucene added this capability? How does this affect the queries and search relevance of customers who use this field?

Please refer to this Lucene PR for more details. In short, Lucene used ImpactsDISI to skip hits based on the minimum competitive score. But Lucene contributors found that ImpactsDISI hurts more than it helps on average, because it adds considerable overhead and the per-clause minimum scores are usually so low that they don't actually enable skipping hits. Now only the top-level scoring clause uses ImpactsDISI for pruning. FeatureQuery is not the top-level scoring clause, so it won't use ImpactsDISI. https://github.com/apache/lucene/pull/12490/files#diff-6ca6d673f9d09efdca430f2d5a381fbd862fd74385b778e45ade26fe112bca85

So in our plugin, we don't need to provide the score upper bound for FeatureQuery after Lucene 9.8. We conducted some tests, and this reduces neural_sparse latency by a large margin: for the 8.8 million document case, latency dropped to about 1/3 of the Lucene 9.7 figure. This optimization doesn't affect the search results; it only speeds up searching within shards, so it won't hurt search relevance. Because we keep API compatibility in 2.x, users can still run existing queries; we'll ignore the max_token_score field and log a warning.
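To make the "per-clause minimum scores are usually so low" point concrete, here is a toy model of it. It is a conceptual sketch only, not Lucene's WAND or ImpactsDISI code: it assumes the threshold handed to one clause is the overall minimum competitive score minus the score upper bounds of all other clauses, and the class name, method names, and numbers are hypothetical.

```java
/** Conceptual toy model; NOT Lucene's WAND or ImpactsDISI implementation. */
class ClauseThresholdSketch {
    /**
     * Score a single clause must reach for a hit to stay competitive,
     * assuming every other clause contributes its full upper bound.
     */
    static double perClauseThreshold(double[] upperBounds, int clause, double minCompetitiveScore) {
        double others = 0.0;
        for (int i = 0; i < upperBounds.length; i++) {
            if (i != clause) {
                others += upperBounds[i];
            }
        }
        return minCompetitiveScore - others;
    }

    /** In this model a clause can only skip hits when its threshold is positive. */
    static boolean clauseCanSkip(double[] upperBounds, int clause, double minCompetitiveScore) {
        return perClauseThreshold(upperBounds, clause, minCompetitiveScore) > 0.0;
    }

    public static void main(String[] args) {
        // Hypothetical upper bounds for three token clauses.
        double[] upperBounds = {2.0, 3.0, 4.0};
        // With a minimum competitive score of 5.0, clause 0 gets a negative
        // threshold, so it cannot skip anything on its own.
        System.out.println(perClauseThreshold(upperBounds, 0, 5.0));
        System.out.println(clauseCanSkip(upperBounds, 0, 5.0));
    }
}
```

Under these assumptions, a query with many token clauses leaves each individual clause with a threshold at or below zero until the overall minimum competitive score gets close to the sum of all upper bounds, which is consistent with the observation that per-clause skipping rarely pays for its overhead.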

zhichao-aws · 2023-11-03T09:42:10Z

@zhichao-aws can you also add the results of the tests that have been done, to ensure that queries are not impacted if a customer doesn't provide this deprecated field?

The deprecated field was optional in the 2.11 release, so customers have always been able to omit it.

zhichao-aws · 2023-11-03T09:48:22Z

@zhichao-aws can you also add details on how we tested the upgrades?

We have integ tests for both setting and not setting this field:
https://github.com/zhichao-aws/neural-search/blob/dbda2c45638d141042793eeab0614742b898db3c/src/test/java/org/opensearch/neuralsearch/query/NeuralSparseQueryIT.java#L97
https://github.com/zhichao-aws/neural-search/blob/dbda2c45638d141042793eeab0614742b898db3c/src/test/java/org/opensearch/neuralsearch/query/NeuralSparseQueryIT.java#L68

We also have a unit test to check that we still parse this field but log a warning:
https://github.com/zhichao-aws/neural-search/blob/dbda2c45638d141042793eeab0614742b898db3c/src/test/java/org/opensearch/neuralsearch/query/NeuralSparseQueryBuilderTests.java#L123
When the user provides this field, we ignore it and still get the Lucene query we want:
https://github.com/zhichao-aws/neural-search/blob/dbda2c45638d141042793eeab0614742b898db3c/src/test/java/org/opensearch/neuralsearch/query/NeuralSparseQueryBuilderTests.java#L507

zhichao-aws · 2023-11-09T03:15:28Z

Hi @navneet1v, I've added some comments and commits. Could you please review again?

zhichao-aws · 2023-11-13T06:17:49Z

BTW, this should be merged before we bump the version to 2.12.0. The old code will fail to compile because of changes to Lucene's internal interfaces.

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

zhichao-aws · 2023-11-22T09:31:24Z

force push to rebase 2.x

model-collapse

LGTM

zhichao-aws requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, wujunshen, zane-neo, ylwu-amzn and jngz-es as code owners October 30, 2023 02:43

zhichao-aws marked this pull request as draft October 30, 2023 03:02

zhichao-aws marked this pull request as ready for review October 30, 2023 03:41

zane-neo approved these changes Nov 3, 2023


model-collapse reviewed Nov 3, 2023


src/main/java/org/opensearch/neuralsearch/query/NeuralSparseQueryBuilder.java (resolved)

navneet1v reviewed Nov 3, 2023


zhichao-aws added 2 commits November 22, 2023 17:30

rm bounded linear feature query

8394134

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

deprecate max_token_score

dd0b44b

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

zhichao-aws added 5 commits November 22, 2023 17:30

add changelog

354560a

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

tidy

daef81e

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

fix ut

f52c443

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

add ut

79d25e3

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

add deprecation annotation

617f154

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

zhichao-aws force-pushed the 2.x branch from 3eeb336 to 617f154 Compare November 22, 2023 09:30

model-collapse self-requested a review November 22, 2023 09:56

model-collapse approved these changes Nov 22, 2023


zane-neo merged commit 04bf2a4 into opensearch-project:2.x Nov 22, 2023
15 checks passed

zhichao-aws mentioned this pull request Mar 1, 2024

Deprecate max_token_score in neural sparse search opensearch-project/documentation-website#6554

Merged

1 task

zhichao-aws mentioned this pull request Apr 18, 2024

[BUG FIX] Fix bwc failure in neural sparse search #696

Merged

5 tasks
