Changed format for hybrid query results to a single list of scores with delimiter #259

martin-gaievski · 2023-08-22T16:24:39Z

Description

Addressing findings from testing of Hybrid Search on larger datasets (https://github.com/beir-cellar/beir#beers-available-datasets, mainly trec-covid and nfcorpus)

What's included in this change:

fix for incorrect scores in case of multiple nodes

Case of multiple nodes has problem of serialization/deserialization of the custom class when data nodes are sending their results to coordinator. With current implementation of serialization and deserialization in core OpenSearch only single array of ScoreDocs will be transferred between nodes.

Change the approach of storing results, now we use existing TopDocs class. Scores will be stored in a single list and each sub-query results will end with a special delimiter score object (see below for exact format). CompoundTopDocs used as DTO only to keep results of multiple sub-queries between normalization processor steps.

fix for runtime errors in case of a single shard

Fixing side-effect of approach with a single list and delimiters. Due to implementation of optimization logic in core in such case processors will be called after the actual Fetch phase. For us that means we need to update results of fetch phase, not only query results.

change default value for min score from 0.001 to 0.0. This is after consulting with science team, same approach has been used in their experiments.

Example of format used for score list with delimiters:

     *  doc_id | magic_number_1    //start/stop
     *  doc_id | magic_number_2   //delimiter for sub-query 1
     *  ...
     *  doc_id | magic_number_2   //delimiter for sub-query 2
     *  ...
     *  doc_id | magic_number_2   //delimiter for sub-query 3
     *  ...
     *  doc_id | magic_number_1  //start/stop

Issues Resolved

#194

Check List

New functionality includes testing.
All tests pass
New functionality has been documented.
Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov · 2023-08-22T16:29:04Z

Codecov Report

Merging #259 (a593a7a) into main (d12f480) will decrease coverage by 0.85%.
The diff coverage is 76.55%.

@@             Coverage Diff              @@
##               main     #259      +/-   ##
============================================
- Coverage     86.29%   85.44%   -0.85%     
- Complexity      340      370      +30     
============================================
  Files            29       30       +1     
  Lines           992     1079      +87     
  Branches        153      165      +12     
============================================
+ Hits            856      922      +66     
- Misses           70       83      +13     
- Partials         66       74       +8

Files Changed	Coverage Δ
...neuralsearch/processor/NormalizationProcessor.java	`80.00% <60.00%> (-3.88%)`	⬇️
...lsearch/search/query/HybridQueryPhaseSearcher.java	`71.42% <62.85%> (-2.16%)`	⬇️
...arch/processor/NormalizationProcessorWorkflow.java	`81.13% <72.97%> (-6.87%)`	⬇️
...r/normalization/L2ScoreNormalizationTechnique.java	`83.33% <75.00%> (ø)`
...rmalization/MinMaxScoreNormalizationTechnique.java	`84.61% <80.00%> (ø)`
...ralsearch/processor/combination/ScoreCombiner.java	`91.83% <85.71%> (ø)`
...arch/search/util/HybridSearchResultFormatUtil.java	`85.71% <85.71%> (ø)`
...search/neuralsearch/processor/CompoundTopDocs.java	`92.30% <92.30%> (ø)`
...earch/processor/normalization/ScoreNormalizer.java	`100.00% <100.00%> (+20.00%)`	⬆️

navneet1v · 2023-08-25T21:00:39Z

@martin-gaievski can you run the spotlessApply as the build is failing.

…ocs to signle list of scores with delimiter Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski · 2023-08-25T21:22:08Z

@martin-gaievski can you run the spotlessApply as the build is failing.

done, not spotless checks are passing, thank you @navneet1v

src/main/java/org/opensearch/neuralsearch/processor/CompoundTopDocs.java

src/main/java/org/opensearch/neuralsearch/search/query/HybridQueryPhaseSearcher.java

heemin32 · 2023-08-25T21:33:29Z

src/main/java/org/opensearch/neuralsearch/search/query/HybridQueryPhaseSearcher.java

+                return new TopDocs(totalHits, scoreDocs);
+            }
+            // format scores using following template:
+            // doc_id | magic_number_1


Do we really need start and end mark?
Can we just do like

... doc_id | delimiter ... doc_id | delimiter ...

I wanted to indicate the beginning and end of the result series, it makes the logic for parsing cleaner / simpler. Another intention was extensibility, if we need to add something to the "header" part it can be between first start/stop and first delimiter. From resource consumption there is no much difference which number to send, start/stop or delimiter

Not about resource consumption but more about readability.
When you have a list of data and delimiter, it is natural to think delimiter exist only between item.
For example, with ,a,b,c people naturally think there might be an empty item at the first.
When you need to add a header, you should have different marker instead. Otherwise, it will be more confusing if it is header or just another item. For example, e, f, g, h, is e is header or item?

We know the format, and it's not supposed to be a human-readable. It was thinking about from extensibility of the format perspective, say we want to add general info for the whole result collection, not for the single-query. E.g. "from" page for pagination, or query with min and max results, something of this sort.

src/main/java/org/opensearch/neuralsearch/search/query/HybridQueryPhaseSearcher.java

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

…ew comments Signed-off-by: Martin Gaievski <gaievski@amazon.com>

src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessor.java

src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java

…nsistent shard results Signed-off-by: Martin Gaievski <gaievski@amazon.com>

…th delimiter (#259) * Changed approach for storing hybrid query results from compound top docs to signle list of scores with delimiter Signed-off-by: Martin Gaievski <gaievski@amazon.com> (cherry picked from commit 75b59cd)

…th delimiter (#259) (#266) * Changed approach for storing hybrid query results from compound top docs to signle list of scores with delimiter Signed-off-by: Martin Gaievski <gaievski@amazon.com> (cherry picked from commit 75b59cd) Co-authored-by: Martin Gaievski <gaievski@amazon.com>

…th delimiter (#259) (#267) * Changed approach for storing hybrid query results from compound top docs to signle list of scores with delimiter Signed-off-by: Martin Gaievski <gaievski@amazon.com> (cherry picked from commit 75b59cd) Co-authored-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch 6 times, most recently from 0dff588 to 24c4fd7 Compare August 25, 2023 05:08

martin-gaievski changed the title ~~Changed approach for storing hybrid query results from compound top docs to single list of scores with delimiter~~ Changed format for hybrid query results to a single list of scores with delimiter Aug 25, 2023

martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch from 24c4fd7 to 08abf7b Compare August 25, 2023 16:17

martin-gaievski added backport 2.x Label will add auto workflow to backport PR to 2.x branch Enhancements Increases software capabilities beyond original client specifications labels Aug 25, 2023

martin-gaievski marked this pull request as ready for review August 25, 2023 16:18

martin-gaievski requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, sean-zheng-amazon, model-collapse, wujunshen, zane-neo, ylwu-amzn and jngz-es as code owners August 25, 2023 16:18

martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch from 08abf7b to 6d366f4 Compare August 25, 2023 16:27

martin-gaievski mentioned this pull request Aug 25, 2023

[Backport 2.x] Added Score Normalization and Combination feature, manual backport #263

Merged

Changed approach for storing hybrid query results from compound top d…

4760545

…ocs to signle list of scores with delimiter Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch from 6d366f4 to 4760545 Compare August 25, 2023 21:04

heemin32 reviewed Aug 25, 2023

View reviewed changes

martin-gaievski requested a review from heemin32 August 26, 2023 01:08

Addressed multiple review comments

ae6c0c7

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch from eb2e29c to ae6c0c7 Compare August 28, 2023 16:24

Check fetch results notNull instead of number of shards 1, minor revi…

03e43fc

…ew comments Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski added the backport 2.10 backport to 2.10 branch label Aug 28, 2023

heemin32 approved these changes Aug 28, 2023

View reviewed changes

navneet1v reviewed Aug 29, 2023

View reviewed changes

Refactor code, adding exceptions for case of inconsistent internal state

d33370e

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch from 72162cf to d33370e Compare August 29, 2023 17:57

martin-gaievski requested a review from navneet1v August 29, 2023 18:42

navneet1v reviewed Aug 29, 2023

View reviewed changes

src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java Outdated Show resolved Hide resolved

navneet1v reviewed Aug 29, 2023

View reviewed changes

src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java Outdated Show resolved Hide resolved

Added more details to exception message, removed log for case of inco…

a593a7a

…nsistent shard results Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski requested a review from navneet1v August 29, 2023 19:41

navneet1v approved these changes Aug 29, 2023

View reviewed changes

martin-gaievski merged commit 75b59cd into opensearch-project:main Aug 29, 2023
14 of 16 checks passed

opensearch-trigger-bot bot mentioned this pull request Aug 29, 2023

[Backport 2.x] Changed format for hybrid query results to a single list of scores with delimiter #266

Merged

opensearch-trigger-bot bot mentioned this pull request Aug 29, 2023

[Backport 2.10] Changed format for hybrid query results to a single list of scores with delimiter #267

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changed format for hybrid query results to a single list of scores with delimiter #259

Changed format for hybrid query results to a single list of scores with delimiter #259

martin-gaievski commented Aug 22, 2023 •

edited

Loading

codecov bot commented Aug 22, 2023 •

edited

Loading

navneet1v commented Aug 25, 2023

martin-gaievski commented Aug 25, 2023

heemin32 Aug 25, 2023

martin-gaievski Aug 26, 2023

heemin32 Aug 28, 2023

martin-gaievski Aug 28, 2023

Changed format for hybrid query results to a single list of scores with delimiter #259

Changed format for hybrid query results to a single list of scores with delimiter #259

Conversation

martin-gaievski commented Aug 22, 2023 • edited Loading

Description

Issues Resolved

Check List

codecov bot commented Aug 22, 2023 • edited Loading

Codecov Report

navneet1v commented Aug 25, 2023

martin-gaievski commented Aug 25, 2023

heemin32 Aug 25, 2023

Choose a reason for hiding this comment

martin-gaievski Aug 26, 2023

Choose a reason for hiding this comment

heemin32 Aug 28, 2023

Choose a reason for hiding this comment

martin-gaievski Aug 28, 2023

Choose a reason for hiding this comment

martin-gaievski commented Aug 22, 2023 •

edited

Loading

codecov bot commented Aug 22, 2023 •

edited

Loading