Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed format for hybrid query results to a single list of scores with delimiter #259

Conversation

martin-gaievski
Copy link
Member

@martin-gaievski martin-gaievski commented Aug 22, 2023

Description

Addressing findings from testing of Hybrid Search on larger datasets (https://github.com/beir-cellar/beir#beers-available-datasets, mainly trec-covid and nfcorpus)

What's included in this change:

  • fix for incorrect scores in case of multiple nodes

Case of multiple nodes has problem of serialization/deserialization of the custom class when data nodes are sending their results to coordinator. With current implementation of serialization and deserialization in core OpenSearch only single array of ScoreDocs will be transferred between nodes.

Change the approach of storing results, now we use existing TopDocs class. Scores will be stored in a single list and each sub-query results will end with a special delimiter score object (see below for exact format). CompoundTopDocs used as DTO only to keep results of multiple sub-queries between normalization processor steps.

  • fix for runtime errors in case of a single shard

Fixing side-effect of approach with a single list and delimiters. Due to implementation of optimization logic in core in such case processors will be called after the actual Fetch phase. For us that means we need to update results of fetch phase, not only query results.

  • change default value for min score from 0.001 to 0.0. This is after consulting with science team, same approach has been used in their experiments.

Example of format used for score list with delimiters:

     *  doc_id | magic_number_1    //start/stop
     *  doc_id | magic_number_2   //delimiter for sub-query 1
     *  ...
     *  doc_id | magic_number_2   //delimiter for sub-query 2
     *  ...
     *  doc_id | magic_number_2   //delimiter for sub-query 3
     *  ...
     *  doc_id | magic_number_1  //start/stop

Issues Resolved

#194

Check List

  • New functionality includes testing.
  • All tests pass
  • New functionality has been documented.
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov
Copy link

codecov bot commented Aug 22, 2023

Codecov Report

Merging #259 (a593a7a) into main (d12f480) will decrease coverage by 0.85%.
The diff coverage is 76.55%.

@@             Coverage Diff              @@
##               main     #259      +/-   ##
============================================
- Coverage     86.29%   85.44%   -0.85%     
- Complexity      340      370      +30     
============================================
  Files            29       30       +1     
  Lines           992     1079      +87     
  Branches        153      165      +12     
============================================
+ Hits            856      922      +66     
- Misses           70       83      +13     
- Partials         66       74       +8     
Files Changed Coverage Δ
...neuralsearch/processor/NormalizationProcessor.java 80.00% <60.00%> (-3.88%) ⬇️
...lsearch/search/query/HybridQueryPhaseSearcher.java 71.42% <62.85%> (-2.16%) ⬇️
...arch/processor/NormalizationProcessorWorkflow.java 81.13% <72.97%> (-6.87%) ⬇️
...r/normalization/L2ScoreNormalizationTechnique.java 83.33% <75.00%> (ø)
...rmalization/MinMaxScoreNormalizationTechnique.java 84.61% <80.00%> (ø)
...ralsearch/processor/combination/ScoreCombiner.java 91.83% <85.71%> (ø)
...arch/search/util/HybridSearchResultFormatUtil.java 85.71% <85.71%> (ø)
...search/neuralsearch/processor/CompoundTopDocs.java 92.30% <92.30%> (ø)
...earch/processor/normalization/ScoreNormalizer.java 100.00% <100.00%> (+20.00%) ⬆️

@martin-gaievski martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch 6 times, most recently from 0dff588 to 24c4fd7 Compare August 25, 2023 05:08
@martin-gaievski martin-gaievski changed the title Changed approach for storing hybrid query results from compound top docs to single list of scores with delimiter Changed format for hybrid query results to a single list of scores with delimiter Aug 25, 2023
@martin-gaievski martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch from 24c4fd7 to 08abf7b Compare August 25, 2023 16:17
@martin-gaievski martin-gaievski added backport 2.x Label will add auto workflow to backport PR to 2.x branch Enhancements Increases software capabilities beyond original client specifications labels Aug 25, 2023
@martin-gaievski martin-gaievski marked this pull request as ready for review August 25, 2023 16:18
@martin-gaievski martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch from 08abf7b to 6d366f4 Compare August 25, 2023 16:27
@navneet1v
Copy link
Collaborator

@martin-gaievski can you run the spotlessApply as the build is failing.

…ocs to signle list of scores with delimiter

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch from 6d366f4 to 4760545 Compare August 25, 2023 21:04
@martin-gaievski
Copy link
Member Author

@martin-gaievski can you run the spotlessApply as the build is failing.

done, not spotless checks are passing, thank you @navneet1v

return new TopDocs(totalHits, scoreDocs);
}
// format scores using following template:
// doc_id | magic_number_1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need start and end mark?
Can we just do like

...
doc_id | delimiter
...
doc_id | delimiter
...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to indicate the beginning and end of the result series, it makes the logic for parsing cleaner / simpler. Another intention was extensibility, if we need to add something to the "header" part it can be between first start/stop and first delimiter. From resource consumption there is no much difference which number to send, start/stop or delimiter

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not about resource consumption but more about readability.
When you have a list of data and delimiter, it is natural to think delimiter exist only between item.
For example, with ,a,b,c people naturally think there might be an empty item at the first.
When you need to add a header, you should have different marker instead. Otherwise, it will be more confusing if it is header or just another item. For example, e, f, g, h, is e is header or item?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We know the format, and it's not supposed to be a human-readable. It was thinking about from extensibility of the format perspective, say we want to add general info for the whole result collection, not for the single-query. E.g. "from" page for pagination, or query with min and max results, something of this sort.

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch from eb2e29c to ae6c0c7 Compare August 28, 2023 16:24
…ew comments

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski added the backport 2.10 backport to 2.10 branch label Aug 28, 2023
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the change_compound_topdocs_to_list_w_delimiters branch from 72162cf to d33370e Compare August 29, 2023 17:57
…nsistent shard results

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski merged commit 75b59cd into opensearch-project:main Aug 29, 2023
14 of 16 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 29, 2023
…th delimiter (#259)

* Changed approach for storing hybrid query results from compound top docs to signle list of scores with delimiter

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
(cherry picked from commit 75b59cd)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 29, 2023
…th delimiter (#259)

* Changed approach for storing hybrid query results from compound top docs to signle list of scores with delimiter

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
(cherry picked from commit 75b59cd)
martin-gaievski added a commit that referenced this pull request Aug 29, 2023
…th delimiter (#259) (#266)

* Changed approach for storing hybrid query results from compound top docs to signle list of scores with delimiter

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
(cherry picked from commit 75b59cd)

Co-authored-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski added a commit that referenced this pull request Aug 29, 2023
…th delimiter (#259) (#267)

* Changed approach for storing hybrid query results from compound top docs to signle list of scores with delimiter

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
(cherry picked from commit 75b59cd)

Co-authored-by: Martin Gaievski <gaievski@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch backport 2.10 backport to 2.10 branch Enhancements Increases software capabilities beyond original client specifications
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants