-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changed format for hybrid query results to a single list of scores with delimiter #259
Changed format for hybrid query results to a single list of scores with delimiter #259
Conversation
Codecov Report
@@ Coverage Diff @@
## main #259 +/- ##
============================================
- Coverage 86.29% 85.44% -0.85%
- Complexity 340 370 +30
============================================
Files 29 30 +1
Lines 992 1079 +87
Branches 153 165 +12
============================================
+ Hits 856 922 +66
- Misses 70 83 +13
- Partials 66 74 +8
|
0dff588
to
24c4fd7
Compare
24c4fd7
to
08abf7b
Compare
08abf7b
to
6d366f4
Compare
@martin-gaievski can you run the spotlessApply as the build is failing. |
…ocs to signle list of scores with delimiter Signed-off-by: Martin Gaievski <gaievski@amazon.com>
6d366f4
to
4760545
Compare
done, not spotless checks are passing, thank you @navneet1v |
src/main/java/org/opensearch/neuralsearch/processor/CompoundTopDocs.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/CompoundTopDocs.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/CompoundTopDocs.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/CompoundTopDocs.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/CompoundTopDocs.java
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/search/query/HybridQueryPhaseSearcher.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/search/query/HybridQueryPhaseSearcher.java
Outdated
Show resolved
Hide resolved
return new TopDocs(totalHits, scoreDocs); | ||
} | ||
// format scores using following template: | ||
// doc_id | magic_number_1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need start and end mark?
Can we just do like
...
doc_id | delimiter
...
doc_id | delimiter
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to indicate the beginning and end of the result series, it makes the logic for parsing cleaner / simpler. Another intention was extensibility, if we need to add something to the "header" part it can be between first start/stop and first delimiter. From resource consumption there is no much difference which number to send, start/stop or delimiter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not about resource consumption but more about readability.
When you have a list of data and delimiter, it is natural to think delimiter exist only between item.
For example, with ,a,b,c
people naturally think there might be an empty item at the first.
When you need to add a header, you should have different marker instead. Otherwise, it will be more confusing if it is header or just another item. For example, e, f, g, h
, is e is header or item?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We know the format, and it's not supposed to be a human-readable. It was thinking about from extensibility of the format perspective, say we want to add general info for the whole result collection, not for the single-query. E.g. "from" page for pagination, or query with min and max results, something of this sort.
src/main/java/org/opensearch/neuralsearch/search/query/HybridQueryPhaseSearcher.java
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/search/query/HybridQueryPhaseSearcher.java
Show resolved
Hide resolved
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
eb2e29c
to
ae6c0c7
Compare
…ew comments Signed-off-by: Martin Gaievski <gaievski@amazon.com>
src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java
Show resolved
Hide resolved
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
72162cf
to
d33370e
Compare
src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java
Outdated
Show resolved
Hide resolved
…nsistent shard results Signed-off-by: Martin Gaievski <gaievski@amazon.com>
75b59cd
into
opensearch-project:main
Description
Addressing findings from testing of Hybrid Search on larger datasets (https://github.com/beir-cellar/beir#beers-available-datasets, mainly trec-covid and nfcorpus)
What's included in this change:
Case of multiple nodes has problem of serialization/deserialization of the custom class when data nodes are sending their results to coordinator. With current implementation of serialization and deserialization in core OpenSearch only single array of ScoreDocs will be transferred between nodes.
Change the approach of storing results, now we use existing TopDocs class. Scores will be stored in a single list and each sub-query results will end with a special delimiter score object (see below for exact format). CompoundTopDocs used as DTO only to keep results of multiple sub-queries between normalization processor steps.
Fixing side-effect of approach with a single list and delimiters. Due to implementation of optimization logic in core in such case processors will be called after the actual Fetch phase. For us that means we need to update results of fetch phase, not only query results.
Example of format used for score list with delimiters:
Issues Resolved
#194
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.