-
Notifications
You must be signed in to change notification settings - Fork 698
MB-69641: Fix duplicate results when performing KNN search over docs with multiple vectors #2258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request fixes a bug where KNN search returns duplicate documents when a field contains multiple vectors (e.g., vectors inside JSON object arrays). The fix implements deduplication logic that merges duplicate document IDs and retains the maximum score per KNN query for each document.
Key changes:
- Adds deduplication logic in
finalizeKNNResultsthat sorts hits by document ID and merges duplicate entries by taking the max score per KNN query index - Introduces comprehensive test coverage for vector object arrays, including both single-vector and multi-vector document scenarios
- Ensures that after merging partition results, the final hit set contains only unique documents with their best KNN score contributions
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| search_knn.go | Implements deduplication logic (lines 499-532) that sorts hits by document ID, merges duplicate documents by taking max scores per KNN query, and preserves unique documents in the final result set |
| search_knn_test.go | Adds TestVectorObjectArray function that validates KNN search behavior with both single-vector documents and multi-vector documents (arrays of vector objects), ensuring proper deduplication and score handling |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
abhinavdangeti
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CascadingRadium would you add a perf benchmark to compare the impact before and after this change, so we know what to expect.
45d197a to
42c98f1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
abhinavdangeti
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CascadingRadium 👍🏼. Make sure to do some end-to-end testing to confirm this works as expected.
|
hey @abhinavdangeti, so regarding your suggestion of merging at the shard level rather than at the coordinator node: I believe that it does not work and results in ranking variance. So, because of this, I am going to revert to the old approach of merging at the coordinator node following a sort operation. In that approach we ensure that the merge happens once the global top k vectors are determined, rather than the local topK. Thanks |
…2260) - When indexing multi-vector fields (e.g., `[[3,0,0], [0,4,0]]`) with `cosine` similarity, normalization was incorrectly applied to the entire flattened array instead of each sub-vector independently, resulting in degraded similarity scores. - Added `NormalizeMultiVector(vec, dims)` that normalizes each sub-vector separately, fixing scores for multi-vector documents (e.g., score now correctly returns 1.0 instead of 0.6 for exact matches).
KNNSearcherandKNNCollectorcan therefore emit duplicate document IDs.