Conversation

@CascadingRadium
Member

@CascadingRadium CascadingRadium commented Dec 3, 2025

  • KNN search may yield multiple hits for the same document when a field contains multiple vectors (e.g., vectors inside JSON object arrays). Both the KNNSearcher and KNNCollector can therefore emit duplicate document IDs.
  • After merging shard/partition results, the coordinator selects the global top-K vectors without regard to document boundaries, so duplicates can persist into the final hit set.
  • In a single-shard index, a random vector per document is chosen instead of the best-scoring one, which makes the ranking inconsistent.
  • The fix adds a preprocessing step to the finalization pipeline: once the vector results are finalized, hits are deduplicated by document, keeping the maximum score in the score breakdown.

@CascadingRadium CascadingRadium changed the title Fix duplicate results when performing KNN search MB-69641: Fix duplicate results when performing KNN search Dec 3, 2025
Copilot finished reviewing on behalf of CascadingRadium December 3, 2025 12:11
Contributor

Copilot AI left a comment

Pull request overview

This pull request fixes a bug where KNN search returns duplicate documents when a field contains multiple vectors (e.g., vectors inside JSON object arrays). The fix implements deduplication logic that merges duplicate document IDs and retains the maximum score per KNN query for each document.

Key changes:

  • Adds deduplication logic in finalizeKNNResults that sorts hits by document ID and merges duplicate entries by taking the max score per KNN query index
  • Introduces comprehensive test coverage for vector object arrays, including both single-vector and multi-vector document scenarios
  • Ensures that after merging partition results, the final hit set contains only unique documents with their best KNN score contributions
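
The sort-and-merge step described above could look roughly like the following. This is a minimal sketch, not the code in `search_knn.go`: the `hit` type, its `Scores` map, and the `dedupHits` function are hypothetical stand-ins introduced only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// hit is a hypothetical, simplified stand-in for a search hit.
// Scores maps a KNN query index to the score that query contributed.
type hit struct {
	ID     string
	Scores map[int]float64
}

// dedupHits sorts hits by document ID and merges hits that share an ID,
// keeping the maximum score per KNN query index.
func dedupHits(hits []*hit) []*hit {
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].ID < hits[j].ID })
	out := make([]*hit, 0, len(hits))
	for _, h := range hits {
		if len(out) == 0 || out[len(out)-1].ID != h.ID {
			out = append(out, h)
			continue
		}
		last := out[len(out)-1]
		if last.Scores == nil {
			last.Scores = map[int]float64{}
		}
		for q, s := range h.Scores {
			if cur, ok := last.Scores[q]; !ok || s > cur {
				last.Scores[q] = s
			}
		}
	}
	return out
}

func main() {
	hits := []*hit{
		{ID: "A", Scores: map[int]float64{0: 0.97}},
		{ID: "B", Scores: map[int]float64{0: 0.96}},
		{ID: "A", Scores: map[int]float64{0: 0.99}},
		{ID: "A", Scores: map[int]float64{1: 0.50}},
	}
	for _, h := range dedupHits(hits) {
		fmt.Println(h.ID, h.Scores)
	}
}
```

Sorting first keeps duplicates adjacent, so the merge is a single linear pass; a map keyed by document ID would be an alternative design with different allocation behavior.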

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| search_knn.go | Implements deduplication logic (lines 499-532) that sorts hits by document ID, merges duplicate documents by taking max scores per KNN query, and preserves unique documents in the final result set |
| search_knn_test.go | Adds TestVectorObjectArray function that validates KNN search behavior with both single-vector documents and multi-vector documents (arrays of vector objects), ensuring proper deduplication and score handling |

@abhinavdangeti abhinavdangeti changed the title MB-69641: Fix duplicate results when performing KNN search MB-69641: Fix duplicate results when performing KNN search over docs with multiple vectors Dec 3, 2025
Member

@abhinavdangeti abhinavdangeti left a comment

@CascadingRadium would you add a perf benchmark to compare the impact before and after this change, so we know what to expect?

@abhinavdangeti abhinavdangeti modified the milestones: v2.6.0, v2.5.7 Dec 3, 2025
@CascadingRadium CascadingRadium marked this pull request as draft December 4, 2025 11:31
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.


CascadingRadium and others added 3 commits December 4, 2025 21:58
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
abhinavdangeti previously approved these changes Dec 5, 2025
Member

@abhinavdangeti abhinavdangeti left a comment

@CascadingRadium 👍🏼. Make sure to do some end-to-end testing to confirm this works as expected.

@CascadingRadium CascadingRadium marked this pull request as draft December 7, 2025 06:14
Member Author

CascadingRadium commented Dec 8, 2025

hey @abhinavdangeti, so regarding your suggestion of merging at the shard level rather than at the coordinator node: I believe that it does not work and results in ranking variance.
Consider the following scenario:

Assume we do a K=3 search on the index, and the index has 5 vectors v1..v5 in 
3 documents A, B and C, with similarity scores as below:
----------------------------------------------------------------------------------------
One Partition

A -> v1(0.99), v2(0.98), v3(0.97)
B -> v4(0.96)
C -> v5(0.95)

We get v1, v2, v3 => (A) as the final result
----------------------------------------------------------------------------------------
Two Partitions

PartA
A -> v1(0.99), v2(0.98), v3(0.97) => send v1(0.99) to coordinator

PartB
B -> v4(0.96) => send v4(0.96) to coordinator
C -> v5(0.95) => send v5(0.95) to coordinator

We get v1, v4, v5 => (A, B, C) as the final result
----------------------------------------------------------------------------------------
By processing at the coordinator node, we get all the vectors at the coordinator instead:
=> v1(0.99), v2(0.98), v3(0.97), v4(0.96), and v5(0.95) 
     => pick v1(0.99), v2(0.98), v3(0.97) => dedup to (A) as the final result

So, because of this, I am going to revert to the old approach of merging at the coordinator node following a sort operation. That approach ensures the merge happens only once the global top-K vectors are determined, rather than on each partition's local top-K.

Thanks
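
The scenario in the comment above can be reproduced with a small simulation. This is a sketch under assumptions, not the actual searcher or coordinator code: `vecHit`, `topK`, `dedupByDoc`, `shardLevelDedup`, and `coordinatorDedup` are hypothetical names, and the shard-level variant is assumed to deduplicate each partition's local top-K before sending it on.

```go
package main

import (
	"fmt"
	"sort"
)

// vecHit is a hypothetical per-vector hit: one scored vector of one document.
type vecHit struct {
	Doc   string
	Vec   string
	Score float64
}

// topK returns the k highest-scoring hits without modifying the input.
func topK(hits []vecHit, k int) []vecHit {
	s := append([]vecHit(nil), hits...)
	sort.SliceStable(s, func(i, j int) bool { return s[i].Score > s[j].Score })
	if len(s) > k {
		s = s[:k]
	}
	return s
}

// dedupByDoc keeps the best-scoring hit per document, ordered by score descending.
func dedupByDoc(hits []vecHit) []vecHit {
	best := map[string]vecHit{}
	for _, h := range hits {
		if cur, ok := best[h.Doc]; !ok || h.Score > cur.Score {
			best[h.Doc] = h
		}
	}
	out := make([]vecHit, 0, len(best))
	for _, h := range best {
		out = append(out, h)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

// docs extracts the document IDs in order.
func docs(hits []vecHit) []string {
	ids := make([]string, 0, len(hits))
	for _, h := range hits {
		ids = append(ids, h.Doc)
	}
	return ids
}

// shardLevelDedup deduplicates inside each partition before the coordinator merges.
func shardLevelDedup(parts [][]vecHit, k int) []string {
	var merged []vecHit
	for _, p := range parts {
		merged = append(merged, dedupByDoc(topK(p, k))...)
	}
	return docs(topK(dedupByDoc(merged), k))
}

// coordinatorDedup picks the global top-K vectors first, then deduplicates.
func coordinatorDedup(parts [][]vecHit, k int) []string {
	var merged []vecHit
	for _, p := range parts {
		merged = append(merged, topK(p, k)...)
	}
	return docs(dedupByDoc(topK(merged, k)))
}

func main() {
	a := []vecHit{{"A", "v1", 0.99}, {"A", "v2", 0.98}, {"A", "v3", 0.97}}
	b := []vecHit{{"B", "v4", 0.96}, {"C", "v5", 0.95}}
	one := [][]vecHit{append(append([]vecHit{}, a...), b...)}
	two := [][]vecHit{a, b}

	fmt.Println("shard-level, one partition: ", shardLevelDedup(one, 3))
	fmt.Println("shard-level, two partitions:", shardLevelDedup(two, 3))
	fmt.Println("coordinator, one partition: ", coordinatorDedup(one, 3))
	fmt.Println("coordinator, two partitions:", coordinatorDedup(two, 3))
}
```

With these inputs the shard-level variant returns A for one partition but A, B, C for two, while the coordinator variant returns A in both layouts, which is the ranking variance the comment describes.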

…2260)

- When indexing multi-vector fields (e.g., `[[3,0,0], [0,4,0]]`) with
`cosine` similarity, normalization was incorrectly applied to the entire
flattened array instead of each sub-vector independently, resulting in
degraded similarity scores.
- Added `NormalizeMultiVector(vec, dims)` that normalizes each
sub-vector separately, fixing scores for multi-vector documents (e.g.,
score now correctly returns 1.0 instead of 0.6 for exact matches).

3 participants