MB-69641: Fix duplicate results when performing KNN search over docs with multiple vectors #2258

CascadingRadium · 2025-12-03T12:06:51Z

KNN search may yield multiple hits for the same document when a field contains multiple vectors (e.g., vectors inside JSON object arrays). Both the KNNSearcher and KNNCollector can therefore emit duplicate document IDs.
After merging shard/partition results, the coordinator selects the global top-K vectors without regard to document boundaries, so duplicates can persist into the final hit set.
In a single-shard index, an arbitrary vector per document is chosen instead of the best-scoring one, which makes ranking inconsistent.
The fix adds a preprocessing step to the finalization pipeline: once the vector results are finalized, hits are deduplicated by document, keeping the maximum value of each score-breakdown entry.

Copilot

Pull request overview

This pull request fixes a bug where KNN search returns duplicate documents when a field contains multiple vectors (e.g., vectors inside JSON object arrays). The fix implements deduplication logic that merges duplicate document IDs and retains the maximum score per KNN query for each document.

Key changes:

- Adds deduplication logic in finalizeKNNResults that sorts hits by document ID and merges duplicate entries by taking the max score per KNN query index
- Introduces comprehensive test coverage for vector object arrays, including both single-vector and multi-vector document scenarios
- Ensures that after merging partition results, the final hit set contains only unique documents with their best KNN score contributions
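The sort-then-merge step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `hit` type, its `scores` slice (one entry per KNN query index), and the `dedupHits` function are hypothetical stand-ins, and the sketch assumes every hit carries a score slice of the same length.

```go
package main

import (
	"fmt"
	"sort"
)

// hit is a hypothetical, simplified stand-in for a search hit.
// scores[i] is the hit's score for the i-th KNN query (0 if that query did not match).
type hit struct {
	id     string
	scores []float64
}

// dedupHits sorts hits by document ID, then folds each run of equal IDs
// into a single hit that keeps the maximum score per KNN query index.
// It reorders and reuses the input slice.
func dedupHits(hits []*hit) []*hit {
	if len(hits) < 2 {
		return hits
	}
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].id < hits[j].id })
	out := hits[:1]
	for _, h := range hits[1:] {
		last := out[len(out)-1]
		if h.id != last.id {
			out = append(out, h) // first hit seen for a new document
			continue
		}
		// Same document: keep the best score for each KNN query.
		for q, s := range h.scores {
			if q < len(last.scores) && s > last.scores[q] {
				last.scores[q] = s
			}
		}
	}
	return out
}

func main() {
	hits := []*hit{
		{id: "A", scores: []float64{0.97, 0}},
		{id: "B", scores: []float64{0.96, 0}},
		{id: "A", scores: []float64{0.99, 0}},
		{id: "A", scores: []float64{0, 0.5}},
	}
	for _, h := range dedupHits(hits) {
		fmt.Println(h.id, h.scores)
	}
}
```

Sorting first keeps the merge a single linear pass over adjacent entries; a map keyed by document ID would be an alternative with different allocation behaviour.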

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| search_knn.go | Implements deduplication logic (lines 499-532) that sorts hits by document ID, merges duplicate documents by taking max scores per KNN query, and preserves unique documents in the final result set |
| search_knn_test.go | Adds `TestVectorObjectArray` function that validates KNN search behavior with both single-vector documents and multi-vector documents (arrays of vector objects), ensuring proper deduplication and score handling |

abhinavdangeti

@CascadingRadium would you add a perf benchmark comparing the impact before and after this change, so we know what to expect?

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

abhinavdangeti

@CascadingRadium 👍🏼. Make sure to do some end-to-end testing to confirm this works as expected.

CascadingRadium · 2025-12-08T06:14:37Z

Hey @abhinavdangeti, regarding your suggestion to merge at the shard level rather than at the coordinator node: I believe it does not work, because the ranking then varies with how the index is partitioned.
Consider the following scenario:

Assume we do a K=3 search on the index, and the index has 5 vectors v1..v5 in 
3 documents A, B and C, with similarity scores as below:
----------------------------------------------------------------------------------------
One Partition

A -> v1(0.99), v2(0.98), v3(0.97)
B -> v4(0.96)
C -> v5(0.95)

We get v1, v2, v3 => (A) as the final result
----------------------------------------------------------------------------------------
Two Partitions

PartA
A -> v1(0.99), v2(0.98), v3(0.97) => send v1(0.99) to coordinator

PartB
B -> v4(0.96) => send v4(0.96) to coordinator
C -> v5(0.95) => send v5(0.95) to coordinator

We get v1, v4, v5 => (A, B, C) as the final result
----------------------------------------------------------------------------------------
By processing at the coordinator node, we get all the vectors at the coordinator instead:
=> v1(0.99), v2(0.98), v3(0.97), v4(0.96), and v5(0.95) 
     => pick v1(0.99), v2(0.98), v3(0.97) => dedup to (A) as the final result

Because of this, I am going to revert to the old approach of merging at the coordinator node after a sort. That approach ensures the merge happens once the global top-K vectors are determined, rather than the local top-K.
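The scenario above can be reproduced with a small simulation. This is only a sketch under assumptions: `vecHit`, `topK`, `dedup`, `shardLevel`, and `coordinatorLevel` are hypothetical names made up for illustration, and each partition is modelled as a plain slice of (document, score) pairs rather than a real index.

```go
package main

import (
	"fmt"
	"sort"
)

// vecHit is a hypothetical (document, score) pair for one matched vector.
type vecHit struct {
	doc   string
	score float64
}

// topK returns a copy of the k highest-scoring hits, best first.
func topK(hits []vecHit, k int) []vecHit {
	s := append([]vecHit(nil), hits...)
	sort.SliceStable(s, func(i, j int) bool { return s[i].score > s[j].score })
	if len(s) > k {
		s = s[:k]
	}
	return s
}

// dedup keeps the best-scoring hit per document, best first.
func dedup(hits []vecHit) []vecHit {
	best := map[string]float64{}
	for _, h := range hits {
		if s, ok := best[h.doc]; !ok || h.score > s {
			best[h.doc] = h.score
		}
	}
	out := make([]vecHit, 0, len(best))
	for d, s := range best {
		out = append(out, vecHit{d, s})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].score > out[j].score })
	return out
}

// shardLevel dedups each partition's local top-K before sending it on.
func shardLevel(parts [][]vecHit, k int) []vecHit {
	var merged []vecHit
	for _, p := range parts {
		merged = append(merged, dedup(topK(p, k))...)
	}
	return dedup(topK(merged, k))
}

// coordinatorLevel forwards raw local top-K vectors and dedups only
// after the global top-K vectors have been selected.
func coordinatorLevel(parts [][]vecHit, k int) []vecHit {
	var merged []vecHit
	for _, p := range parts {
		merged = append(merged, topK(p, k)...)
	}
	return dedup(topK(merged, k))
}

func main() {
	all := []vecHit{{"A", 0.99}, {"A", 0.98}, {"A", 0.97}, {"B", 0.96}, {"C", 0.95}}
	one := [][]vecHit{all}
	two := [][]vecHit{all[:3], all[3:]}
	fmt.Println("shard-level, 1 partition: ", shardLevel(one, 3))
	fmt.Println("shard-level, 2 partitions:", shardLevel(two, 3))
	fmt.Println("coordinator, 1 partition: ", coordinatorLevel(one, 3))
	fmt.Println("coordinator, 2 partitions:", coordinatorLevel(two, 3))
}
```

With shard-level dedup the set of returned documents changes between the one-partition and two-partition layouts, while coordinator-level dedup returns the same documents for both.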

Thanks

…2260)
- When indexing multi-vector fields (e.g., `[[3,0,0], [0,4,0]]`) with `cosine` similarity, normalization was incorrectly applied to the entire flattened array instead of each sub-vector independently, resulting in degraded similarity scores.
- Added `NormalizeMultiVector(vec, dims)` that normalizes each sub-vector separately, fixing scores for multi-vector documents (e.g., score now correctly returns 1.0 instead of 0.6 for exact matches).
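The 0.6 versus 1.0 figures can be checked with a small sketch. It does not reproduce the real `NormalizeMultiVector`; `normalizeInPlace` and `normalizeEachSubVector` are hypothetical helpers, and the sketch assumes the multi-vector field is stored flattened as consecutive sub-vectors of length `dims`.

```go
package main

import (
	"fmt"
	"math"
)

// normalizeInPlace scales v to unit L2 norm; zero vectors are left unchanged.
func normalizeInPlace(v []float32) {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	if sum == 0 {
		return
	}
	norm := math.Sqrt(sum)
	for i := range v {
		v[i] = float32(float64(v[i]) / norm)
	}
}

// normalizeEachSubVector (hypothetical) treats flat as consecutive sub-vectors
// of length dims and normalizes each one independently, returning a copy.
// If flat cannot be split evenly, the copy is returned unchanged.
func normalizeEachSubVector(flat []float32, dims int) []float32 {
	out := append([]float32(nil), flat...)
	if dims <= 0 || len(out)%dims != 0 {
		return out
	}
	for i := 0; i < len(out); i += dims {
		normalizeInPlace(out[i : i+dims])
	}
	return out
}

// dot returns the dot product over the shorter of the two lengths.
func dot(a, b []float32) float32 {
	var s float32
	for i := 0; i < len(a) && i < len(b); i++ {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	flat := []float32{3, 0, 0, 0, 4, 0} // [[3,0,0], [0,4,0]] flattened
	query := []float32{1, 0, 0}         // unit vector pointing along [3,0,0]

	// Normalizing the whole flattened array divides everything by 5.
	whole := normalizeEachSubVector(flat, len(flat))
	// Normalizing per sub-vector divides [3,0,0] by 3 and [0,4,0] by 4.
	perSub := normalizeEachSubVector(flat, 3)

	fmt.Println("flattened normalization, score vs first sub-vector:", dot(query, whole[:3]))
	fmt.Println("per-sub-vector normalization, score:               ", dot(query, perSub[:3]))
}
```

Because the query is already unit length, the dot product against a correctly normalized sub-vector is the cosine similarity, so an exact directional match scores 1.0 only in the per-sub-vector case.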

CascadingRadium requested review from Likith101, Thejas-bhat, abhinavdangeti, capemox, Copilot and maneuvertomars December 3, 2025 12:07

CascadingRadium changed the title ~~Fix duplicate results when performing KNN search~~ MB-69641: Fix duplicate results when performing KNN search Dec 3, 2025

Copilot started reviewing on behalf of CascadingRadium December 3, 2025 12:07

Copilot finished reviewing on behalf of CascadingRadium December 3, 2025 12:11

Copilot AI reviewed Dec 3, 2025

abhinavdangeti changed the title ~~MB-69641: Fix duplicate results when performing KNN search~~ MB-69641: Fix duplicate results when performing KNN search over docs with multiple vectors Dec 3, 2025

abhinavdangeti reviewed Dec 3, 2025

abhinavdangeti modified the milestones: v2.6.0, v2.5.7 Dec 3, 2025

CascadingRadium marked this pull request as draft December 4, 2025 11:31

CascadingRadium added 4 commits December 4, 2025 19:32

Fix duplicate results when performing KNN search

d654a70

code review

7e65ecd

fix dedup logic

8838f89

unit test

42c98f1

CascadingRadium force-pushed the knnDup branch from 45d197a to 42c98f1 Compare December 4, 2025 16:10

CascadingRadium requested review from abhinavdangeti and Copilot December 4, 2025 16:24

CascadingRadium marked this pull request as ready for review December 4, 2025 16:24

Copilot started reviewing on behalf of CascadingRadium December 4, 2025 16:25

Copilot finished reviewing on behalf of CascadingRadium December 4, 2025 16:26

Copilot AI reviewed Dec 4, 2025

CascadingRadium and others added 3 commits December 4, 2025 21:58

fix

c4dd9d4

Apply suggestions from code review

351d8be

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

go fmt ./...

2db8199

CascadingRadium added 3 commits December 4, 2025 22:52

fix total calc

a5fd255

fix edge case

b914204

fix test

6b153a0

abhinavdangeti reviewed Dec 4, 2025

Fix interface

8721d16

abhinavdangeti previously approved these changes Dec 5, 2025

CascadingRadium marked this pull request as draft December 7, 2025 06:14

CascadingRadium dismissed abhinavdangeti’s stale review via a233b67 December 8, 2025 06:18

CascadingRadium added 2 commits December 8, 2025 13:37

revert

dd2422d

fix

f3540a6

CascadingRadium requested a review from abhinavdangeti December 8, 2025 08:49

fix

fcb0d76
