fix: deduplicate row addresses in take to prevent panic #5881
+190
−16
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Previously, duplicate row addresses (e.g. from FTS on
List<Utf8>wheremultiple list elements in the same row match) were passed to the v2 encoding
layer, which requires strictly increasing indices. This caused
indices_to_rangesto produce overlapping ranges, panicking with "attempt tosubtract with overflow".
Dedup is handled at two sites:
FragmentReader::take_as_batch: This is the common entry point for allcallers that need a single batch (
FileFragment::take_rows,FragmentSession::take_rows,TakeStream::map_batch). Since it alreadycollects into one batch, the expand-after-dedup step is trivial. Handling it
here rather than in the v2 adapter's
take_all_tasksavoids complicationswith the streaming
ReadBatchTaskStreamreturn type. Handling it lower inschedule_takewouldn't work because the encoding layer has no notion ofduplicates by design.
TakeStream::map_batch: Dedup here serves a different purpose — itreduces I/O by avoiding redundant reads of the same row across fragments.
This also optimizes the common path (sorted, unique addresses) to skip all
sorting/dedup/permutation overhead entirely.
Fixes #5260
Test plan
test_take_with_duplicate_row_addrs— sorted duplicate row addressestest_take_with_unsorted_duplicate_row_addrs— unsorted duplicatestest_fragment_take_indices/test_fragment_take_rows— fragment-levelduplicate indices (previously failing on
LanceFileVersion::Stable)🤖 Generated with Claude Code