Conversation


@wjones127 wjones127 commented Feb 3, 2026

Previously, duplicate row addresses (e.g. from FTS on List<Utf8> where
multiple list elements in the same row match) were passed to the v2 encoding
layer, which requires strictly increasing indices. This caused
indices_to_ranges to produce overlapping ranges, panicking with "attempt to
subtract with overflow".
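The failure mode can be illustrated with a simplified, hypothetical version of the index-to-range coalescing step (the real indices_to_ranges in the encoding crate differs in detail): strictly increasing indices coalesce into disjoint half-open ranges, but a duplicate index starts a new range that overlaps the previous one, and downstream code assuming disjoint ranges can compute a negative length.

```rust
// Hypothetical sketch of index-to-range coalescing, NOT the actual
// lance-encoding implementation. Strictly increasing indices are grouped
// into disjoint half-open ranges; a duplicate breaks that invariant.
fn indices_to_ranges(indices: &[u64]) -> Vec<std::ops::Range<u64>> {
    let mut ranges: Vec<std::ops::Range<u64>> = Vec::new();
    for &idx in indices {
        if let Some(last) = ranges.last_mut() {
            if last.end == idx {
                // Contiguous with the current range: extend it.
                last.end = idx + 1;
                continue;
            }
        }
        // Otherwise start a new single-element range.
        ranges.push(idx..idx + 1);
    }
    ranges
}

fn main() {
    // Unique, increasing input yields disjoint ranges.
    assert_eq!(indices_to_ranges(&[1, 2, 5]), vec![1..3, 5..6]);
    // A duplicate (e.g. the same row matched twice by FTS on List<Utf8>)
    // yields overlapping ranges; code that assumes disjointness can then
    // subtract a larger start from a smaller end and panic.
    assert_eq!(indices_to_ranges(&[1, 1, 2]), vec![1..2, 1..3]);
}
```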

Dedup is handled at two sites:

  • FragmentReader::take_as_batch: This is the common entry point for all
    callers that need a single batch (FileFragment::take_rows,
    FragmentSession::take_rows, TakeStream::map_batch). Since it already
    collects into one batch, the expand-after-dedup step is trivial. Handling it
    here rather than in the v2 adapter's take_all_tasks avoids complications
    with the streaming ReadBatchTaskStream return type. Handling it lower in
    schedule_take wouldn't work because the encoding layer has no notion of
    duplicates by design.

  • TakeStream::map_batch: Dedup here serves a different purpose — it
    reduces I/O by avoiding redundant reads of the same row across fragments.
    This also optimizes the common path (sorted, unique addresses) to skip all
    sorting/dedup/permutation overhead entirely.
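The dedup-then-expand pattern used at both sites can be sketched as follows (the helper name and shape here are hypothetical, not the actual FragmentReader API): the reader is handed only the unique addresses, and a parallel expansion index maps each original position back to its unique row, so a single take restores the duplicates in the output batch.

```rust
// Hypothetical sketch of the dedup-then-expand pattern described above.
// Assumes `addrs` is already sorted, so duplicates are adjacent.
fn dedup_with_expansion(addrs: &[u64]) -> (Vec<u64>, Vec<usize>) {
    let mut unique: Vec<u64> = Vec::with_capacity(addrs.len());
    let mut expand: Vec<usize> = Vec::with_capacity(addrs.len());
    for &addr in addrs {
        // Only push when the address differs from the previous one.
        if unique.last() != Some(&addr) {
            unique.push(addr);
        }
        // Each input position records which unique row it maps to.
        expand.push(unique.len() - 1);
    }
    (unique, expand)
}

fn main() {
    let (unique, expand) = dedup_with_expansion(&[3, 3, 7, 9, 9]);
    // The reader only fetches the unique rows...
    assert_eq!(unique, vec![3, 7, 9]);
    // ...and taking with `expand` afterwards restores the duplicates.
    assert_eq!(expand, vec![0, 0, 1, 2, 2]);
}
```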

Fixes #5260

Test plan

  • test_take_with_duplicate_row_addrs — sorted duplicate row addresses
  • test_take_with_unsorted_duplicate_row_addrs — unsorted duplicates
  • test_fragment_take_indices / test_fragment_take_rows — fragment-level
    duplicate indices (previously failing on LanceFileVersion::Stable)
  • All 57 take-related tests pass

🤖 Generated with Claude Code

TakeStream::map_batch passed duplicate row addresses straight through to
the encoding layer, which requires strictly increasing indices.
Duplicates (e.g. from FTS on List<Utf8> where multiple list elements in
the same row match) caused indices_to_ranges to produce overlapping
ranges, panicking in BinaryPageScheduler with "attempt to subtract with
overflow".

Dedup sorted addresses before passing them to fragment readers, then
expand the results back to include duplicates. Also tighten the
schedule_take debug_assert from <= to < to catch this earlier.

Fixes lance-format#5260

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions github-actions bot added the bug Something isn't working label Feb 3, 2026

github-actions bot commented Feb 3, 2026

Code Review

This PR correctly fixes issue #5260 by deduplicating row addresses before passing them to the encoding layer, which requires strictly increasing indices.

P1 Issue: Use proper error propagation instead of .unwrap()

At line 290 in the updated code, .unwrap() on take_record_batch could panic if it fails. While the existing code on line 272 also uses .unwrap() for the permutation restore, adding proper error propagation with ? would be more defensive.

Consider changing to:

new_data = arrow_select::take::take_record_batch(&new_data, &expand_indices)
    .map_err(|e| DataFusionError::Execution(e.to_string()))?;

The same applies to the existing .unwrap() on line 272, though that is outside the scope of this PR.


Overall the fix is correct and well-tested. The dedup approach is sound: using the sorted order to efficiently track unique addresses and then expanding results back with duplicates is the right pattern.

wjones127 and others added 2 commits February 3, 2026 12:05
The common path (sorted, no duplicates) now does a single-pass
is_sorted_and_unique check and skips all sorting, dedup, and
permutation logic. The unsorted/duplicate paths are unified and
the expand + inverse permutation are composed into a single take.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
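The commit's fast-path check, is_sorted_and_unique, amounts to a single pass over adjacent pairs; a minimal sketch of that check (the implementation details here are an assumption, only the function name comes from the commit message):

```rust
// Sketch of the single-pass fast-path check named in the commit message.
// Strictly increasing adjacent pairs imply the input is both sorted and
// free of duplicates, so all sort/dedup/permutation work can be skipped.
fn is_sorted_and_unique(addrs: &[u64]) -> bool {
    addrs.windows(2).all(|w| w[0] < w[1])
}

fn main() {
    assert!(is_sorted_and_unique(&[1, 4, 9])); // common path: no extra work
    assert!(!is_sorted_and_unique(&[1, 4, 4])); // duplicate
    assert!(!is_sorted_and_unique(&[4, 1, 9])); // unsorted
}
```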
@wjones127 wjones127 changed the title from "fix: deduplicate row addresses in TakeStream to prevent panic" to "fix: deduplicate row addresses in take to prevent panic" Feb 3, 2026
@wjones127 wjones127 marked this pull request as ready for review February 3, 2026 23:24
@wjones127 wjones127 requested a review from westonpace February 3, 2026 23:24
@wjones127 wjones127 added the critical-fix Bugs that cause crashes, security vulnerabilities, or incorrect data. label Feb 3, 2026

codecov bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 94.53125% with 7 lines in your changes missing coverage. Please review.

Files with missing lines             Patch %   Lines
rust/lance/src/io/exec/take.rs       95.45%    5 Missing ⚠️
rust/lance/src/dataset/fragment.rs   88.23%    0 Missing and 2 partials ⚠️



Successfully merging this pull request may close these issues.

Subtract with overflow panic while decoding binary data
