Skip to content

feat(scanner): add row_id allowlist support#5831

Open
fredlarochelle wants to merge 10 commits intolance-format:mainfrom
fredlarochelle:feature/rowid-allowlist
Open

feat(scanner): add row_id allowlist support#5831
fredlarochelle wants to merge 10 commits intolance-format:mainfrom
fredlarochelle:feature/rowid-allowlist

Conversation

@fredlarochelle
Copy link

@fredlarochelle fredlarochelle commented Jan 27, 2026

We hit a severe bottleneck when filtering on very large IN (...) lists (millions of comparisons) in scalar filters. We can cheaply pre-compute a bitmap row-ID allowlist in memory, but Lance had no way to inject it into the query pipeline. This PR exposes that path so we can bypass expensive scalar filtering and push the allowlist directory into scan / FTS / vector execution, including IO-level pruning on non-legacy storage. Once this PR is merged, we plan to add LanceDB support.

What this PR does

  • Adds Scanner:row_id_allowlist for supplying a row-ID allowlist (dataset row-ID space).
  • Enforces allowlist on plain scans by intersecting with any TakeOperations, then treating any remaining filters as refine-only.
  • Plumbs allowlist through vector and FTS prefilter paths (ANN/KNN + FTS), including unindexed/flat paths.
  • Adds IO‑level pruning on non‑legacy storage by passing allowlist into FilteredReadExec.
  • Ensures allowlist always intersects with deletion masks and other filters.

Semantics

  • Allowlist is a hard constraint in dataset row-ID space (_rowid)
  • If stable row IDs are disabled, _rowid is row address, allowlist must be snapshot-bound.
  • If stable row IDs are enabled, _rowid is stable across versions.

Tests added

  • Runtime: allowlist with ANN + flat KNN, indexed + unindexed FTS, unstable row IDs, deletions.
  • IO: allowlist mask in FilteredReadExec (including scan range ordering).
  • Plain scan + filter intersection

Performance evidence (local, not committed)

Plain scan with large IN filter vs allowlist:

  • 1e6 rows
    • filter_in_only: ~18.6 ms
    • allowlist_only: ~1.08 ms
    • ~17× faster
  • 1e7 rows
    • filter_in_only: ~215.3 ms
    • allowlist_only: ~6.95 ms
    • ~31x faster

Notes / limitations

  • IO-level allowlist pruning only applies to non-legacy storage.
  • Legacy paths remains compute-level filtering only.

@github-actions
Copy link
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@fredlarochelle fredlarochelle changed the title Expose row_id allowlist in Scanner + apply across scan/FTS/vector with IO-level pruning feat(scanner): add row_id allowlist support Jan 27, 2026
@github-actions github-actions bot added the enhancement New feature or request label Jan 27, 2026
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool! Coincidentally I was just talking to some colleagues about needing something like that (#5832) so I'm all for it.

Have one initial question about the API now. I'll dig deeper into the implementation tomorrow morning.

///
/// For unstable row ids, the allowlist must come from the same snapshot.
/// For stable row ids, values remain valid across versions.
pub fn row_id_allowlist<I>(&mut self, row_ids: I) -> &mut Self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think of just taking an RowAddrTreeMap as input instead?

Copy link
Author

@fredlarochelle fredlarochelle Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried going in with the least change possible. The way I see it, taking a Vec<u64> / slice / iterator as input is very ergonomic for callers and it matches how filters / IN (...) are expressed. It's a simple API surface.

I didn't think about RowAddrTreeMap. Preliminary thoughts: it's more efficient for callers that already have a map/mask and more explicit about row-address space, but it's less ergonomic for common callers (they must construct a RowAddrTreeMap) and it exposes more internal types in the API. On the plus side, it steers users toward the correct semantics.

For your use case in #5832, RowAddrTreeMap makes the most sense since they operate on row-address masks.

Also this is my first dive into Lance internals, I haven't explored in depth, but I'd imagine row_id_allowlist is easier to expose in Python bindings or higher-level code. I think it could be best to keep row_id_allowlist semantics and add a "power user" overload like row_id_allowlist_map(RowAddrTreeMap) or row_id_allowlist_mask(RowAddrMask). That keeps ergonomics while allowing zero-copy use of pre-build masks.

@wjones127 wjones127 self-assigned this Jan 27, 2026
@codecov
Copy link

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 89.17197% with 68 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/scanner.rs 85.58% 11 Missing and 51 partials ⚠️
rust/lance/src/io/exec/fts.rs 96.52% 1 Missing and 3 partials ⚠️
rust/lance/src/io/exec/knn.rs 96.87% 0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable. My main concern is I'm not sure why it's necessary to modify the FTS and KNN exec nodes. Could you explain that?

Comment on lines +315 to +318
fn filter_batch_by_allowlist(
batch: RecordBatch,
allowlist: &RowAddrMask,
) -> DataFusionResult<RecordBatch> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is duplicated with the one in fts.rs, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is intentional for now. The two version differ only in error context ("KNN input" vs "FTS input"), and I kept it local to avoid extra refactoring in this PR. I'm happy to factor it into a shared helper if you prefer.

pub query: ArrayRef,
pub column: String,
pub distance_type: DistanceType,
allowlist_mask: Option<Arc<RowAddrMask>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make much sense to me. I would think allowlist only needs to be in nodes that do IO. Why does this need to be in KNNVectorDistanceExec and MatchQueryExec? Can't we just push down into the scans and the index PreFilter?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I push the allowlist into IO (FilteredReadExec) when possible and into the index prefilter. For ANN, the prefilter is constructed in ANNIvfSubIndexExec, and for FTS it's built in MatchQueryExec/PhraseQueryExec, so the allowlist has to be injected there. For flat/unindexed or legacy paths, some scan paths (scan_fragments / LanceScanExec) don’t accept allowlists and some inputs aren’t IO‑backed, so KNNVectorDistanceExec/FlatMatchQueryExec enforce it as a correctness backstop. This also avoid doing distance/score work on rows that will be dropped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I'll look deeper into the code to understand better. I wonder if we somehow combine this code path with the ones that propagate deleted rows 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting thought, I will take a look tomorrow at the deleted rows path. Will report back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants