feat(scanner): add row_id allowlist support by fredlarochelle · Pull Request #5831 · lance-format/lance

fredlarochelle · 2026-01-27T21:16:10Z

We hit a severe bottleneck when filtering on very large IN (...) lists (millions of comparisons) in scalar filters. We can cheaply pre-compute a bitmap row-ID allowlist in memory, but Lance had no way to inject it into the query pipeline. This PR exposes that path so we can bypass expensive scalar filtering and push the allowlist directly into scan / FTS / vector execution, including IO-level pruning on non-legacy storage. Once this PR is merged, we plan to add LanceDB support.

What this PR does

Adds Scanner::row_id_allowlist for supplying a row-ID allowlist (dataset row-ID space).
Enforces allowlist on plain scans by intersecting with any TakeOperations, then treating any remaining filters as refine-only.
Plumbs allowlist through vector and FTS prefilter paths (ANN/KNN + FTS), including unindexed/flat paths.
Adds IO‑level pruning on non‑legacy storage by passing allowlist into FilteredReadExec.
Ensures allowlist always intersects with deletion masks and other filters.
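
The plain-scan behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration using only the Rust standard library, not the PR's implementation: `Row`, `scan_with_allowlist`, and the use of `BTreeSet<u64>` in place of Lance's bitmap types are all hypothetical stand-ins.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a row: (row id, column value).
type Row = (u64, i64);

/// Sketch of a plain scan with an allowlist:
/// 1. intersect the allowlist with any explicit take set,
/// 2. drop deleted rows,
/// 3. evaluate the remaining predicate only on surviving rows (refine-only).
fn scan_with_allowlist(
    rows: &[Row],
    allowlist: &BTreeSet<u64>,
    take: Option<&BTreeSet<u64>>,
    deleted: &BTreeSet<u64>,
    refine: impl Fn(i64) -> bool,
) -> Vec<u64> {
    // The allowlist is a hard constraint, so a take set can only narrow it.
    let candidates: BTreeSet<u64> = match take {
        Some(t) => allowlist.intersection(t).copied().collect(),
        None => allowlist.clone(),
    };
    rows.iter()
        // Membership and deletion checks come first and are cheap.
        .filter(|(id, _)| candidates.contains(id) && !deleted.contains(id))
        // The (possibly expensive) predicate only sees rows that survived.
        .filter(|(_, v)| refine(*v))
        .map(|(id, _)| *id)
        .collect()
}
```

The point of the ordering is that the refine predicate never runs on rows the allowlist has already excluded.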

Semantics

Allowlist is a hard constraint in dataset row-ID space (_rowid)
If stable row IDs are disabled, _rowid is the row address, so the allowlist must be snapshot-bound.
If stable row IDs are enabled, _rowid is stable across versions.
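
To illustrate why an address-based allowlist is snapshot-bound, here is a small sketch. It assumes a row address packs the fragment id into the high 32 bits and the in-fragment offset into the low 32 bits; that layout, and the helper names `row_address` and `unpack`, are assumptions for illustration rather than something stated in this PR.

```rust
/// Pack a fragment id and an in-fragment offset into a 64-bit row address.
/// Assumption: fragment id in the high 32 bits, offset in the low 32 bits.
fn row_address(fragment_id: u32, offset: u32) -> u64 {
    ((fragment_id as u64) << 32) | offset as u64
}

/// Split a row address back into (fragment id, offset).
fn unpack(addr: u64) -> (u32, u32) {
    ((addr >> 32) as u32, addr as u32)
}

/// Hypothetical scenario: the same logical row before and after a rewrite
/// (for example compaction) that moves it to a different fragment.
fn address_survives_rewrite() -> bool {
    let before = row_address(0, 5);
    let after = row_address(7, 2);
    // An allowlist built from `before` would no longer select the row.
    before == after
}
```

Under these assumptions, any operation that moves a row between fragments changes its address, so an allowlist captured against one version may silently select the wrong rows in another. Stable row IDs avoid this because the identifier does not encode physical position.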

Tests added

Runtime: allowlist with ANN + flat KNN, indexed + unindexed FTS, unstable row IDs, deletions.
IO: allowlist mask in FilteredReadExec (including scan range ordering).
Plain scan + filter intersection

Performance evidence (local, not committed)

Plain scan with large IN filter vs allowlist:

1e6 rows
- filter_in_only: ~18.6 ms
- allowlist_only: ~1.08 ms
- ~17× faster
1e7 rows
- filter_in_only: ~215.3 ms
- allowlist_only: ~6.95 ms
- ~31× faster

Notes / limitations

IO-level allowlist pruning only applies to non-legacy storage.
Legacy paths remain compute-level filtering only.

github-actions · 2026-01-27T21:16:31Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

wjones127

This is cool! Coincidentally I was just talking to some colleagues about needing something like that (#5832) so I'm all for it.

Have one initial question about the API now. I'll dig deeper into the implementation tomorrow morning.

wjones127 · 2026-01-27T22:23:26Z

rust/lance/src/dataset/scanner.rs

+    ///
+    /// For unstable row ids, the allowlist must come from the same snapshot.
+    /// For stable row ids, values remain valid across versions.
+    pub fn row_id_allowlist<I>(&mut self, row_ids: I) -> &mut Self


What would you think of just taking a RowAddrTreeMap as input instead?

I tried going in with the least change possible. The way I see it, taking a Vec<u64> / slice / iterator as input is very ergonomic for callers and it matches how filters / IN (...) are expressed. It's a simple API surface.

I didn't think about RowAddrTreeMap. Preliminary thoughts: it's more efficient for callers that already have a map/mask and more explicit about row-address space, but it's less ergonomic for common callers (they must construct a RowAddrTreeMap) and it exposes more internal types in the API. On the plus side, it steers users toward the correct semantics.

For your use case in #5832, RowAddrTreeMap makes the most sense since they operate on row-address masks.

Also, this is my first dive into Lance internals and I haven't explored in depth, but I'd imagine row_id_allowlist is easier to expose in Python bindings or higher-level code. I think it would be best to keep the row_id_allowlist semantics and add a "power user" overload like row_id_allowlist_map(RowAddrTreeMap) or row_id_allowlist_mask(RowAddrMask). That keeps the ergonomics while allowing zero-copy use of prebuilt masks.
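
A rough sketch of the two-entry-point shape being proposed here, for illustration only. `SketchScanner` is a hypothetical stand-in for the real scanner, and `BTreeSet<u64>` stands in for RowAddrTreeMap; only the generic signature of `row_id_allowlist` is taken from the diff above.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the scanner; only the allowlist state is modelled.
#[derive(Default)]
struct SketchScanner {
    allowlist: Option<BTreeSet<u64>>,
}

impl SketchScanner {
    /// Ergonomic entry point: accepts a Vec, a range, or any other iterator of ids.
    fn row_id_allowlist<I>(&mut self, row_ids: I) -> &mut Self
    where
        I: IntoIterator<Item = u64>,
    {
        self.allowlist = Some(row_ids.into_iter().collect());
        self
    }

    /// "Power user" entry point: takes ownership of a prebuilt set, so the
    /// ids are not re-collected.
    fn row_id_allowlist_map(&mut self, map: BTreeSet<u64>) -> &mut Self {
        self.allowlist = Some(map);
        self
    }

    /// A row is allowed if no allowlist is set or the allowlist contains it.
    fn allows(&self, row_id: u64) -> bool {
        self.allowlist.as_ref().map_or(true, |s| s.contains(&row_id))
    }
}
```

Both methods return `&mut Self`, so either can be chained with other builder calls; the second simply skips the collection step for callers that already hold a set.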

codecov · 2026-01-27T22:55:46Z

Codecov Report

❌ Patch coverage is 89.17197% with 68 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/scanner.rs	85.58%	11 Missing and 51 partials ⚠️
rust/lance/src/io/exec/fts.rs	96.52%	1 Missing and 3 partials ⚠️
rust/lance/src/io/exec/knn.rs	96.87%	0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

wjones127

This seems reasonable. My main concern is I'm not sure why it's necessary to modify the FTS and KNN exec nodes. Could you explain that?

rust/lance/src/dataset/scanner.rs

wjones127 · 2026-01-28T21:57:27Z

rust/lance/src/io/exec/knn.rs

+fn filter_batch_by_allowlist(
+    batch: RecordBatch,
+    allowlist: &RowAddrMask,
+) -> DataFusionResult<RecordBatch> {


This function is duplicated with the one in fts.rs, right?

Yes, it is intentional for now. The two versions differ only in error context ("KNN input" vs "FTS input"), and I kept it local to avoid extra refactoring in this PR. I'm happy to factor it into a shared helper if you prefer.
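
One way the shared helper could look is to pass the error context in as a parameter. The sketch below is an assumption-laden illustration: `SketchBatch` is a hypothetical two-column stand-in for RecordBatch, `BTreeSet<u64>` replaces RowAddrMask, and the error type is a plain `String` rather than a DataFusion error.

```rust
use std::collections::BTreeSet;

/// Hypothetical minimal batch: an optional row-id column plus one score column.
#[derive(Debug, PartialEq)]
struct SketchBatch {
    row_ids: Option<Vec<u64>>,
    scores: Vec<f32>,
}

/// Shared helper; `context` names the caller ("KNN input", "FTS input")
/// so the two call sites differ only in the string they pass.
fn filter_batch_by_allowlist(
    batch: SketchBatch,
    allowlist: &BTreeSet<u64>,
    context: &str,
) -> Result<SketchBatch, String> {
    // Without a row-id column there is nothing to match against the allowlist.
    let row_ids = batch
        .row_ids
        .ok_or_else(|| format!("{context}: missing row id column"))?;
    // Keep only the (id, score) pairs whose id is in the allowlist.
    let (ids, scores): (Vec<u64>, Vec<f32>) = row_ids
        .into_iter()
        .zip(batch.scores)
        .filter(|(id, _)| allowlist.contains(id))
        .unzip();
    Ok(SketchBatch { row_ids: Some(ids), scores })
}
```

A KNN call site would pass "KNN input" and an FTS call site "FTS input", which preserves the distinct error messages while removing the duplicated body.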

wjones127 · 2026-01-28T21:58:25Z

rust/lance/src/io/exec/knn.rs

    pub query: ArrayRef,
    pub column: String,
    pub distance_type: DistanceType,
+    allowlist_mask: Option<Arc<RowAddrMask>>,


This doesn't make much sense to me. I would think allowlist only needs to be in nodes that do IO. Why does this need to be in KNNVectorDistanceExec and MatchQueryExec? Can't we just push down into the scans and the index PreFilter?

Good question. I push the allowlist into IO (FilteredReadExec) when possible and into the index prefilter. For ANN, the prefilter is constructed in ANNIvfSubIndexExec, and for FTS it's built in MatchQueryExec/PhraseQueryExec, so the allowlist has to be injected there. For flat/unindexed or legacy paths, some scan paths (scan_fragments / LanceScanExec) don’t accept allowlists and some inputs aren’t IO‑backed, so KNNVectorDistanceExec/FlatMatchQueryExec enforce it as a correctness backstop. This also avoids doing distance/score work on rows that will be dropped.

Okay. I'll look deeper into the code to understand better. I wonder if we could somehow combine this code path with the ones that propagate deleted rows 🤔

Interesting thought, I will take a look tomorrow at the deleted rows path. Will report back.
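
The idea of folding the allowlist into the same path as deleted rows can be sketched as a single mask type that represents either "only these rows" or "everything except these rows". `SketchMask` below is a hypothetical illustration built on `BTreeSet<u64>`; it is not Lance's RowAddrMask and makes no claim about how that type is implemented.

```rust
use std::collections::BTreeSet;

/// Hypothetical mask: either "only these rows" or "everything except these rows".
#[derive(Debug, Clone, PartialEq)]
enum SketchMask {
    Allow(BTreeSet<u64>),
    Block(BTreeSet<u64>),
}

impl SketchMask {
    /// Whether the mask selects the given row.
    fn selected(&self, row: u64) -> bool {
        match self {
            SketchMask::Allow(s) => s.contains(&row),
            SketchMask::Block(s) => !s.contains(&row),
        }
    }

    /// Intersection of two masks: a row survives only if both masks select it.
    fn and(&self, other: &SketchMask) -> SketchMask {
        use SketchMask::*;
        match (self, other) {
            // Two allowlists: keep rows present in both.
            (Allow(a), Allow(b)) => Allow(a.intersection(b).copied().collect()),
            // Allowlist plus blocklist: remove blocked rows from the allowlist.
            (Allow(a), Block(b)) | (Block(b), Allow(a)) => {
                Allow(a.difference(b).copied().collect())
            }
            // Two blocklists: a row blocked by either stays blocked.
            (Block(a), Block(b)) => Block(a.union(b).copied().collect()),
        }
    }
}
```

With this shape, a user allowlist is `Allow(..)`, deletions are `Block(..)`, and `user.and(&deletions)` yields one mask that a prefilter or IO node could consume without knowing where each constraint came from.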

rust/lance/src/io/exec/fts.rs

Co-authored-by: Will Jones <willjones127@gmail.com>

fredlarochelle added 6 commits January 27, 2026 16:11

feat(scanner): add row id allowlist for plain scans

829e6a5

feat(prefilter): apply allowlist mask in fts/ann prefilters

48d98d0

feat(flat-search): apply allowlist to flat fts/knn

e022077

feat(scanner): support row-id allowlist in filtered reads

2794e25

test(scanner): add row_id allowlist runtime coverage

c11971f

test(scanner): add allowlist + filter plain scan case

6075e88

fredlarochelle changed the title ~~Expose row_id allowlist in Scanner + apply across scan/FTS/vector with IO-level pruning~~ feat(scanner): add row_id allowlist support Jan 27, 2026

github-actions bot added the enhancement New feature or request label Jan 27, 2026

wjones127 reviewed Jan 27, 2026

wjones127 self-assigned this Jan 27, 2026

fredlarochelle added 2 commits January 27, 2026 18:49

test(scanner): cover row_id_allowlist paths

0667bbd

test(scanner): fix clippy in allowlist test

69676c4

fredlarochelle mentioned this pull request Jan 28, 2026

perf(ivf): reduce slow IVF v2 test runtime #5838

Open

wjones127 reviewed Jan 28, 2026

fredlarochelle and others added 2 commits January 28, 2026 19:23

perf(scanner): preallocate row_ids buffer

a3aa236

Co-authored-by: Will Jones <willjones127@gmail.com>

test(fts/knn): use record_batch macro in allowlist tests

994f174
