feat(scanner): add row_id allowlist support#5831
feat(scanner): add row_id allowlist support#5831fredlarochelle wants to merge 10 commits intolance-format:mainfrom
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
wjones127
left a comment
There was a problem hiding this comment.
This is cool! Coincidentally I was just talking to some colleagues about needing something like that (#5832) so I'm all for it.
Have one initial question about the API now. I'll dig deeper into the implementation tomorrow morning.
| /// | ||
| /// For unstable row ids, the allowlist must come from the same snapshot. | ||
| /// For stable row ids, values remain valid across versions. | ||
| pub fn row_id_allowlist<I>(&mut self, row_ids: I) -> &mut Self |
There was a problem hiding this comment.
What would you think of just taking an RowAddrTreeMap as input instead?
There was a problem hiding this comment.
I tried going in with the least change possible. The way I see it, taking a Vec<u64> / slice / iterator as input is very ergonomic for callers and it matches how filters / IN (...) are expressed. It's a simple API surface.
I didn't think about RowAddrTreeMap. Preliminary thoughts: it's more efficient for callers that already have a map/mask and more explicit about row-address space, but it's less ergonomic for common callers (they must construct a RowAddrTreeMap) and it exposes more internal types in the API. On the plus side, it steers users toward the correct semantics.
For your use case in #5832, RowAddrTreeMap makes the most sense since they operate on row-address masks.
Also this is my first dive into Lance internals, I haven't explored in depth, but I'd imagine row_id_allowlist is easier to expose in Python bindings or higher-level code. I think it could be best to keep row_id_allowlist semantics and add a "power user" overload like row_id_allowlist_map(RowAddrTreeMap) or row_id_allowlist_mask(RowAddrMask). That keeps ergonomics while allowing zero-copy use of pre-build masks.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
wjones127
left a comment
There was a problem hiding this comment.
This seems reasonable. My main concern is I'm not sure why it's necessary to modify the FTS and KNN exec nodes. Could you explain that?
| fn filter_batch_by_allowlist( | ||
| batch: RecordBatch, | ||
| allowlist: &RowAddrMask, | ||
| ) -> DataFusionResult<RecordBatch> { |
There was a problem hiding this comment.
This function is duplicated with the one in fts.rs, right?
There was a problem hiding this comment.
Yes, it is intentional for now. The two version differ only in error context ("KNN input" vs "FTS input"), and I kept it local to avoid extra refactoring in this PR. I'm happy to factor it into a shared helper if you prefer.
| pub query: ArrayRef, | ||
| pub column: String, | ||
| pub distance_type: DistanceType, | ||
| allowlist_mask: Option<Arc<RowAddrMask>>, |
There was a problem hiding this comment.
This doesn't make much sense to me. I would think allowlist only needs to be in nodes that do IO. Why does this need to be in KNNVectorDistanceExec and MatchQueryExec? Can't we just push down into the scans and the index PreFilter?
There was a problem hiding this comment.
Good question. I push the allowlist into IO (FilteredReadExec) when possible and into the index prefilter. For ANN, the prefilter is constructed in ANNIvfSubIndexExec, and for FTS it's built in MatchQueryExec/PhraseQueryExec, so the allowlist has to be injected there. For flat/unindexed or legacy paths, some scan paths (scan_fragments / LanceScanExec) don’t accept allowlists and some inputs aren’t IO‑backed, so KNNVectorDistanceExec/FlatMatchQueryExec enforce it as a correctness backstop. This also avoid doing distance/score work on rows that will be dropped.
There was a problem hiding this comment.
Okay. I'll look deeper into the code to understand better. I wonder if we somehow combine this code path with the ones that propagate deleted rows 🤔
There was a problem hiding this comment.
Interesting thought, I will take a look tomorrow at the deleted rows path. Will report back.
Co-authored-by: Will Jones <willjones127@gmail.com>
We hit a severe bottleneck when filtering on very large
IN (...)lists (millions of comparisons) in scalar filters. We can cheaply pre-compute a bitmap row-ID allowlist in memory, but Lance had no way to inject it into the query pipeline. This PR exposes that path so we can bypass expensive scalar filtering and push the allowlist directory into scan / FTS / vector execution, including IO-level pruning on non-legacy storage. Once this PR is merged, we plan to add LanceDB support.What this PR does
Scanner:row_id_allowlistfor supplying a row-ID allowlist (dataset row-ID space).TakeOperations, then treating any remaining filters as refine-only.FilteredReadExec.Semantics
_rowid)_rowidis row address, allowlist must be snapshot-bound._rowidis stable across versions.Tests added
FilteredReadExec(including scan range ordering).Performance evidence (local, not committed)
Plain scan with large IN filter vs allowlist:
Notes / limitations