Conversation
PR Review P0: Bug
I feel like a very efficient way would be to tokenize in parallel, and then collect all tokenized data and compute at the end. IIRC tokenization is usually the bottleneck, and I'd imagine the tokenized data is smaller than the original text, especially if you are able to filter out tokens that aren't relevant to the query.
This should be doable! We can tokenize, discard all tokens that aren't in the query, and then accumulate. We might also need to accumulate token counts for each row, but that should be small too. Maybe for a first pass, if we exceed some limit (e.g. 1GB) then we log a warning and just keep going (eventually OOMing). The warning could be something like...
Both of Claude's suggestions are valid. I will revisit this weekend / Monday when I have time to add some regression tests for these cases.
Force-pushed 81dbfbf to 902b77e
Ok, I've implemented this approach. We now only accumulate (2 + N) * u64 per row. The first two u64s are the row id and the total token count in the doc. The N u64s are the count for each token in the query. Since query strings are relatively bounded, this should stay modest into the millions of rows. Once we get into the hundreds of millions or billions this will start to trigger OOMs, so it is important to create an FTS index before that.
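A minimal sketch of that per-row accumulation (illustrative names, not the PR's actual code): each row contributes exactly (2 + N) u64 values, where N is the number of query tokens.

```rust
// Sketch: accumulate (2 + N) u64s per row: row id, total token count in the
// doc, then one occurrence count per query token.
fn accumulate_row(row_id: u64, doc_tokens: &[&str], query_tokens: &[&str]) -> Vec<u64> {
    let mut stats = Vec::with_capacity(2 + query_tokens.len());
    stats.push(row_id);
    stats.push(doc_tokens.len() as u64);
    for &q in query_tokens {
        // Count occurrences of this query token in the document.
        stats.push(doc_tokens.iter().filter(|&&t| t == q).count() as u64);
    }
    stats
}

fn main() {
    let doc = ["the", "quick", "brown", "fox", "the"];
    let query = ["the", "fox", "cat"];
    let stats = accumulate_row(7, &doc, &query);
    // 2 + N = 5 u64s: row id 7, doc length 5, then counts 2, 1, 0.
    assert_eq!(stats, vec![7, 5, 2, 1, 0]);
    println!("{:?}", stats);
}
```

Because only these counts survive (not the tokens themselves), memory grows linearly in row count with a small constant, which is why the bound is driven by the number of rows rather than document size.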
```rust
if has_query_token(doc, &mut tokenizer, &query_tokens) {
    results.push(row_id_array.value(i));
// What is this assertion for? Why would doc contain query? Don't we reach
// here only if they share at least one token? Why is it not debug_assert?
```
Ah, I can't remember why I added this assertion, but it doesn't look reasonable. Feel free to remove it.
```rust
if score > 0.0 {
    row_ids_builder.append_value(row_id);
    scores_builder.append_value(score);
```
It seems to append the row_id multiple times, so we will get duplicated results if we have multiple tokens in the query?
Ouch, good catch. I moved the append outside the for loop.
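The fix can be sketched roughly like this (using plain `Vec`s in place of the actual Arrow builders; names are illustrative): sum the per-token contributions first, then append the row id and score exactly once per row.

```rust
// Sketch of the fix: accumulate the score across all query tokens, then
// append the row id and score once per row, outside the per-token loop.
fn append_row_score(
    row_id: u64,
    per_token_scores: &[f32],
    row_ids: &mut Vec<u64>,
    scores: &mut Vec<f32>,
) {
    // Sum contributions from every query token before touching the builders.
    let score: f32 = per_token_scores.iter().sum();
    if score > 0.0 {
        // One append per row, so multi-token queries no longer duplicate rows.
        row_ids.push(row_id);
        scores.push(score);
    }
}

fn main() {
    let mut row_ids = Vec::new();
    let mut scores = Vec::new();
    // Two query tokens matched row 42; previously it would appear twice.
    append_row_score(42, &[0.5, 0.25], &mut row_ids, &mut scores);
    // Row 43 matched nothing and is skipped entirely.
    append_row_score(43, &[0.0, 0.0], &mut row_ids, &mut scores);
    assert_eq!(row_ids, vec![42]);
    assert_eq!(scores, vec![0.75]);
    println!("{:?} {:?}", row_ids, scores);
}
```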
```rust
for token in query_tokens {
    let freq = doc_token_count.get(token).copied().unwrap_or_default() as f32;
```

```rust
let freq = query_token_counts_iter.next().expect_ok()? as f32;
```
Should we consume this before the `continue` at line 2700?
wjones127 left a comment
I like the approach. Had some minor questions, plus I think Yang has some good questions about correctness within `flat_bm25_score`.
```diff
 }

-pub fn num_docs_containing_token(&self, token: &str) -> usize {
+pub fn num_docs_containing_token(&self, token: &String) -> usize {
```
Reverted. I swear at one point it was complaining but it seems fine now.
```rust
    state.accumulated.pop().unwrap()
} else {
    let b =
        arrow_select::concat::concat_batches(&state.input_schema, &state.accumulated)
```
The logic here appears to be quite sensitive to max_bytes. Could our data in state.accumulated be repeatedly concatenated and sliced?
For example, if max_bytes is 1MiB and we got 64MiB of data, the data inside state.accumulated will go 64 -> 63 -> 62 ... -> 2 -> 1. Should we maintain an offset into the sliced data and make sure we only concat once on the raw input?
Good catch. I've modified the code so it only concatenates if the first slice is not large enough (smaller than min_bytes). Since min_bytes should be sufficiently less than max_bytes, I think we should be good in most cases. In the event we have some really large outlier row and, as a result, slice inappropriately, we may still concatenate, but I think that's enough of an outlier for the moment.
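The revised take logic can be sketched like this, using `Vec<u8>` to stand in for Arrow record batches (names and the min_bytes parameter are illustrative, not the PR's exact API):

```rust
// Sketch: only pay for a concatenation when the first accumulated batch alone
// is below min_bytes; otherwise hand it back without copying.
fn take_accumulated(accumulated: &mut Vec<Vec<u8>>, min_bytes: usize) -> Vec<u8> {
    if accumulated[0].len() >= min_bytes {
        // Common case: no copy, avoiding the repeated concat-and-slice
        // pattern (64 -> 63 -> ... -> 1) flagged in the review.
        accumulated.remove(0)
    } else {
        // Rare case (e.g. an outlier row left a small leading slice):
        // concatenate everything accumulated so far.
        accumulated.drain(..).flatten().collect()
    }
}

fn main() {
    let mut acc = vec![vec![1u8, 2, 3], vec![4, 5]];
    // First batch (3 bytes) meets min_bytes = 2: returned without a concat.
    assert_eq!(take_accumulated(&mut acc, 2), vec![1, 2, 3]);
    assert_eq!(acc, vec![vec![4, 5]]);
    // First batch (2 bytes) is below min_bytes = 4: everything is concatenated.
    acc.push(vec![6]);
    assert_eq!(take_accumulated(&mut acc, 4), vec![4, 5, 6]);
    assert!(acc.is_empty());
}
```

The design bet is that the cheap no-copy branch dominates as long as min_bytes stays well below max_bytes, so the O(n) concatenation only happens on outliers.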
Force-pushed d8121cf to 2bdcdce
This adds various performance improvements to the flat FTS search. The most significant improvement is that it parallelizes the search.
This does have some impact on accuracy. To calculate BM25 we typically need to make two passes through the data: the first to count token frequencies, and the second to compute the scores. The current implementation avoids this by using the "token frequency so far" when calculating the BM25 score. This is generally accurate when there is a lot of indexed data and a small amount of unindexed data, because the "token frequency so far" gets bootstrapped by the frequencies from the index, and so the effect of the unindexed frequencies is minimal.
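For reference, the standard BM25 scoring function (the exact variant and constants used here may differ):

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(q_i, D)$ is the frequency of query token $q_i$ in document $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length, and $\mathrm{IDF}(q_i)$ is derived from the number of documents containing $q_i$. That document-frequency count is exactly the statistic being approximated by the running "token frequency so far".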
However, the inaccuracy can be more significant when there is no index, or when the unindexed data makes up a significant portion of the data. In that case the "token frequency so far" can be quite inaccurate for the first few documents.
Parallelizing this search makes the problem worse, since each thread calculates its own independent "token frequency so far" and it takes each one longer to converge on an accurate estimate.
The most accurate approach would probably be to just accumulate all data in memory, tokenize (in parallel), count token frequencies (back to serial), then calculate scores (in parallel). However, this runs the risk of accumulating too much data.
Another alternative could be to accumulate up to some amount (e.g. 100MB), calculate initial token frequencies, and then parallelize the rest of the search using those initial token frequencies. I'm open to suggestions. In the meantime we could probably proceed with this PR as-is.
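The second alternative could be sketched roughly as follows (hypothetical names and a placeholder byte budget; tokenization here is naive whitespace splitting, not the PR's tokenizer): buffer documents until a byte budget is reached, derive initial token frequencies from that prefix, then score the buffered prefix and the remaining stream using those seeded frequencies.

```rust
use std::collections::HashMap;

// Placeholder budget; the PR discussion suggests something like 100MB.
const SEED_BUDGET_BYTES: usize = 100 * 1024 * 1024;

// Consume documents until the byte budget is hit, returning the token
// frequencies observed so far plus how many docs were consumed for seeding.
fn seed_frequencies<'a>(
    docs: impl Iterator<Item = &'a str>,
) -> (HashMap<&'a str, u64>, usize) {
    let mut freqs = HashMap::new();
    let mut bytes = 0usize;
    let mut seeded_docs = 0usize;
    for doc in docs {
        for token in doc.split_whitespace() {
            *freqs.entry(token).or_insert(0) += 1;
        }
        bytes += doc.len();
        seeded_docs += 1;
        if bytes >= SEED_BUDGET_BYTES {
            break;
        }
    }
    // The rest of the stream would then be scored in parallel using `freqs`
    // as the initial frequencies instead of starting each thread from zero.
    (freqs, seeded_docs)
}

fn main() {
    let docs = ["the quick fox", "the lazy dog"];
    let (freqs, n) = seed_frequencies(docs.iter().copied());
    assert_eq!(n, 2);
    assert_eq!(freqs["the"], 2);
    assert_eq!(freqs["fox"], 1);
    println!("seeded {} docs", n);
}
```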
In addition to the parallelization, this PR also makes various changes to the algorithm itself to avoid string copies. This cuts CPU time by 5x on my system.