feat!: introduce fragment reuse index to defer compaction index remap by jackye1995 · Pull Request #3847 · lance-format/lance

jackye1995 · 2025-05-19T06:29:23Z

BREAKING CHANGE: Signature of dataset::optimize::commit_compaction has renamed options as remap_options, and takes additional CompactionOptions as options. The dataset::transaction::Operation::Rewrite adds a new field frag_reuse_index: Option<Index> that holds optional fragment reuse index metadata if present.

This PR introduces a new index, fragment reuse index, to store compaction row address information after compactions. The index will be created when compaction has option defer_index_remap enabled.

The index is named as "fragment reuse" since the ultimate goal is to allow query to reuse old indexed fragments in replacement of new unindexed fragments for index queries. An alternative was to name it a "remap index", but that name was too confusing as the term "remap" is also a verb with functions like remap_index already in the codebase.

Closes #3833 #3834

codecov-commenter · 2025-05-19T07:02:40Z

Codecov Report

Attention: Patch coverage is 79.15452% with 143 lines in your changes missing coverage. Please review.

Project coverage is 78.29%. Comparing base (bf2f72e) to head (3905fe5).

Files with missing lines	Patch %	Lines
rust/lance/src/index/frag_reuse.rs	61.41%	42 Missing and 7 partials ⚠️
rust/lance-index/src/frag_reuse.rs	75.17%	33 Missing and 2 partials ⚠️
rust/lance/src/index.rs	6.66%	27 Missing and 1 partial ⚠️
rust/lance/src/dataset/optimize.rs	96.53%	6 Missing and 6 partials ⚠️
rust/lance/src/index/cache.rs	52.17%	10 Missing and 1 partial ⚠️
rust/lance/src/dataset/transaction.rs	61.11%	7 Missing ⚠️
rust/lance-index/src/lib.rs	0.00%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##             main    #3847    +/-   ##
========================================
  Coverage   78.29%   78.29%            
========================================
  Files         281      283     +2     
  Lines      108112   108783   +671     
  Branches   108112   108783   +671     
========================================
+ Hits        84641    85167   +526     
- Misses      20126    20254   +128     
- Partials     3345     3362    +17

Flag	Coverage Δ
unittests	`78.29% <79.15%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

wjones127 · 2025-05-19T17:07:34Z

rust/lance-index/src/reuse.rs

+pub async fn save_row_id_map(
+    index_store: Arc<dyn IndexStore>,
+    dataset_version: &u64,
+    row_id_map: &HashMap<u64, Option<u64>>,


This structure feels a bit unfortunate given that the common case is mapping sorted contiguous ranges to other sorted contiguous ranges. Have you considered re-using some of the U64Segment or RowIdSequence machinery?

Yeah I used HashMap becuase these row ID maps come from the RewriteResult of optimize.rewrite_files, at that point it's already a hashmap and the sorting information is already lost. But I can make this storage more efficient with the data structure you mentioned, and then make the upstream changes in optimize based on what we arrive at here.

wjones127 · 2025-05-19T17:13:41Z

rust/lance-index/src/reuse.rs

+pub async fn load_row_id_map(
+    file_path: &str,
+    store: Arc<dyn IndexStore>,
+) -> Result<HashMap<u64, Option<u64>>> {


If we are going to use this within the conflict resolution code path, I think we should aggressive in limiting IO requests. You can add a unit test that measures the number of read and write IO requests using StatsHolder. Here is an example test that does this:

https://github.com/lancedb/lance/blob/01962adcd8d188a1dd1a9daa5ca5ebcc524d9b47/rust/lance/src/index.rs#L1835

Might need to store the file size of the index file in the metadata so that you can skip the additional head request when opening the index file.

wjones127 · 2025-05-19T17:14:47Z

protos/table.proto

+    // version of the dataset corresponds to the dataset_version at the time the index adds this version entry
+    uint64 dataset_version = 1;


If we want to use this for conflict resolution, we should probably add this index in the same commit as compaction.

Yes that is the goal, this should be the same dataset version as the compaction

jackye1995 · 2025-05-29T17:05:14Z

looks like this breaks backwards compatibility, will do the version bump while the CI is running

jackye1995 · 2025-05-29T17:47:52Z

Given #3882 is also breaking, I will wait to see if this PR meets the next release, if not I will bump the versions.

wjones127 · 2025-05-29T16:29:43Z

protos/table.proto

+    // a roaring treemap of the changed row addresses
+    bytes changed_row_addrs = 2;


I feel like this could be explained more, like how it is used. I feel like a roaring treemap acts as a set, but we likely want a mapping?

Or is the intended use "if you have row addrs in here, then use the old fragments instead of the new fragments while scanning / taking"?

the changed_row_addrs + old fragment ids + new fragment ids can be used to together recover the full old address to new address mapping, and that is used for separated remap (another PR) and use for scanning/taking at read path (another PR). I can add the details there.

wjones127 · 2025-05-29T20:30:10Z

rust/lance-index/src/frag_reuse.rs

+use snafu::location;
+use std::{any::Any, collections::HashMap, sync::Arc};
+
+pub const FRAG_REUSE_INDEX_NAME: &str = "__lance_frag_reuse";


Are we going to ignore this index type in the in list_indices() output?

No, because we still need to be able to discover this index.

Yeah, I'm just thinking about whether it will confuse users. We should be careful when we bring this up to lancedb. I think it might error if list_indices() returns an index type it doesn't recognize.

I see let me create an issue to track this

wjones127 · 2025-05-29T21:34:42Z

rust/lance/src/dataset/transaction.rs

                .fields
                .iter()
                .all(|field_id| field_ids.contains(field_id))
+                || existing_index.name == FRAG_REUSE_INDEX_NAME


Does this mean the frag reuse indices will never go away? How long will we retain them?

after all the indexes are remapped it can go away, so depending on the reindexing process. But given we want to continuously compact and reindex at the same time, this intermediate state might always exist.

wjones127 · 2025-05-29T21:36:58Z

rust/lance/src/index/cache.rs

    // TODO: Can we merge these two caches into one for uniform memory management?
    scalar_cache: Arc<Cache<String, Arc<dyn ScalarIndex>>>,
    vector_cache: Arc<Cache<String, Arc<dyn VectorIndex>>>,
+    frag_reuse_cache: Arc<Cache<String, Arc<FragReuseIndex>>>,


🤣 😭 I will consolidate these soon...

yeah I saw the TODO so was thinking it could be done together, currently I am just keeping things consistent...

wjones127 · 2025-05-29T21:39:14Z

rust/lance/src/index/frag_reuse.rs

+                    let pb_sequence = InlineContent::decode(data)?;
+                    FragReuseIndexDetails::try_from(pb_sequence)


Are we double caching this data? Won't it also be stored in the index cache?

yeah you are right, the content will be in the index cache anyway so this is not necessary. Let me remove

wjones127 · 2025-05-29T22:24:36Z

protos/table.proto

+    // When combined with the old fragment IDs and new fragment IDs,
+    // it can recover the full mapping of old row addresses to either new row addresses or deleted.
+    // this mapping can then be used to remap indexes or satisfy index queries for the new unindexed fragments.


Does this assume that no re-ordering happens?

yes the order for which the old and new fragments are stored must preserve what happened at the rewrite time. That's actually a good point, the comment below should be updated, since structures like roaring bitmap cannot be used to store this, only a list. Let me update the comments.

ztorchan · 2025-09-12T01:59:03Z

rust/lance/src/dataset/optimize.rs

        rechunk_stable_row_ids(dataset.as_ref(), &mut new_fragments, &fragments).await?;

-        HashMap::new()
+        if options.defer_index_remap {


Hello, It seems that there will be a empty FragReuseIndex when use_stable_row_id && defer_index_remap.
Why we need the empty FragReuseIndex?

Hi @ztorchan I would suggest you raise a new issue to file a problem you have encountered. This work was done before the stable row id feature. And recently we have changed some logic for realizing the update-stable row ID.

Yeah let's discuss it separately. But overall I think maybe we should use a separated flag here as use_stable_row_id_for_index. Because that's what this flag is doing here, and it assumes either you enable that flag and you don't need to remap, or you enable the other flag then you defer the remap.

jackye1995 requested a review from westonpace May 19, 2025 06:29

github-actions bot added the enhancement New feature or request label May 19, 2025

jackye1995 force-pushed the remape branch from 0fd99a1 to f562227 Compare May 19, 2025 15:53

jackye1995 requested a review from wjones127 May 19, 2025 16:54

wjones127 reviewed May 19, 2025

View reviewed changes

jackye1995 force-pushed the remape branch from 98af92f to c10a651 Compare May 29, 2025 09:02

jackye1995 changed the title ~~feat: introduce fragment reuse index to store compaction row ID maps~~ feat: introduce fragment reuse index May 29, 2025

jackye1995 changed the title ~~feat: introduce fragment reuse index~~ feat: introduce fragment reuse index for compaction May 29, 2025

jackye1995 changed the title ~~feat: introduce fragment reuse index for compaction~~ feat: introduce fragment reuse index to defer compaction index remap May 29, 2025

jackye1995 requested a review from wjones127 May 29, 2025 09:05

jackye1995 force-pushed the remape branch 3 times, most recently from 417f95a to 35f58ec Compare May 29, 2025 16:25

github-actions bot added the python label May 29, 2025

wjones127 reviewed May 29, 2025

View reviewed changes

wjones127 approved these changes May 29, 2025

View reviewed changes

jackye1995 changed the title ~~feat: introduce fragment reuse index to defer compaction index remap~~ feat!: introduce fragment reuse index to defer compaction index remap May 29, 2025

github-actions bot added the breaking-change label May 29, 2025

jackye1995 added 7 commits May 29, 2025 15:40

feat: introduce fragment reuse index

b645c9b

fix lint

741ec56

fix typo

a62c395

fix python

f304f88

address comments

736e6ac

fix typo

ba0337a

update comment

3905fe5

jackye1995 force-pushed the remape branch from 538a64f to 3905fe5 Compare May 29, 2025 22:43

jackye1995 merged commit a26524c into lance-format:main May 29, 2025
27 checks passed

jackye1995 mentioned this pull request May 30, 2025

bug(python): list_indices should skip fragment reuse index lancedb/lancedb#2412

Closed

ztorchan reviewed Sep 12, 2025

View reviewed changes

jackye1995 mentioned this pull request Sep 18, 2025

Compaction should not allow both defer_index_remap and enable_stable_row_id #4767

Open

		// version of the dataset corresponds to the dataset_version at the time the index adds this version entry
		uint64 dataset_version = 1;

		// a roaring treemap of the changed row addresses
		bytes changed_row_addrs = 2;

		let pb_sequence = InlineContent::decode(data)?;
		FragReuseIndexDetails::try_from(pb_sequence)

Conversation

jackye1995 commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jackye1995 commented May 29, 2025

Uh oh!

jackye1995 commented May 29, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jackye1995 Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

jackye1995 commented May 19, 2025 •

edited

Loading

codecov-commenter commented May 19, 2025 •

edited

Loading

jackye1995 Sep 12, 2025 •

edited

Loading