feat: add file list with sizes to IndexMetadata #5497
Draft
wjones127 wants to merge 10 commits into lance-format:main
Add a new `files` field to `IndexMetadata` that stores a list of all files (with sizes) per index segment. This enables:

1. Skipping HEAD calls when opening indices by using cached file sizes
2. Exposing total index size via the `describe_indices()` API

The file sizes are captured during index creation and stored in the manifest. When opening indices, the cached sizes are used to avoid expensive HEAD calls to cloud storage.

Fixes lance-format#5226

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
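To make the two benefits above concrete, here is a minimal sketch, not the code from this PR. The `IndexFile` field names (`path`, `size_bytes`) and the helper methods `total_size_bytes` and `cached_size` are assumptions made for illustration; the PR description only states that `IndexMetadata` gains a `files` list with sizes.

```rust
/// Hypothetical stand-in for one entry of `IndexMetadata.files`.
/// The field names are assumptions, not taken from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexFile {
    pub path: String,
    pub size_bytes: u64,
}

/// Simplified stand-in for `IndexMetadata`; only the new field is modeled.
#[derive(Debug, Clone, Default)]
pub struct IndexMetadata {
    /// `None` for indices written before the field existed.
    pub files: Option<Vec<IndexFile>>,
}

impl IndexMetadata {
    /// Total index size, or `None` when no sizes were recorded.
    pub fn total_size_bytes(&self) -> Option<u64> {
        self.files
            .as_ref()
            .map(|files| files.iter().map(|f| f.size_bytes).sum::<u64>())
    }

    /// Cached size for `path`. `None` means the caller has to fall back
    /// to a HEAD request against the object store.
    pub fn cached_size(&self, path: &str) -> Option<u64> {
        self.files
            .as_ref()?
            .iter()
            .find(|f| f.path == path)
            .map(|f| f.size_bytes)
    }
}
```

Modeling `files` as an `Option` keeps old manifests readable: a missing list simply means "size unknown" rather than "no files".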
Add backwards compatibility support for indices created before the `files` field was added to `IndexMetadata`:

- Add migration in `migrate_indices()` to collect file sizes for indices missing them during write operations
- Add test data (`pre_file_sizes/index_without_file_sizes`) created with lance 2.0.0-beta.1, which doesn't have the `files` field
- Add `test_index_without_file_sizes` to verify old indices still work
- Add `test_index_file_size_migration` to verify migration works
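The migration described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the body of `migrate_indices()`: the function name `backfill_file_sizes`, the `list_files` callback, and the struct fields are all hypothetical.

```rust
/// Hypothetical stand-in for one entry of `IndexMetadata.files`.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexFile {
    pub path: String,
    pub size_bytes: u64,
}

/// Simplified stand-in for `IndexMetadata`.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexMetadata {
    pub name: String,
    /// `None` for indices written before the field existed.
    pub files: Option<Vec<IndexFile>>,
}

/// Fill in `files` for every index that lacks it and return how many
/// indices were migrated. `list_files` stands in for whatever storage
/// listing the real migration performs (an assumption of this sketch).
pub fn backfill_file_sizes<F>(indices: &mut [IndexMetadata], mut list_files: F) -> usize
where
    F: FnMut(&str) -> Vec<IndexFile>,
{
    let mut migrated = 0;
    for index in indices.iter_mut() {
        // Indices that already carry sizes are left untouched.
        if index.files.is_none() {
            index.files = Some(list_files(&index.name));
            migrated += 1;
        }
    }
    migrated
}
```

Because only entries with `files == None` are touched, running the backfill on every write is idempotent.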
Previously, file sizes were only captured during initial index creation. This change adds file size capture to all remaining code paths:

- index/append.rs: delta append merge for scalar and vector indices
- index/vector.rs: `initialize_vector_index` for cross-dataset index init
- index.rs: `remap_index` for both scalar and vector index remapping
- dataset/optimize/remapping.rs: added `files` field to `RemappedIndex`
Tests that index file sizes are populated correctly through:

- Initial index creation (BTree, Bitmap, Inverted)
- `optimize_indices` (update/merge operations)
- Index remapping after compaction
All scalar index implementations now populate the `files` field with actual file sizes when creating, updating, or remapping indices. This change adds the `list_files_with_sizes()` method to the `IndexStore` trait and updates all implementations to use it.

Updated files:

- bitmap.rs, btree.rs, bloomfilter.rs, inverted.rs, inverted/index.rs
- json.rs, label_list.rs, ngram.rs, zonemap.rs
- lance_format.rs (implements the new trait method)
- lance/index.rs, append.rs, scalar.rs (simplified to use new trait)
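As an illustration of the new trait method, the sketch below defines a simplified, synchronous `IndexStore` with `list_files_with_sizes()` and an in-memory implementation. The real trait in lance-index is likely async and has many more methods; the signature shown here, the `InMemoryStore` type, and the `IndexFile` field names are assumptions.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one entry of `IndexMetadata.files`.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexFile {
    pub path: String,
    pub size_bytes: u64,
}

/// Simplified, synchronous stand-in for the `IndexStore` trait.
/// Only the method named in this commit is modeled.
pub trait IndexStore {
    fn list_files_with_sizes(&self) -> Vec<IndexFile>;
}

/// Toy store that keeps file contents in memory (illustration only).
#[derive(Default)]
pub struct InMemoryStore {
    files: BTreeMap<String, Vec<u8>>,
}

impl InMemoryStore {
    pub fn new() -> Self {
        Self::default()
    }

    /// Write (or overwrite) a file in the store.
    pub fn write(&mut self, path: &str, data: &[u8]) {
        self.files.insert(path.to_string(), data.to_vec());
    }
}

impl IndexStore for InMemoryStore {
    /// Report every file with its byte length, ordered by path
    /// (the ordering comes from `BTreeMap` iteration).
    fn list_files_with_sizes(&self) -> Vec<IndexFile> {
        self.files
            .iter()
            .map(|(path, data)| IndexFile {
                path: path.clone(),
                size_bytes: data.len() as u64,
            })
            .collect()
    }
}
```

Putting the listing on the store lets each index implementation record sizes with one call after writing, instead of tracking them file by file.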
Consolidate duplicate code introduced in the index file sizes feature:

- Remove duplicate `IndexFile` struct from lance-index, re-export from lance-table instead
- Add shared `list_index_files_with_sizes` function in lance-table
- Remove 5 duplicate implementations across lance-index and lance crates
- Remove unnecessary type conversion code in append.rs
- Remove deprecated `list_files_with_sizes_tuple` method
wjones127 (Contributor, Author) commented on Dec 17, 2025, on python/src/transaction.rs (outdated):

    index_version,
    created_at,
    base_id,
    files: None,

It seems like we need to add bindings in Python to allow users to fill this in.
Add Python bindings to allow users to set the `files` field on `Index` when committing indices via `LanceOperation.CreateIndex`. This enables tracking file sizes for index segments.

Changes:

- Add `IndexFile` dataclass to dataset.py
- Add `files` field to `Index` dataclass
- Add `PyLance<IndexFile>` `FromPyObject`/`IntoPyObject` implementations
- Update `IndexMetadata` bindings to handle `files` field
- Add test verifying `files` field round-trips through commit
Add tests to verify that querying indices uses minimal IOPs. Currently:

- BTREE and BITMAP indices pass (no HEAD requests)
- INVERTED and IVF_PQ indices have HEAD requests that need to be fixed

Also fix a pre-existing bug in dataset_migrations.rs (unwrap `record_batch` macro result).
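The idea behind such an IOPs test can be sketched with a store wrapper that counts HEAD requests. Everything here is hypothetical (`CountingStore`, `open_size`); it is not the PR's test harness, only an illustration of what "no HEAD requests" is asserting.

```rust
use std::cell::Cell;

/// Toy object store that counts HEAD requests (illustration only).
pub struct CountingStore {
    head_calls: Cell<usize>,
    size_bytes: u64,
}

impl CountingStore {
    pub fn new(size_bytes: u64) -> Self {
        Self { head_calls: Cell::new(0), size_bytes }
    }

    /// Simulated HEAD request: returns the object size and bumps the counter.
    pub fn head(&self, _path: &str) -> u64 {
        self.head_calls.set(self.head_calls.get() + 1);
        self.size_bytes
    }

    /// Number of HEAD requests issued so far.
    pub fn head_calls(&self) -> usize {
        self.head_calls.get()
    }
}

/// Resolve a file size, preferring the cached value from index metadata
/// and issuing a HEAD request only when no cached size is available.
pub fn open_size(store: &CountingStore, path: &str, cached: Option<u64>) -> u64 {
    cached.unwrap_or_else(|| store.head(path))
}
```

A test in this style opens an index with cached sizes and asserts that the HEAD counter stays at zero; the INVERTED and IVF_PQ failures mentioned above correspond to code paths that do not yet pass a cached size through.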
- IVF v2 indices: Pass `file_sizes` to `try_new()` to use cached sizes when opening index.idx and auxiliary.idx files
- `open_generic_index`: Use `files` metadata to determine if index is vector vs scalar, avoiding HEAD request to check index.idx existence
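A rough sketch of the second point: deciding vector versus scalar from the recorded file list instead of probing storage. The `IndexKind` enum, the `classify_index` function, and the rule "a file named index.idx means a vector index" are assumptions inferred from the commit message, not the actual logic in `open_generic_index`.

```rust
/// Hypothetical stand-in for one entry of `IndexMetadata.files`.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexFile {
    pub path: String,
    pub size_bytes: u64,
}

/// Outcome of inspecting the file list (names are illustrative).
#[derive(Debug, PartialEq)]
pub enum IndexKind {
    Vector,
    Scalar,
    /// No file list recorded; the caller must fall back to a HEAD request.
    Unknown,
}

/// Classify an index from its metadata alone.
/// Assumption: vector indices contain a file named `index.idx`.
pub fn classify_index(files: Option<&[IndexFile]>) -> IndexKind {
    match files {
        None => IndexKind::Unknown,
        Some(fs) if fs.iter().any(|f| f.path == "index.idx") => IndexKind::Vector,
        Some(_) => IndexKind::Scalar,
    }
}
```

The `Unknown` arm keeps indices without a `files` list working by deferring to the existing existence check.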
Closes #5226