feat: add file list with sizes to IndexMetadata #5497

Draft
wjones127 wants to merge 10 commits into lance-format:main from wjones127:feature/index-file-sizes-v2

Conversation

@wjones127
Contributor

Closes #5226

wjones127 and others added 6 commits December 16, 2025 10:33
Add a new `files` field to `IndexMetadata` that stores a list of all
files (with sizes) per index segment. This enables:

1. Skipping HEAD calls when opening indices by using cached file sizes
2. Exposing total index size via `describe_indices()` API

The file sizes are captured during index creation and stored in the
manifest. When opening indices, the cached sizes are used to avoid
expensive HEAD calls to cloud storage.
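
The two uses listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in this PR (which is Rust): the `IndexFile` field names `path` and `size_bytes`, and the helpers `total_index_size` and `cached_size`, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IndexFile:
    """Hypothetical record: one file of an index segment and its size."""
    path: str
    size_bytes: int


def total_index_size(files: Optional[List[IndexFile]]) -> Optional[int]:
    """Sum the recorded sizes; None means the metadata has no file list."""
    if files is None:
        return None
    return sum(f.size_bytes for f in files)


def cached_size(files: Optional[List[IndexFile]], path: str) -> Optional[int]:
    """Look up a recorded size so the caller can skip a HEAD request.

    Returns None when the size is unknown and storage must be asked.
    """
    if files is None:
        return None
    for f in files:
        if f.path == path:
            return f.size_bytes
    return None
```

Keeping "no file list" (`None`) distinct from "empty file list" lets older metadata be recognised without guessing.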

Fixes lance-format#5226

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add backwards compatibility support for indices created before the
`files` field was added to IndexMetadata:

- Add migration in `migrate_indices()` to collect file sizes for indices
  missing them during write operations
- Add test data (`pre_file_sizes/index_without_file_sizes`) created with
  lance 2.0.0-beta.1 which doesn't have the files field
- Add `test_index_without_file_sizes` to verify old indices still work
- Add `test_index_file_size_migration` to verify migration works
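
As a rough Python sketch of the migration step described above (the real `migrate_indices()` is Rust and its signature is not shown here), the idea is to fill in the file list only for entries that lack one. The dictionary keys `uuid` and `files` and the `list_files` callback are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# (relative path, size in bytes) pairs; a stand-in for the real file record.
FileList = List[Tuple[str, int]]


def migrate_missing_file_sizes(
    indices: List[Dict],
    list_files: Callable[[str], FileList],
) -> int:
    """Populate 'files' for index entries that do not have it yet.

    `list_files` is a hypothetical callback that lists the files of one
    index (identified by uuid) together with their sizes. Entries that
    already carry a file list are left untouched. Returns the number of
    entries that were migrated.
    """
    migrated = 0
    for meta in indices:
        if meta.get("files") is None:
            meta["files"] = list_files(meta["uuid"])
            migrated += 1
    return migrated
```

Because the listing happens only for entries without a file list, datasets written by a newer version pay no extra cost on later writes.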

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, file sizes were only captured during initial index creation.
This change adds file size capture to all remaining code paths:

- index/append.rs: delta append merge for scalar and vector indices
- index/vector.rs: initialize_vector_index for cross-dataset index init
- index.rs: remap_index for both scalar and vector index remapping
- dataset/optimize/remapping.rs: added files field to RemappedIndex

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests that index file sizes are populated correctly through:
- Initial index creation (BTree, Bitmap, Inverted)
- optimize_indices (update/merge operations)
- Index remapping after compaction

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
All scalar index implementations now populate the `files` field with
actual file sizes when creating, updating, or remapping indices. This
change adds the `list_files_with_sizes()` method to the `IndexStore`
trait and updates all implementations to use it.
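
To make the shape of such a method concrete, here is a minimal Python sketch for a store backed by a local directory. The class `LocalIndexStore` and the `(path, size)` tuple return type are assumptions for illustration; the actual `IndexStore` trait is Rust and its return type is not shown in this conversation.

```python
from pathlib import Path
from typing import List, Tuple


class LocalIndexStore:
    """Hypothetical store that keeps one index in a local directory."""

    def __init__(self, index_dir: str) -> None:
        self.index_dir = Path(index_dir)

    def list_files_with_sizes(self) -> List[Tuple[str, int]]:
        """Return (relative path, size in bytes) for every file in the index."""
        entries = []
        for p in self.index_dir.rglob("*"):
            if p.is_file():
                rel = p.relative_to(self.index_dir).as_posix()
                entries.append((rel, p.stat().st_size))
        # Sort so the result does not depend on directory iteration order.
        entries.sort()
        return entries
```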

Updated files:
- bitmap.rs, btree.rs, bloomfilter.rs, inverted.rs, inverted/index.rs
- json.rs, label_list.rs, ngram.rs, zonemap.rs
- lance_format.rs (implements the new trait method)
- lance/index.rs, append.rs, scalar.rs (simplified to use new trait)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Consolidate duplicate code introduced in the index file sizes feature:

- Remove duplicate IndexFile struct from lance-index, re-export from
  lance-table instead
- Add shared list_index_files_with_sizes function in lance-table
- Remove 5 duplicate implementations across lance-index and lance crates
- Remove unnecessary type conversion code in append.rs
- Remove deprecated list_files_with_sizes_tuple method

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions github-actions bot added the enhancement and python labels Dec 16, 2025
index_version,
created_at,
base_id,
files: None,
Contributor Author

It seems like we need to add bindings in Python to allow users to fill this in.

wjones127 and others added 3 commits December 17, 2025 12:10
Add Python bindings to allow users to set the `files` field on Index
when committing indices via LanceOperation.CreateIndex. This enables
tracking file sizes for index segments.

Changes:
- Add IndexFile dataclass to dataset.py
- Add files field to Index dataclass
- Add PyLance<IndexFile> FromPyObject/IntoPyObject implementations
- Update IndexMetadata bindings to handle files field
- Add test verifying files field round-trips through commit

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests to verify that querying indices uses minimal IOPs. Currently:
- BTREE and BITMAP indices pass (no HEAD requests)
- INVERTED and IVF_PQ indices have HEAD requests that need to be fixed
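
The style of check described above can be sketched with an in-memory store that counts requests. Everything here (`CountingStore`, `read_footer`, the `cached_size` parameter) is a hypothetical stand-in written for this example, not the test harness used in the PR.

```python
from typing import Dict, Optional


class CountingStore:
    """In-memory stand-in for an object store that counts each request type."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = dict(objects)
        self.head_calls = 0
        self.get_calls = 0

    def head(self, path: str) -> int:
        """Return the object size, counting the request."""
        self.head_calls += 1
        return len(self._objects[path])

    def get_range(self, path: str, start: int, end: int) -> bytes:
        """Return bytes [start, end), counting the request."""
        self.get_calls += 1
        return self._objects[path][start:end]


def read_footer(
    store: CountingStore,
    path: str,
    footer_len: int,
    cached_size: Optional[int] = None,
) -> bytes:
    """Read the last `footer_len` bytes, using a HEAD only if no size is cached."""
    size = cached_size if cached_size is not None else store.head(path)
    start = max(0, size - footer_len)
    return store.get_range(path, start, size)
```

A test built this way asserts that `head_calls` stays at zero when the size comes from metadata, which is the property the BTREE and BITMAP cases already satisfy.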

Also fix a pre-existing bug in dataset_migrations.rs (unwrap record_batch macro result).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- IVF v2 indices: Pass file_sizes to try_new() to use cached sizes
  when opening index.idx and auxiliary.idx files
- open_generic_index: Use files metadata to determine if index is
  vector vs scalar, avoiding HEAD request to check index.idx existence
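
A minimal Python sketch of the second point, assuming (as the commit message suggests) that the presence of `index.idx` in the recorded file list marks a vector index. The function name `is_vector_index` and the three-way return value are illustrative choices, not the actual `open_generic_index` logic.

```python
from typing import Iterable, Optional

# File whose presence is assumed to mark a vector index.
VECTOR_INDEX_FILE = "index.idx"


def is_vector_index(file_paths: Optional[Iterable[str]]) -> Optional[bool]:
    """Classify an index from its recorded file list.

    Returns True or False when a file list is available, and None when
    it is not, in which case the caller still has to probe storage.
    """
    if file_paths is None:
        return None
    # Compare on the final path component so nested paths also match.
    return any(p.rsplit("/", 1)[-1] == VECTOR_INDEX_FILE for p in file_paths)
```

Returning `None` rather than `False` for missing metadata keeps the existing storage probe as the fallback for indices written before the `files` field existed.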

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Labels

enhancement (New feature or request), python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add file list to index metadata

1 participant