@yuqi1129 yuqi1129 commented Dec 22, 2025

This pull request introduces a new "describe index" feature across the Rust, Java, and Python APIs, allowing users to retrieve detailed metadata and statistics about a specific index by name. The change includes the addition of a new IndexDescription class in Java, updates to the JNI and Rust layers to support the new API, and corresponding tests and documentation in all three languages.

The most important changes are:

API Additions:

  • Added a describeIndex(String indexName) method to the Java Dataset class, which returns a new IndexDescription object containing metadata and statistics for a specific index. This is supported by a new native JNI method and Rust FFI implementation. [1] [2]
  • Introduced a describe_index(self, index_name: str) method to the Python LanceDataset class, providing similar functionality for Python users. [1] [2]
  • Added a describe_index method to the Rust DatasetIndexExt trait, enabling retrieval of a single index's metadata without loading the full index.

New Data Structures:

  • Implemented a new IndexDescription class in Java to encapsulate index metadata (type, distance metric, indexed/unindexed row counts) with a builder pattern for construction.
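As a rough illustration of the metadata shape described above, here is a minimal Python sketch. The class name `IndexDescriptionSketch`, its field names, and the `coverage` helper are hypothetical and are not taken from this PR; the actual Java class uses a builder and may expose different accessors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexDescriptionSketch:
    """Hypothetical stand-in for the metadata fields listed above."""

    index_type: str                # e.g. "BTREE", "BITMAP", "INVERTED"
    distance_type: Optional[str]   # only meaningful for vector indices
    num_indexed_rows: int
    num_unindexed_rows: int

    @property
    def coverage(self) -> float:
        """Fraction of rows covered by the index (0.0 for an empty dataset)."""
        total = self.num_indexed_rows + self.num_unindexed_rows
        return self.num_indexed_rows / total if total else 0.0


# A scalar index that covers 900 of 1000 rows and has no distance metric.
desc = IndexDescriptionSketch("BTREE", None, 900, 100)
```

An immutable value object is used here only because the description is read-only metadata; the `coverage` property is an illustrative convenience and not something the PR claims to provide.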

Testing and Validation:

  • Added comprehensive tests for the new describe index functionality in Rust, Java, and Python to ensure correct behavior and error handling when describing existing and non-existent indices. [1] [2] [3]

Documentation and Imports:

  • Updated imports and documentation to reflect the new API and data structures in both Java and Python. [1] [2]

These changes make it much easier for users to programmatically inspect the properties and coverage of individual indices in a Lance dataset.

Fixed: #5553

Copilot AI review requested due to automatic review settings December 22, 2025 03:39
@github-actions bot added the enhancement, python, and java labels on Dec 22, 2025
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a describeIndex API method across Rust, Java, and Python to retrieve detailed metadata for a specific index by name, complementing the existing describe_indices method that returns all indices. The implementation provides index type, row coverage statistics, and distance metrics (for vector indices) without loading the full index into memory.

Key Changes:

  • Added describe_index method as a convenience wrapper that filters by index name
  • Introduced IndexDescription class in Java to encapsulate index metadata
  • Implemented comprehensive tests across all three language bindings

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| rust/lance-index/src/traits.rs | Added default implementation of describe_index trait method that filters describe_indices by name |
| rust/lance/src/dataset/tests/dataset_index.rs | Added comprehensive tests for BTree and Inverted indices with non-existent index handling |
| python/src/dataset.rs | Implemented Rust FFI binding that calls native describe_index and raises PyKeyError for missing indices |
| python/python/lance/dataset.py | Added Python API wrapper with docstring and error handling |
| python/python/tests/test_scalar_index.py | Added tests covering INVERTED, BITMAP, and BTREE indices plus error case |
| java/src/main/java/org/lance/index/IndexDescription.java | New class with builder pattern for index metadata (type, distance, row counts) |
| java/src/main/java/org/lance/Dataset.java | Added public describeIndex method with validation and locking |
| java/lance-jni/src/blocking_dataset.rs | Implemented JNI native method with index type detection and row count calculation |
| java/src/test/java/org/lance/ScalarIndexTest.java | Added test for BTree index description |

@yuqi1129
Author

Can anyone help me figure out what the problem is here?

Auto-detected mode: agent for event: pull_request
Using provided GITHUB_TOKEN for authentication
Checking permissions for actor: yuqi1129
Permission level retrieved: read
Warning: Actor has insufficient permissions: read
Error: Prepare step failed with error: Actor does not have write permissions to the repository
Error: Process completed with exit code 1.

What should I do to solve the problem?

@yuqi1129
Author

@majin1102 @yanghua could you please help review this one?

Contributor

@majin1102 majin1102 left a comment

Thanks for working on this. I left one comment.

    index_name: &str,
) -> Result<Option<Arc<dyn IndexDescription>>> {
    let indices = self.describe_indices(None).await?;
    Ok(indices.into_iter().find(|idx| idx.name() == index_name))
Contributor

@majin1102 majin1102 Dec 24, 2025

It seems we still need to load the full index metadata, likely because the underlying protobuf spec doesn’t support partial loading. I’m not sure this API is worth exposing unless we have solid use cases for which describe_indices is not suitable.
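To make the concern concrete, the quoted default implementation can be sketched in Python as a filter over the full listing. Everything below is a stand-in under stated assumptions: `Desc`, `describe_indices_sketch`, `find_index`, and `describe_index_or_raise` are hypothetical names, not Lance APIs. The last function mirrors the behavior the file summary attributes to the Python binding, which raises a key error for a missing index.

```python
from typing import List, NamedTuple, Optional


class Desc(NamedTuple):
    """Hypothetical, trimmed-down index description."""
    name: str
    index_type: str


def describe_indices_sketch() -> List[Desc]:
    # Stand-in for the full listing: every description is materialized here,
    # which is the cost the review comment points out.
    return [Desc("id_idx", "BTREE"), Desc("text_idx", "INVERTED")]


def find_index(index_name: str) -> Optional[Desc]:
    # Filter the full listing by name; None when nothing matches.
    return next(
        (d for d in describe_indices_sketch() if d.name == index_name), None
    )


def describe_index_or_raise(index_name: str) -> Desc:
    # Wrapper that turns "not found" into KeyError for callers.
    desc = find_index(index_name)
    if desc is None:
        raise KeyError(f"index not found: {index_name}")
    return desc
```

Because the lookup is a linear scan over the complete listing, it saves the caller a loop but does not reduce how much metadata is loaded.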

Author

We are going to put Lance's REST APIs into practice, see: https://lance.org/format/namespace/integrations/gravitino/#option-1-native-lance-rest-support. Indexes are also key components of a Lance table, so I raised this PR to support more metadata operations.

Regarding "It seems we still need to load the full index metadata": let me check whether we can optimize it.

Contributor

> We are going to push Lance's REST APIs, see: https://lance.org/format/namespace/integrations/gravitino/#option-1-native-lance-rest-support

That's good to hear.

I’m not sure about the API itself, but I’m not opposed to it.
Thanks for this contribution.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.


Comment on lines +27 to +29
public static final String JSON_PROPERTY_DISTANCE_TYPE = "distance_type";
@Nullable
private String distanceType;

Copilot AI Dec 24, 2025

The distanceType field in IndexDescription is never populated in the JNI implementation. While the test correctly expects it to be null for scalar indices, this field will also be null for vector indices where it should contain meaningful data (e.g., "l2", "cosine", "dot").

For vector indices, the distance type information is stored in the index details (as a protobuf Any) and would need to be extracted and set during the construction of the IndexDescription object in the JNI layer. Consider parsing this information from the index details for vector indices to make this field useful, or alternatively document that this field is not yet implemented and will always return null.
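The intended behavior can be sketched in Python as follows. This is only an illustration under assumptions: `resolve_distance_type` and its parameters are hypothetical, the metric is assumed to have already been extracted from the index details (the protobuf parsing is not shown), and raising on a missing metric is one possible design choice, not what the PR does.

```python
from typing import Optional


def resolve_distance_type(
    is_vector_index: bool, metric_from_details: Optional[str]
) -> Optional[str]:
    """Return the distance type to expose for an index description.

    Hypothetical helper: scalar indices have no metric, so they map to None;
    vector indices are expected to carry one (e.g. "l2", "cosine", "dot").
    """
    if not is_vector_index:
        return None
    if metric_from_details is None:
        # Assumed policy: surface the gap instead of silently returning None.
        raise ValueError("vector index details carry no distance metric")
    return metric_from_details.lower()
```

With a rule like this, a null value would mean "scalar index", not "not implemented", which is the ambiguity the comment describes.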

Labels

enhancement, java, python

Development

Successfully merging this pull request may close these issues.

Support describeIndex in Dataset API

2 participants