
Conversation

@jaystarshot
Contributor

@jaystarshot jaystarshot commented Sep 3, 2025

BKD-based index, only for points to start with.
Only intercepts the st_intersect function. I am debating whether to add interception of other functions in this PR or in future PRs. Adding it here might be more concise and correctness could be tested in detail, but the PR could become bloated.

Storage

  1. bkd_tree_inner.lance - All internal node metadata
  2. bkd_tree_leaf.lance - All leaf node metadata (pointers only)
  3. Multiple leaves grouped per file (controlled by batches_per_file and entries per leaf parameters)

This might also make future incremental indexing easier (we can rewrite the metadata and select leaves)
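The grouping of leaves into files (controlled by batches_per_file) amounts to a simple index computation. A minimal sketch; the names here (InnerNodeMeta, LeafPointer, leaf_pointer) are illustrative, not the PR's actual types:

```rust
// Hypothetical sketch of the two-file layout above; struct and function
// names are illustrative, not the PR's actual types.

/// One row in bkd_tree_inner.lance: metadata for an internal node.
#[allow(dead_code)]
struct InnerNodeMeta {
    node_id: u32,
    split_dim: u8, // 0 = x, 1 = y
    split_value: f64,
}

/// One row in bkd_tree_leaf.lance: a pointer only; the points live in
/// separate leaf files, with several leaves grouped per file.
struct LeafPointer {
    file_id: u32,   // which leaf file
    batch_idx: u32, // record batch inside that file
}

/// With `batches_per_file` leaves grouped per file, leaf `i` lands at:
fn leaf_pointer(i: u32, batches_per_file: u32) -> LeafPointer {
    LeafPointer {
        file_id: i / batches_per_file,
        batch_idx: i % batches_per_file,
    }
}

fn main() {
    let p = leaf_pointer(7, 3);
    // Leaf 7 with 3 batches per file lives in file 2, batch 1.
    assert_eq!((p.file_id, p.batch_idx), (2, 1));
}
```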

Lucene code

doc1

javadoc

#4482 (comment)

@github-actions github-actions bot added the python label Sep 3, 2025
@github-actions
Contributor

github-actions bot commented Sep 3, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

Contributor

@jackye1995 jackye1995 left a comment

Added some initial comments to reorganize the code structure to fit the Lance code style

Contributor

@beinan beinan left a comment

lgtm, thanks!

@ddupg
Contributor

ddupg commented Sep 4, 2025

@jaystarshot Nice work, super excited to see this PR. Regarding the implementation of the geospatial index, I have a few thoughts:

  1. The current approach of keeping the entire R-tree in memory during index generation, serializing it as a single binary, and loading this binary into memory during queries might cause significant memory pressure.
  2. Since rstar is designed for dynamic mutable implementations, for Lance’s use case, a static immutable implementation might be more appropriate. This could potentially reduce storage footprint and might offer better query performance.

I’m currently working on a POC for geospatial index implementation. The design philosophy is to implement an immutable R-tree directly within Lance. During index generation, data will be streamed into the target file, and query operations will only load required R-tree nodes on-demand.

@jaystarshot
Contributor Author

jaystarshot commented Sep 4, 2025

Makes sense, thanks a lot! I wonder if we can use any implementations from https://ieeexplore.ieee.org/document/4221678 or related papers, which must already have tackled distributed R-trees

@jaystarshot
Contributor Author

jaystarshot commented Sep 4, 2025

@ddupg Is this same as what you are suggesting?

  1. Keep a separate metadata file
    This file describes the tree: which nodes exist, their bounding boxes, and where each node is stored in the data file. (offset + file)

  2. Split the data into nodes
    Group entries into nodes based on their bounding boxes, so each node covers a specific region of space.

  3. Store each node at a fixed file offset
    Write every node into the data file and remember its offset.

  4. Filter at query time
    When running a query, look at the metadata to find only the nodes that overlap the query region. Load just those nodes from disk, instead of loading everything.

If so, then I am wondering if we can use any existing algorithms from rstar or similar libraries to

  1. reuse partitioning
  2. reuse traversal
  3. reuse serialization
  4. and, if possible, make it self-balancing at the same time
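The four-step scheme above could be sketched roughly like this (a hypothetical sketch; NodeMeta, Bbox, and nodes_to_load are illustrative names, not actual Lance types): the metadata file lists every node's bounding box plus where its data lives, and a query consults only that metadata before loading the few overlapping nodes from disk.

```rust
// Illustrative sketch of the metadata-file + offset scheme; not real Lance types.

#[derive(Clone, Copy)]
struct Bbox { min_x: f64, min_y: f64, max_x: f64, max_y: f64 }

impl Bbox {
    fn intersects(&self, o: &Bbox) -> bool {
        self.min_x <= o.max_x && self.max_x >= o.min_x
            && self.min_y <= o.max_y && self.max_y >= o.min_y
    }
}

/// Step 1: one entry per node, with its region and its (file, offset) location.
#[allow(dead_code)]
struct NodeMeta {
    bbox: Bbox,
    file: &'static str,
    offset: u64,
}

/// Step 4: filter on metadata only; the caller then reads just these nodes.
fn nodes_to_load<'a>(meta: &'a [NodeMeta], query: &Bbox) -> Vec<&'a NodeMeta> {
    meta.iter().filter(|n| n.bbox.intersects(query)).collect()
}

fn main() {
    let meta = [
        NodeMeta { bbox: Bbox { min_x: 0.0, min_y: 0.0, max_x: 1.0, max_y: 1.0 },
                   file: "data.lance", offset: 0 },
        NodeMeta { bbox: Bbox { min_x: 5.0, min_y: 5.0, max_x: 6.0, max_y: 6.0 },
                   file: "data.lance", offset: 4096 },
    ];
    let query = Bbox { min_x: 0.5, min_y: 0.5, max_x: 2.0, max_y: 2.0 };
    // Only the first node overlaps the query region, so only it gets loaded.
    assert_eq!(nodes_to_load(&meta, &query).len(), 1);
}
```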

@jackye1995
Contributor

Keep a separate metadata file
This file describes the tree: which nodes exist, their bounding boxes, and where each node is stored in the data file. (offset + file)

You can add the metadata as part of the RTreeIndexDetails, which is currently an empty protobuf definition. The details will be stored as part of the index metadata and persisted in the manifest, so you don't need to do another IO to fetch this information.

Split the data into nodes. Group entries into nodes based on their bounding boxes, so each node covers a specific region of space.

You can take a look at the btree index implementation, which does something similar.

@jaystarshot
Contributor Author

jaystarshot commented Sep 4, 2025

I think even the B+ tree has the details empty, so maybe we can do that later.
@jackye1995 @ddupg can you guys check this file which I added in this commit

the logs with the new test - link

These logs show that 3 files were generated and only 2 queried.

I used the rstar interface to simulate lazy loading of the leaf pages (with leaf references implementing the envelope() method), so rstar filters the leaf nodes to check and we load only those. All leaf nodes are stored in separate files

So here we use rstar's tree traversal and lazy-load the leaves later.
I guess that layout can be controlled later (separate files vs. one file with offsets)

@jaystarshot
Contributor Author

@ddupg @jackye1995 will work on more testing + productionizing this, since I believe the main problem of a distributed R-tree is fixed by this approach

@jackye1995
Contributor

Sounds good, let me know whenever you want another pass of review. Also, you probably want to rebase asap since we did some major refactoring; it will get more and more difficult to rebase the longer you wait.

@jaystarshot
Contributor Author

jaystarshot commented Sep 8, 2025

I think there are three top workstreams

  1. Support geo functions in Lance
  2. Intercept the geo functions into geo queries on Indexes
  3. Index building and retrieving

For 1,
@ddupg I saw that you opened a discussion in GeoDataFusion here. I believe they have the DataFusion implementations, but the library is not published. That would be very promising; it would save us from having to natively implement spatial functions in Lance for the fallback (no-index) path.

In the meantime, do you think it makes sense to borrow the simple DataFusion UDF implementations (e.g., ST_Within, ST_distance) into Lance so that we unblock 1 and we can build and validate the full index flow end-to-end? Then, once GeoDataFusion stabilizes, we could replace our local copies and directly reference their implementation.

What do you think?
cc: @jackye1995

@jackye1995
Contributor

In the meantime, do you think it makes sense to borrow the simple DataFusion UDF implementations (e.g., ST_Within, ST_distance) into Lance so that we unblock 1 and we can build and validate the full index flow end-to-end? Then, once GeoDataFusion stabilizes, we could replace our local copies and directly reference their implementation.

Agree. also cc @timsaucer in case you are interested in this topic.

@timsaucer
Contributor

In the meantime, do you think it makes sense to borrow the simple DataFusion UDF implementations (e.g., ST_Within, ST_distance) into Lance so that we unblock 1 and we can build and validate the full index flow end-to-end? Then, once GeoDataFusion stabilizes, we could replace our local copies and directly reference their implementation.

Agree. also cc @timsaucer in case you are interested in this topic.

Yes, thank you! I'm definitely interested. Also to fully support the geo functions I think we need to upgrade DataFusion. I have #4598 for this.

@jaystarshot
Contributor Author

Wow thanks!

@jaystarshot
Contributor Author

jaystarshot commented Sep 9, 2025

The distributed index is still not complete; we should somehow let the library create the partitions and then serialize them. Currently we partition manually and give the bounds to rstar, which won't be efficient during queries

@jaystarshot jaystarshot force-pushed the jay-geo branch 4 times, most recently from 2e16fb9 to 655f680, September 9, 2025 04:34
@ddupg
Contributor

ddupg commented Sep 9, 2025

In my personal view, the implementation of geospatial expressions and indexing should be grounded in well-defined geospatial types. To ensure clarity and technical robustness, I propose first introducing foundational geospatial data types (e.g., Point, LineString, Polygon) into Lance.

I submitted #4678 to introduce geospatial type support. However, it is also blocked due to datafusion upgrade.

@timsaucer
Contributor

If you need Arrow 56 for #4678 then DF50, which will probably come out next week, will bring that in.

@jaystarshot
Contributor Author

jaystarshot commented Sep 9, 2025

In my personal view, the implementation of geospatial expressions and indexing should be grounded in well-defined geospatial types. To ensure clarity and technical robustness, I propose first introducing foundational geospatial data types (e.g., Point, LineString, Polygon) into Lance.

I might be missing something, but GeoArrow types are just metadata extensions on list and struct, and that PR only seems to be adding DataFusion functions, not Lance-native geo types

@kylebarron

2. Since rstar is designed for dynamic mutable implementations, for Lance’s use case, a static immutable implementation might be more appropriate. This could potentially reduce storage footprint and might offer better query performance.

You may want to consider https://github.com/kylebarron/geo-index. It's a port of https://github.com/mourner/flatbush and is explicitly designed to be serializable and very memory efficient. You may be interested in this explainer I wrote on how it works: https://kylebarron.dev/literate-flatbush/

The primary downside is that it only supports 2 dimensions, so if you absolutely must index on more than 2 dimensions, it may not fit your needs.

@jaystarshot
Contributor Author

Thanks a lot will check it out

@jaystarshot
Contributor Author

jaystarshot commented Sep 10, 2025

@kylebarron I don't think this supports non-bulk loading (point deserialization), i.e., loading the entries into memory on the fly during a query.

The Lance index reader is built for Arrow record batches, so a tree/geo index partitioned into record batches that could be fetched during retrieval (similar to the btree index) would be the best solution

pub trait IndexReader: Send + Sync {
    /// Read the n-th record batch from the file
    async fn read_record_batch(&self, n: u64, batch_size: u64) -> Result<RecordBatch>;

    /// Read the range of rows from the file.
    /// If projection is Some, only return the columns in the projection,
    /// nested columns like Some(&["x.y"]) are not supported.
    /// If projection is None, return all columns.
    async fn read_range(
        &self,
        range: std::ops::Range<usize>,
        projection: Option<&[&str]>,
    ) -> Result<RecordBatch>;

    /// Return the number of batches in the file
    async fn num_batches(&self, batch_size: u64) -> u32;
    /// Return the number of rows in the file
    fn num_rows(&self) -> usize;
    /// Return the metadata of the file
    fn schema(&self) -> &lance_core::datatypes::Schema;
}

@kylebarron

(point deserialization).

What do you mean by this?

Don't think this supports non bulk loading

Are you talking about during tree creation or re-loading the index from persisted bytes?

If it's the former, you can stream tree creation in the sense that you instantiate the builder once, and have total control over how/when you add data into the tree.

Like any(?) static index, you need to know how many rows you have, but that shouldn't be too big of a problem?

If it's fully streaming where you don't know how many record batches you'll have, then you could have a bi-level index where each RecordBatch is associated with an individual tree, then you have an index of each record batch. This is closer to what GeoParquet has, where you have only a "top-level" index in the file metadata that tells the bounding box of each RecordBatch. But in the case of GeoParquet there's no within-batch index.
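The bi-level scheme described here could look roughly like the following (illustrative names only, not a real Lance or geo-index API): the file metadata keeps one bounding box per RecordBatch, like GeoParquet's top-level index, a query first prunes whole batches, and only the survivors' per-batch indexes need to be fetched.

```rust
// Sketch of a bi-level spatial index: a top-level list of per-batch bounding
// boxes prunes whole record batches before any per-batch tree is loaded.
// All names here are illustrative.

#[derive(Clone, Copy)]
struct Bbox { min_x: f64, min_y: f64, max_x: f64, max_y: f64 }

impl Bbox {
    fn intersects(&self, o: &Bbox) -> bool {
        self.min_x <= o.max_x && self.max_x >= o.min_x
            && self.min_y <= o.max_y && self.max_y >= o.min_y
    }
}

/// Top level: prune record batches by their stored bounding boxes.
fn candidate_batches(batch_bboxes: &[Bbox], query: &Bbox) -> Vec<usize> {
    batch_bboxes
        .iter()
        .enumerate()
        .filter(|(_, b)| b.intersects(query))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let batches = [
        Bbox { min_x: 0.0, min_y: 0.0, max_x: 10.0, max_y: 10.0 },
        Bbox { min_x: 50.0, min_y: 50.0, max_x: 60.0, max_y: 60.0 },
        Bbox { min_x: 5.0, min_y: 5.0, max_x: 15.0, max_y: 15.0 },
    ];
    let query = Bbox { min_x: 8.0, min_y: 8.0, max_x: 9.0, max_y: 9.0 };
    // Batches 0 and 2 overlap; batch 1 is pruned without ever being fetched.
    assert_eq!(candidate_batches(&batches, &query), vec![0, 2]);
}
```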

Lance index reader is built for arrow record batches so a tree / geo index partitioned into record batches which could be fetched during retrieval(similar to btree) would be the best solution

So you want an index per record batch? I don't know if I'm fully understanding you.

@jaystarshot
Contributor Author

jaystarshot commented Sep 10, 2025

Thanks! I was actually more concerned about the latter, reloading the index from persisted bytes. Specifically, being able to fetch and decode just the index nodes/batches touched by a query (like page access in a B-tree), since the index may become very large.

In that case, we’d want one or a small number of index nodes per Arrow record batch so they’re independently fetchable and the tree structure can be traversed to locate only the relevant nodes.

@kylebarron

I see, some more background:

The geo-index/flatbush "format" was originally designed in the flatbush JS library. That library wanted to make the buffer ABI stable so that it could be transferred in the browser between the main thread and worker threads without a JS structured clone.

It would be possible to devise a streaming reader for the format, such that you can navigate the tree without needing to load the entire buffer in advance. But the layout of the buffer wasn't particularly important for JS browser workloads, so it was designed probably in a way that made it simplest to fill.

The downside of a streaming reader is that the metadata is located in the header at the beginning of the file, while the tree data is laid out bottom to top. So you'd need to fetch data both from the beginning and the end, and it wouldn't be ideal for high-latency network reads.

This is why having a multi-level index might be nicer, so that each individual index can be small enough to just load into memory.

In related work, the FlatGeobuf file format devised a custom fork of the same flatbush RTree logic that they use in their geospatial file format. The primary difference is that they invert the tree so that they can incrementally stream portions of the tree across a network. But this index is just an implementation detail of their format, and wouldn't be possible to reuse.

My primary reason for hesitancy about rstar is that it truly ties the format to a single, specific Rust implementation. (And how are you serializing the rstar object? Is it ABI stable?) The nice thing about geo-index is that it's much closer to a standalone "spec", so that if you ever needed a way to open Lance data outside of Rust, it wouldn't be nearly as hard.

@hyz0906 hyz0906 mentioned this pull request Sep 11, 2025
@jaystarshot
Contributor Author

jaystarshot commented Oct 15, 2025

I think we should do something like Lucene. I am not an expert in BKD trees, but it sounds like each level splits the space on the x or y coordinate, alternating, and so on.

We can store all the node metadata, like Lucene does, in a separate file (bkd_tree.lance) and the leaves separately in leaf_*.lance files

Lucene code

doc1

javadoc

message BloomFilterIndexDetails {}

message GeoIndexDetails {}
Contributor

I think we should call this something more specific like "KdbTreeIndex" instead of a generic Geo index, so that we can have different types of geo indexes that can be used for different Geo columns if necessary

static BBOX_UDF: LazyLock<ScalarUDF> = LazyLock::new(bbox);
static ST_INTERSECTS_UDF: LazyLock<ScalarUDF> = LazyLock::new(st_intersects);

fn st_intersects() -> ScalarUDF {
Contributor

I think we should directly use functions in geodatafusion

Contributor Author

If that PR is merged, it's just a rebase. Also, I remember there was some issue in the DataFusion Python API where mixed-case functions couldn't be used.


/// In-memory BKD tree structure for efficient spatial queries
#[derive(Debug, DeepSizeOf)]
pub struct BKDTreeLookup {
Contributor

nit: we prefer camel-case style for abbreviations, for example Hnsw instead of HNSW. So could you update BKD-related terms to Bkd?

@@ -0,0 +1,301 @@
#!/usr/bin/env python3
Contributor

These should be moved into the Python module as official tests.

@codecov-commenter

codecov-commenter commented Oct 16, 2025

Codecov Report

❌ Patch coverage is 77.18204% with 549 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.63%. Comparing base (1b09706) to head (2ac6dca).
⚠️ Report is 32 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/geo/bkdtree.rs 76.43% 315 Missing and 70 partials ⚠️
rust/lance-index/src/scalar/expression.rs 0.00% 84 Missing ⚠️
rust/lance-index/src/scalar/geo/bkd.rs 92.28% 20 Missing and 24 partials ⚠️
rust/lance-index/src/scalar.rs 14.28% 18 Missing ⚠️
rust/lance-datafusion/src/udf.rs 83.69% 15 Missing ⚠️
rust/lance-index/src/lib.rs 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4632      +/-   ##
==========================================
- Coverage   81.70%   81.63%   -0.07%     
==========================================
  Files         334      342       +8     
  Lines      133293   139639    +6346     
  Branches   133293   139639    +6346     
==========================================
+ Hits       108902   113996    +5094     
- Misses      20736    21827    +1091     
- Partials     3655     3816     +161     
Flag Coverage Δ
unittests 81.63% <77.18%> (-0.07%) ⬇️


@jaystarshot
Contributor Author

jaystarshot commented Oct 17, 2025

@jaystarshot Nice work, super excited to see this PR. Regarding the implementation of the geospatial index, I have a few thoughts:

  1. The current approach of keeping the entire R-tree in memory during index generation, serializing it as a single binary, and loading this binary into memory during queries might cause significant memory pressure.
  2. Since rstar is designed for dynamic mutable implementations, for Lance’s use case, a static immutable implementation might be more appropriate. This could potentially reduce storage footprint and might offer better query performance.

I’m currently working on a POC for geospatial index implementation. The design philosophy is to implement an immutable R-tree directly within Lance. During index generation, data will be streamed into the target file, and query operations will only load required R-tree nodes on-demand.

@ddupg What do you think of the BKD tree? I have added some unit tests and a benchmark, and the index code seems okay to me

@jaystarshot jaystarshot changed the title [WIP] Support Geo Indexes (RTree) in lance Support Point KDB Tree Geo Index in lance Oct 17, 2025
@jaystarshot jaystarshot changed the title Support Point KDB Tree Geo Index in lance Support Point BKD Tree Geo Index in lance Oct 17, 2025
@jackye1995 jackye1995 changed the title Support Point BKD Tree Geo Index in lance feat: support Point BKD Tree Geo Index Oct 17, 2025
@github-actions github-actions bot added the enhancement New feature or request label Oct 17, 2025
@jaystarshot jaystarshot marked this pull request as ready for review October 20, 2025 16:14
@jaystarshot jaystarshot changed the title feat: support Point BKD Tree Geo Index feat: support Point BKD Tree Geo Index and st_within interception Oct 20, 2025
@ddupg
Contributor

ddupg commented Oct 22, 2025

@jaystarshot Nice work, super excited to see this PR. Regarding the implementation of the geospatial index, I have a few thoughts:

  1. The current approach of keeping the entire R-tree in memory during index generation, serializing it as a single binary, and loading this binary into memory during queries might cause significant memory pressure.
  2. Since rstar is designed for dynamic mutable implementations, for Lance’s use case, a static immutable implementation might be more appropriate. This could potentially reduce storage footprint and might offer better query performance.

I’m currently working on a POC for geospatial index implementation. The design philosophy is to implement an immutable R-tree directly within Lance. During index generation, data will be streamed into the target file, and query operations will only load required R-tree nodes on-demand.

@ddupg What do you think of the BKD tree? I have added some unit tests and a benchmark, and the index code seems okay to me

Could you @jaystarshot elaborate further on why you no longer use an R-tree and instead chose a BKD tree? I'm currently reviewing this PR, and BKD trees are quite complex, so I'll need some more time.

Regarding geo types and functions, I suggest we directly introduce geoarrow-rs and geodatafusion. This approach enables rich geo types and functions and avoids getting bogged down in complex geospatial knowledge. I implemented the RTree in #5034 precisely by doing this. Can we learn from each other's approaches to implement both RTree and BKD tree geo indexes?

@jaystarshot
Contributor Author

Thanks! I moved there because Lucene and OpenSearch both use a BKD tree (as linked) and our use case is closer to theirs. Also, it's pretty straightforward, tbh: at each level we sort all the points in one dimension and create two children divided by the median, until we reach the leaf-size threshold (as an optimization, sorting is done only once).
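A minimal sketch of that build step, under stated simplifications: it re-sorts at every level (ignoring the sort-once optimization mentioned above) and uses illustrative types rather than the PR's actual ones.

```rust
// Illustrative BKD-style build: sort on one dimension, split at the median,
// alternate dimensions, recurse until a leaf fits under `leaf_size`.
// (Lucene's BKD writer avoids re-sorting at every level; this sketch does not.)

type Point = (f64, f64, u64); // (x, y, row_id)

enum Node {
    Inner {
        split_dim: usize,
        split_value: f64,
        left: Box<Node>,
        right: Box<Node>,
    },
    Leaf(Vec<Point>),
}

fn build(mut points: Vec<Point>, dim: usize, leaf_size: usize) -> Node {
    if points.len() <= leaf_size {
        return Node::Leaf(points);
    }
    // Sort on the current dimension (x for dim 0, y for dim 1).
    points.sort_by(|a, b| {
        let (ka, kb) = if dim == 0 { (a.0, b.0) } else { (a.1, b.1) };
        ka.partial_cmp(&kb).unwrap()
    });
    let mid = points.len() / 2;
    let split_value = if dim == 0 { points[mid].0 } else { points[mid].1 };
    let right = points.split_off(mid); // points keeps the left half
    Node::Inner {
        split_dim: dim,
        split_value,
        left: Box::new(build(points, (dim + 1) % 2, leaf_size)),
        right: Box::new(build(right, (dim + 1) % 2, leaf_size)),
    }
}

fn count_leaves(n: &Node) -> usize {
    match n {
        Node::Leaf(_) => 1,
        Node::Inner { left, right, .. } => count_leaves(left) + count_leaves(right),
    }
}

fn main() {
    let pts: Vec<Point> = (0..8u64).map(|i| (i as f64, (7 - i) as f64, i)).collect();
    let tree = build(pts, 0, 2);
    // 8 points with leaf_size 2 split evenly into 4 leaves.
    assert_eq!(count_leaves(&tree), 4);
}
```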

Contributor

@ddupg ddupg left a comment

After a preliminary review of the code, I have a few suggestions/comments


if inner_batch.num_rows() > 0 {
let col_idx = get_col(&inner_batch, NODE_ID)?;
let node_ids = inner_batch
Contributor

Can it be simplified to inner_batch.column_by_name(NODE_ID)?


/// Inner node in BKD tree - contains split information and child pointers
#[derive(Debug, Clone, DeepSizeOf)]
pub struct BKDInnerNode {
Contributor

I'm wondering if we can reduce the storage space occupied by the BKD tree index by eliminating some fields.

Since split_dim and split_value appear to be used only during index training and not during index searching, is serializing them still necessary?

It seems possible to derive the child node ids from the current node id, if using preorder traversal like BKDTreeBuilder::build_recursive to serialize a balanced binary tree, for example:
left_child = node_id * 2 + 1 and right_child = node_id * 2 + 2
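For reference, the implicit addressing suggested here is the classic array-heap layout; strictly, the 2*i+1 / 2*i+2 formulas correspond to numbering the nodes of a complete binary tree in level order:

```rust
// Implicit child/parent addressing for a complete binary tree stored as an
// array in level order (the array-heap layout), so no pointers are serialized.
fn left_child(node_id: usize) -> usize { node_id * 2 + 1 }
fn right_child(node_id: usize) -> usize { node_id * 2 + 2 }
fn parent(node_id: usize) -> usize { (node_id - 1) / 2 }

fn main() {
    // Root is 0; its children are 1 and 2; node 2's children are 5 and 6.
    assert_eq!(left_child(0), 1);
    assert_eq!(right_child(2), 6);
    assert_eq!(parent(6), 2);
}
```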

pub fn new(params: BkdTreeBuilderParams) -> Self {
Self {
params,
criteria: TrainingCriteria::new(TrainingOrdering::Addresses).with_row_addr(),
Contributor

Why sorting here? When training a BKD tree in BKDTreeBuilder::build_recursive, the data is sorted by x or y values. Is sorting by address necessary here?


for i in 0..batch.num_rows() {
self.points
.push((x_array.value(i), y_array.value(i), row_ids.value(i)));
Contributor

Index training requires loading all data into memory, raising concerns about encountering an OOM error.


while let Some(batch) = batches_source.try_next().await? {
// Extract GeoArrow point coordinates
let geom_array = batch
Contributor

The value column may contain nulls, which may require separate handling.
