feat: support Point BKD Tree Geo Index and st_within interception #4632
base: main
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
jackye1995
left a comment
Added some initial comments to reorganize the code structure to fit lance code style
beinan
left a comment
lgtm, thanks!
@jaystarshot Nice work, super excited to see this PR. Regarding the implementation of the geospatial index, I have a few thoughts:
I'm currently working on a POC for a geospatial index implementation. The design philosophy is to implement an immutable R-tree directly within Lance. During index generation, data will be streamed into the target file, and query operations will only load the required R-tree nodes on demand.
Makes sense, thanks a lot! I wonder if we can use any implementations from https://ieeexplore.ieee.org/document/4221678 or related papers, which must already have tackled distributed R-trees.
@ddupg Is this the same as what you are suggesting?
If so, I am wondering if we can use any existing algorithms from rstar etc.
You can add the metadata as a part of the
You can take a look at the btree index implementation, which does something similar.
I think even the B+ tree index has the details empty, so maybe we can do that later. The logs from the new test are here: link. They show that 3 files were generated and only 2 were queried. I used the rstar interface to simulate lazy loading of the leaf pages (with leaf references implementing the envelope() method), so rstar decides which leaf nodes to check and we load only those. All leaf nodes are stored in separate files, so we reuse rstar's tree traversal and lazy-load the leaves afterwards.
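For readers following along, a minimal sketch of that idea using the rstar crate (the LeafRef type and its fields are hypothetical, not the PR's actual code):

```rust
use rstar::{RTree, RTreeObject, AABB};

// Hypothetical leaf reference: only the bounding box and a pointer to the
// leaf file are kept in memory; the points themselves stay on disk.
#[derive(Clone, Debug)]
struct LeafRef {
    file_id: u32,
    min: [f64; 2],
    max: [f64; 2],
}

impl RTreeObject for LeafRef {
    type Envelope = AABB<[f64; 2]>;

    fn envelope(&self) -> Self::Envelope {
        AABB::from_corners(self.min, self.max)
    }
}

fn main() {
    // Build the R-tree over leaf references only.
    let tree = RTree::bulk_load(vec![
        LeafRef { file_id: 0, min: [0.0, 0.0], max: [10.0, 10.0] },
        LeafRef { file_id: 1, min: [20.0, 20.0], max: [30.0, 30.0] },
    ]);

    // rstar prunes leaves whose envelope doesn't intersect the query bbox,
    // so only the surviving leaf files need to be fetched and decoded.
    let query = AABB::from_corners([1.0, 1.0], [5.0, 5.0]);
    for leaf in tree.locate_in_envelope_intersecting(&query) {
        println!("load leaf file {}", leaf.file_id);
    }
}
```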
@ddupg @jackye1995 I will work on more testing and productionizing this, since I believe this approach fixes the main problem of the distributed R-tree.
Sounds good, let me know whenever you want another review pass. Also, you probably want to rebase ASAP since we did some major refactoring; it will only get more difficult to rebase the longer you wait.
I think there are 3 top workstreams.
For 1, in the meantime, do you think it makes sense to borrow the simple DataFusion UDF implementations (e.g., ST_Within, ST_Distance) into Lance so that we unblock 1 and can build and validate the full index flow end-to-end? Then, once GeoDataFusion stabilizes, we could drop our local copies and reference their implementation directly. What do you think?
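For context, a rough sketch of the kind of kernel such a borrowed ST_Within-style UDF would wrap for point columns, using the arrow and geo crates (st_within_points is a made-up name, not Lance's or GeoDataFusion's actual code):

```rust
use arrow_array::{Array, BooleanArray, Float64Array};
use geo::{Contains, LineString, Point, Polygon};

/// Row-wise point-in-polygon mask over two coordinate columns.
fn st_within_points(xs: &Float64Array, ys: &Float64Array, polygon: &Polygon<f64>) -> BooleanArray {
    (0..xs.len())
        .map(|i| {
            if xs.is_null(i) || ys.is_null(i) {
                None // propagate nulls rather than guessing
            } else {
                Some(polygon.contains(&Point::new(xs.value(i), ys.value(i))))
            }
        })
        .collect()
}

fn main() {
    // 10x10 square as the query polygon.
    let poly = Polygon::new(
        LineString::from(vec![(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0), (0.0, 0.0)]),
        vec![],
    );
    let xs = Float64Array::from(vec![Some(1.0), Some(50.0), None]);
    let ys = Float64Array::from(vec![Some(2.0), Some(50.0), Some(3.0)]);
    let mask = st_within_points(&xs, &ys, &poly);
    println!("{:?}", mask); // rows: true, false, null
}
```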
Agree. Also cc @timsaucer in case you are interested in this topic.
Yes, thank you! I'm definitely interested. Also, to fully support the geo functions, I think we need to upgrade DataFusion. I have #4598 for this.
Wow, thanks!
The distributed index is still not complete; we should somehow let the library create the partitions and then serialize them. Currently we partition manually and give the bounds to rstar, which won't be efficient during queries.
2e16fb9 to 655f680
In my personal view, the implementation of geospatial expressions and indexing should be grounded in well-defined geospatial types. To ensure clarity and technical robustness, I propose first introducing foundational geospatial data types (e.g., Point, LineString, Polygon) into Lance. I submitted #4678 to introduce geospatial type support; however, it is also blocked on the DataFusion upgrade.
If you need Arrow 56 for #4678, then DF50, which will probably come out next week, will bring that in.
I might be missing something, but GeoArrow types are just metadata extensions on list and struct arrays, and that PR only seems to be adding DataFusion functions, not Lance-native geo types.
You may want to consider https://github.com/kylebarron/geo-index. It's a port of https://github.com/mourner/flatbush and is explicitly designed to be serializable and very memory efficient. You may be interested in this explainer I wrote on how it works: https://kylebarron.dev/literate-flatbush/
The primary downside is that it only supports 2 dimensions, so if you absolutely must index on more than 2 dimensions, it may not fit your needs.
Thanks a lot, will check it out.
@kylebarron I don't think this supports non-bulk loading (point deserialization), i.e. loading the entries into memory on the fly during a query. The Lance index reader is built around Arrow record batches, so a tree / geo index partitioned into record batches that can be fetched during retrieval (similar to the B-tree index) would be the best solution.
What do you mean by this?
Are you talking about tree creation, or re-loading the index from persisted bytes? If it's the former, you can stream tree creation in the sense that you instantiate the builder once and have total control over how and when you add data into the tree. Like any(?) static index, you need to know how many rows you have, but that shouldn't be too big of a problem. If it's fully streaming, where you don't know how many record batches you'll have, then you could have a bi-level index where each RecordBatch is associated with an individual tree, and then you have an index over the record batches. This is closer to what GeoParquet has, where there is only a "top-level" index in the file metadata that gives the bounding box of each RecordBatch; but in GeoParquet's case there is no within-batch index.
So you want an index per record batch? I don't know if I'm fully understanding you.
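A minimal sketch of that bi-level idea, under the assumption that the top level stores just one bounding box per serialized record batch (all names here are illustrative, not an existing API):

```rust
/// Top-level entry: the bounding box of one serialized record batch.
#[derive(Debug, Clone, Copy)]
struct BatchBBox {
    min_x: f64,
    min_y: f64,
    max_x: f64,
    max_y: f64,
}

impl BatchBBox {
    fn intersects(&self, other: &BatchBBox) -> bool {
        self.min_x <= other.max_x
            && self.max_x >= other.min_x
            && self.min_y <= other.max_y
            && self.max_y >= other.min_y
    }
}

/// Returns the indices of the record batches whose per-batch index (and data)
/// are worth fetching for this query window; everything else is skipped.
fn batches_to_fetch(batch_bboxes: &[BatchBBox], query: &BatchBBox) -> Vec<usize> {
    batch_bboxes
        .iter()
        .enumerate()
        .filter(|(_, b)| b.intersects(query))
        .map(|(i, _)| i)
        .collect()
}
```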
Thanks! I was actually more concerned about the latter: reloading the index from persisted bytes. Specifically, being able to fetch and decode just the index nodes/batches touched by a query (like page access in a B-tree), since the index may become very large. In that case, we'd want one or a small number of index nodes per Arrow record batch, so they're independently fetchable and the tree structure can be traversed to locate only the relevant nodes.
I see, some more background: the geo-index/flatbush "format" was originally designed in the flatbush JS library. It would be possible to devise a streaming reader for the format, such that you can navigate the tree without needing to load the entire buffer in advance. But the layout of the buffer wasn't particularly important for JS browser workloads, so it was probably designed in whatever way made it simplest to fill.

The downside of a streaming reader is that the metadata is located in the header at the beginning of the file, while the tree data is laid out bottom to top. So you'd need to fetch data both from the beginning and the end, and it wouldn't be ideal for high-latency network reads. This is why having a multi-level index might be nicer, so that each individual index can be small enough to just load into memory.

In related work, the FlatGeobuf file format devised a custom fork of the same format. My primary reason for hesitancy about
I think we should do something like Lucene. I am not an expert in BKD trees, but it sounds like each level splits the space on the x or y coordinate, and so on. We can store all the node metadata, as Lucene does, in a separate file (bkd_tree.lance) and the leaves separately in leaf_*.lance files.
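Roughly, the layout being proposed could look like this (a sketch only; the file names and struct fields are illustrative, not the PR's actual schema):

```rust
/// One serialized inner node of the tree; all of these live together in the
/// small metadata file so the tree can be traversed without touching leaves.
struct InnerNodeMeta {
    node_id: u32,
    split_dim: u8, // 0 = x, 1 = y
    split_value: f64,
    left_child: u32,
    right_child: u32,
}

/// Leaves are written to their own files and fetched lazily during a query.
fn metadata_file_name() -> String {
    "bkd_tree.lance".to_string()
}

fn leaf_file_name(leaf_id: u32) -> String {
    format!("leaf_{leaf_id}.lance")
}
```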
protos/index.proto
message BloomFilterIndexDetails {}

message GeoIndexDetails {}
I think we should call this something more specific, like "KdbTreeIndex", instead of a generic Geo index, so that we can have different types of geo indexes for different geo columns if necessary.
static BBOX_UDF: LazyLock<ScalarUDF> = LazyLock::new(bbox);
static ST_INTERSECTS_UDF: LazyLock<ScalarUDF> = LazyLock::new(st_intersects);

fn st_intersects() -> ScalarUDF {
I think we should directly use functions in geodatafusion
If that PR is merged, it's just a rebase. Also, I remember there was some issue in the DataFusion Python API where mixed-case functions couldn't be used.
rust/lance-index/src/scalar/bkd.rs
/// In-memory BKD tree structure for efficient spatial queries
#[derive(Debug, DeepSizeOf)]
pub struct BKDTreeLookup {
nit: we prefer to still use camel-case style for abbreviations, for example Hnsw instead of HNSW. So could you update the BKD-related terms to Bkd?
test_geoarrow_geo_index.py
@@ -0,0 +1,301 @@
#!/usr/bin/env python3
should move these to be official tests in the python module
Codecov Report
❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4632      +/-   ##
==========================================
- Coverage   81.70%   81.63%   -0.07%
==========================================
  Files         334      342       +8
  Lines      133293   139639    +6346
  Branches   133293   139639    +6346
==========================================
+ Hits       108902   113996    +5094
- Misses      20736    21827    +1091
- Partials     3655     3816     +161

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
@ddupg What do you think of the BKD tree? I have added some unit tests and a benchmark, and the index code seems okay to me.
Could you @jaystarshot elaborate further on why you no longer use an R-tree and instead chose a BKD tree? I'm currently reviewing this PR, and BKD trees are quite complex, so I'll need some more time. Regarding geo types and functions, I suggest we directly introduce geoarrow-rs and geodatafusion. This approach enables rich geo types and functions and avoids getting bogged down in complex geospatial details. I implemented the R-tree in #5034 precisely by doing this. Can we learn from each other's approaches to implement both R-tree and BKD tree geo indexes?
Thanks! I moved to it because Lucene and OpenSearch both use a BKD tree (as linked) and our use case is closer to theirs. Also, it's pretty straightforward tbh: at each level we sort all points along one dimension and split them into two children at the median, until we reach the leaf-size threshold (as an optimization, the sorting is done only once).
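A compact sketch of that split rule (not the PR's actual builder; this version re-sorts at every level for clarity, whereas the PR sorts only once as an optimization, and it assumes leaf_size >= 1 and no NaN coordinates):

```rust
/// (x, y, row_id) points, split recursively at the median along alternating axes.
#[derive(Debug)]
enum Node {
    Leaf(Vec<(f64, f64, u64)>),
    Inner {
        split_dim: usize, // 0 = x, 1 = y
        split_value: f64,
        left: Box<Node>,
        right: Box<Node>,
    },
}

fn build(mut points: Vec<(f64, f64, u64)>, depth: usize, leaf_size: usize) -> Node {
    if points.len() <= leaf_size {
        return Node::Leaf(points);
    }
    let split_dim = depth % 2; // alternate x / y per level
    points.sort_by(|a, b| {
        let (ka, kb) = if split_dim == 0 { (a.0, b.0) } else { (a.1, b.1) };
        ka.partial_cmp(&kb).unwrap()
    });
    let mid = points.len() / 2;
    let split_value = if split_dim == 0 { points[mid].0 } else { points[mid].1 };
    let right = points.split_off(mid); // `points` now holds the left half
    Node::Inner {
        split_dim,
        split_value,
        left: Box::new(build(points, depth + 1, leaf_size)),
        right: Box::new(build(right, depth + 1, leaf_size)),
    }
}
```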
ddupg
left a comment
After a preliminary review of the code, I have a few suggestions/comments
if inner_batch.num_rows() > 0 {
    let col_idx = get_col(&inner_batch, NODE_ID)?;
    let node_ids = inner_batch
Can this be simplified to inner_batch.column_by_name(NODE_ID)?
/// Inner node in BKD tree - contains split information and child pointers
#[derive(Debug, Clone, DeepSizeOf)]
pub struct BKDInnerNode {
I'm wondering if we can reduce the storage space occupied by the BKD tree index by eliminating some fields.
Since split_dim and split_value appear to be used only during index training and not during index searching, is serialization still necessary?
It also seems possible to derive the child node ids from the current node id. If a preorder traversal like BKDTreeBuilder::build_recursive is used to serialize a balanced binary tree, then for example:
left_child = node_id * 2 + 1 and right_child = node_id * 2 + 2
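In code, the suggested derivation would look roughly like this (a sketch that treats the balanced tree as an implicit array layout with the root at id 0; not the PR's current code):

```rust
/// Child ids follow from the node id, so they need not be serialized.
fn children(node_id: usize) -> (usize, usize) {
    (node_id * 2 + 1, node_id * 2 + 2)
}

/// The parent id can be derived the same way.
fn parent(node_id: usize) -> Option<usize> {
    if node_id == 0 {
        None
    } else {
        Some((node_id - 1) / 2)
    }
}
```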
pub fn new(params: BkdTreeBuilderParams) -> Self {
    Self {
        params,
        criteria: TrainingCriteria::new(TrainingOrdering::Addresses).with_row_addr(),
Why sort here? When training a BKD tree in BKDTreeBuilder::build_recursive, the data is sorted by x or y values. Is sorting by address necessary here?
for i in 0..batch.num_rows() {
    self.points
        .push((x_array.value(i), y_array.value(i), row_ids.value(i)));
Index training requires loading all the data into memory, which raises concerns about running into OOM errors.
while let Some(batch) = batches_source.try_next().await? {
    // Extract GeoArrow point coordinates
    let geom_array = batch
The value column may contain nulls, which may require separate handling.
BKD-based index, for points only to start with.
Only intercepts the st_intersects function. I am debating whether to add interception of other functions in this PR or in future PRs. Adding them here might be more concise and correctness could be tested in more detail, but the PR could become bloated.
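For illustration, the interception boils down to something like the following sketch using the geo crate (function names are made up and this is not the PR's actual planner code): derive a bounding box from the query geometry, probe the spatial index with it, then exact-check the surviving candidates.

```rust
use geo::{BoundingRect, Intersects, Point, Polygon};

/// Bounding box of the query geometry, used to probe the spatial index.
fn query_bbox(query: &Polygon<f64>) -> Option<([f64; 2], [f64; 2])> {
    let rect = query.bounding_rect()?;
    Some(([rect.min().x, rect.min().y], [rect.max().x, rect.max().y]))
}

/// Exact st_intersects re-check on the (x, y, row_id) candidates the index returned.
fn exact_filter(candidates: &[(f64, f64, u64)], query: &Polygon<f64>) -> Vec<u64> {
    candidates
        .iter()
        .filter(|(x, y, _)| query.intersects(&Point::new(*x, *y)))
        .map(|(_, _, row_id)| *row_id)
        .collect()
}
```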
Storage
This might also make future incremental indexing easier (we can rewrite the metadata and selected leaves).
Lucene code
doc1
javadoc
#4482 (comment)