-
Notifications
You must be signed in to change notification settings - Fork 490
feat: support GEO types #4678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support GEO types #4678
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
7a5c774 to
b1769c9
Compare
|
The PR is currently blocked by pending releases of datafusion and geodatafusion(geoarrow/geoarrow-rs#1317). These upstream dependencies need to stabilize before we can finalize the geospatial type integration. |
rust/lance-arrow/Cargo.toml
Outdated
| geoarrow = { workspace = true } | ||
| geoarrow-array = { workspace = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend depending only on geoarrow-array and geoarrow-schema, if those are the only two sub-crates you need. That'll reduce compilation times
rust/lance/src/dataset.rs
Outdated
| #[tokio::test] | ||
| async fn test_geo_types() { | ||
| use geo_types::{coord, line_string, Rect}; | ||
| use geoarrow::array::{LineStringBuilder, PointBuilder, PolygonBuilder}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can import these from geoarrow_array
rust/lance/src/dataset.rs
Outdated
| use geoarrow::datatypes::{ | ||
| Dimension, GeoArrowType, LineStringType, PointType, PolygonType, | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And these from geoarrow_schema
rust/lance/src/dataset.rs
Outdated
| let polygon_type = PolygonType::new(Dimension::XY, Default::default()); | ||
|
|
||
| let schema = arrow_schema::Schema::new(vec![ | ||
| GeoArrowType::from(point_type.clone()).to_field("point", true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI there's a to_field method on PointType, so you don't have to convert to GeoArrowType.
|
FWIW I thought I remembered having a discussion at some point about supporting geospatial indexes in Lance files, though I can't find it anymore. I thought it would be cool to embed an R-Tree or K-D tree index inside of a lance file and use that for efficient spatial queries. I have the https://github.com/kylebarron/geo-index crate, ported from the original JavaScript which defines a binary format for serializing and deserializing 2D R-Tree/K-D trees (most of the time geospatial support is OK with only 2-dimensional indexes). If you wanted to push more on geospatial support here, I think adding support for something like geo-index would be super cool. I'd be happy to review PRs but I probably won't have the time to drive that myself. |
+1 @kylebarron we can probably discuss more in the index PR #4632 or in the discussion at #4482. For all the other files, we also store the index contents in Lance files. It would be great if we can keep it consistent for the geo index. cc @jaystarshot |
Agreed; I saw that PR after I commented here! |
430db38 to
ed476bc
Compare
This PR is to update the following dependencies: - DataFusion to 50.0.0 - Arrow to 56.1 I am working on supporting GEO type #4678 , which is blocked by the DataFusion upgrade. Recently, DataFusion released 50 apache/datafusion#16799 , so follow up on the upgrade Closes #4753
0470ec4 to
84d19a7
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
dab5e06 to
72d886c
Compare
|
@ddupg I think we can revive this PR now since all the upgrades are in, is this ready for another look? |
Cargo.toml
Outdated
| futures = "0.3" | ||
| #geoarrow-array = "0.5.0" | ||
| #geoarrow-schema = "0.5.0" | ||
| geoarrow-array = { git = "https://github.com/geoarrow/geoarrow-rs", rev = "61a535b072766003ad06a4d7a25dcc15f010e68f", version = "0.5" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should at least make this a dedicated feature geo instead of directly in the main cargo. But even with that, we are inflating the dependency further because we have to have all features to do the python binding, which is against what we wanna do in #2224.
This makes me think should we make this a separated lance-geo (or geolance?) package? That also matches the pattern of having geoarrow, geoparquet, etc. as separated projects. What do we think? @westonpace @wjones127 @Xuanwo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or even more simply, we can just make these documentations, so users know how to use it in python & rust, what package to import, how to construct Arrow schema, what DataFusion UDFs to register, etc. We don't really need to have any dedicated code integrations until at some point in the future we are thinking about doing any special encoding for these Geo types, but that seems unlikely for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also cc @jaystarshot curious about your opinion here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or even more simply, we can just make these documentations, so users know how to use it in python & rust, what package to import, how to construct Arrow schema, what DataFusion UDFs to register, etc. We don't really need to have any dedicated code integrations until at some point in the future we are thinking about doing any special encoding for these Geo types, but that seems unlikely for now.
The current aim is to implement geo index accelerated UDF queries, hence the intention to introduce geo dependencies into Lance. Otherwise, it is indeed possible to rely solely on the documentation.
This makes me think should we make this a separated lance-geo (or geolance?) package?
It is a good idea to make a separated lance-geo package or repo. If so, I'm a bit confused about how Python or Java users would import it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current aim is to implement geo index accelerated UDF queries, hence the intention to introduce geo dependencies into Lance.
I think we mainly do it through checking the actual function name, and it does not necessary mean you need to import the function?
For example in FTS, we have the FTS query parser as:
fn visit_scalar_function(
&self,
column: &str,
data_type: &DataType,
func: &ScalarUDF,
args: &[Expr],
) -> Option<IndexedExpression> {
if args.len() != 2 {
return None;
}
let scalar = maybe_scalar(&args[1], data_type)?;
if let ScalarValue::Utf8(Some(scalar_str)) = scalar {
if func.name() == "contains_tokens" {
let query = TokenQuery::TokensContains(scalar_str);
return Some(IndexedExpression::index_query(
column.to_string(),
self.index_name.clone(),
Arc::new(query),
));
}
}
None
}
but that does not mean the function really exists, since we are just checking function name here, the function registration can happen elsewhere against the DataFusion context.
I think we can do similar things for geo functions to match names like "st_intersects". What would be the challenge of doing this?
I guess that requires us to align with the positional arguments in the function, but I think that seems reasonable as a way to decouple the 2.
There are still a few blocking points at present:
|
I'm pretty much just trying to flesh out documentation now (e.g. datafusion-contrib/geodatafusion#34). geoarrow 0.6 is published.
You don't seem to be using |
6f4cb94 to
9d1daa3
Compare
8848e24 to
258f4a3
Compare
I tested with geoarrow 0.6 and geodatafusion 0.1.0-beta and it passed, pyo3 is not blocking. |
56e8d51 to
8ace4a0
Compare
Based on offline discussions with @jackye1995 , geospatial index requires parsing geospatial data, which is highly complex. Since geoarrow-rs has already implemented this effectively, we should introduce geoarrow-rs as a dependency into Lance. geodatafusion does not require direct dependency in Lance, we only need to align with the positional arguments in the function of geodatafusion. Additionally, documentation should be provided explaining how to integrate geodatafusion. Therefore, I don't think we need a separate pull request for geoarrow. I've already implemented the RTree index #5034 , where geoarrow was introduced. |
8ace4a0 to
275ba7d
Compare
0a3de01 to
56cb243
Compare
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pending CI fix to rebase
982bcf3 to
553deff
Compare
9a3ebbd to
7937133
Compare
|
Sorry merged another PR, could you rebase again? |
7937133 to
40fdf27
Compare
Sure thing, I've manually rebased and resolved the conflicts. |
Change-Id: Ie3f0bbee0f4dd89ff42d2e7eaca6597887b2c252
40fdf27 to
a383992
Compare
|
Thanks for the contribution, this is huge!!! 🚀 |
No description provided.