Skip to content

Conversation

@ddupg
Copy link
Contributor

@ddupg ddupg commented Sep 9, 2025

No description provided.

@github-actions
Copy link
Contributor

github-actions bot commented Sep 9, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@ddupg
Copy link
Contributor Author

ddupg commented Sep 9, 2025

The PR is currently blocked by pending releases of datafusion and geodatafusion(geoarrow/geoarrow-rs#1317). These upstream dependencies need to stabilize before we can finalize the geospatial type integration.

@jackye1995
Copy link
Contributor

@ddupg I just merged #4598, some of these can probably be cleaned up and rebased now

geoarrow = { workspace = true }
geoarrow-array = { workspace = true }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend depending only on geoarrow-array and geoarrow-schema, if those are the only two sub-crates you need. That'll reduce compilation times

#[tokio::test]
async fn test_geo_types() {
use geo_types::{coord, line_string, Rect};
use geoarrow::array::{LineStringBuilder, PointBuilder, PolygonBuilder};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can import these from geoarrow_array

Comment on lines 7806 to 7808
use geoarrow::datatypes::{
Dimension, GeoArrowType, LineStringType, PointType, PolygonType,
};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And these from geoarrow_schema

let polygon_type = PolygonType::new(Dimension::XY, Default::default());

let schema = arrow_schema::Schema::new(vec![
GeoArrowType::from(point_type.clone()).to_field("point", true),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI there's a to_field method on PointType, so you don't have to convert to GeoArrowType.

@kylebarron
Copy link

FWIW I thought I remembered having a discussion at some point about supporting geospatial indexes in Lance files, though I can't find it anymore.

I thought it would be cool to embed an R-Tree or K-D tree index inside of a lance file and use that for efficient spatial queries. I have the https://github.com/kylebarron/geo-index crate, ported from the original JavaScript which defines a binary format for serializing and deserializing 2D R-Tree/K-D trees (most of the time geospatial support is OK with only 2-dimensional indexes).

If you wanted to push more on geospatial support here, I think adding support for something like geo-index would be super cool. I'd be happy to review PRs but I probably won't have the time to drive that myself.

@jackye1995
Copy link
Contributor

FWIW I thought I remembered having a discussion at some point about supporting geospatial indexes in Lance files, though I can't find it anymore.

I thought it would be cool to embed an R-Tree or K-D tree index inside of a lance file and use that for efficient spatial queries. I have the https://github.com/kylebarron/geo-index crate, ported from the original JavaScript which defines a binary format for serializing and deserializing 2D R-Tree/K-D trees (most of the time geospatial support is OK with only 2-dimensional indexes).

+1

@kylebarron we can probably discuss more in the index PR #4632 or in the discussion at #4482. For all the other files, we also store the index contents in Lance files. It would be great if we can keep it consistent for the geo index. cc @jaystarshot

@kylebarron
Copy link

@kylebarron we can probably discuss more in the index PR #4632 or in the discussion at #4482. For all the other files, we also store the index contents in Lance files. It would be great if we can keep it consistent for the geo index. cc @jaystarshot

Agreed; I saw that PR after I commented here!

@ddupg ddupg force-pushed the feat/geo-type-expr branch 2 times, most recently from 430db38 to ed476bc Compare September 10, 2025 06:55
westonpace pushed a commit that referenced this pull request Oct 3, 2025
This PR is to update the following dependencies:
- DataFusion to 50.0.0
- Arrow to 56.1

I am working on supporting GEO type #4678 , which is blocked by the
DataFusion upgrade. Recently, DataFusion released 50
apache/datafusion#16799 , so follow up on the
upgrade

Closes #4753
@ddupg ddupg force-pushed the feat/geo-type-expr branch 3 times, most recently from 0470ec4 to 84d19a7 Compare October 9, 2025 08:21
@codecov-commenter
Copy link

codecov-commenter commented Oct 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@ddupg ddupg force-pushed the feat/geo-type-expr branch 5 times, most recently from dab5e06 to 72d886c Compare October 11, 2025 07:21
@jackye1995
Copy link
Contributor

@ddupg I think we can revive this PR now since all the upgrades are in, is this ready for another look?

Cargo.toml Outdated
futures = "0.3"
#geoarrow-array = "0.5.0"
#geoarrow-schema = "0.5.0"
geoarrow-array = { git = "https://github.com/geoarrow/geoarrow-rs", rev = "61a535b072766003ad06a4d7a25dcc15f010e68f", version = "0.5" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should at least make this a dedicated feature geo instead of directly in the main cargo. But even with that, we are inflating the dependency further because we have to have all features to do the python binding, which is against what we wanna do in #2224.

This makes me think should we make this a separated lance-geo (or geolance?) package? That also matches the pattern of having geoarrow, geoparquet, etc. as separated projects. What do we think? @westonpace @wjones127 @Xuanwo

Copy link
Contributor

@jackye1995 jackye1995 Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even more simply, we can just make these documentations, so users know how to use it in python & rust, what package to import, how to construct Arrow schema, what DataFusion UDFs to register, etc. We don't really need to have any dedicated code integrations until at some point in the future we are thinking about doing any special encoding for these Geo types, but that seems unlikely for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also cc @jaystarshot curious about your opinion here as well.

Copy link
Contributor Author

@ddupg ddupg Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even more simply, we can just make these documentations, so users know how to use it in python & rust, what package to import, how to construct Arrow schema, what DataFusion UDFs to register, etc. We don't really need to have any dedicated code integrations until at some point in the future we are thinking about doing any special encoding for these Geo types, but that seems unlikely for now.

The current aim is to implement geo index accelerated UDF queries, hence the intention to introduce geo dependencies into Lance. Otherwise, it is indeed possible to rely solely on the documentation.

This makes me think should we make this a separated lance-geo (or geolance?) package?

It is a good idea to make a separated lance-geo package or repo. If so, I'm a bit confused about how Python or Java users would import it.

Copy link
Contributor

@jackye1995 jackye1995 Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current aim is to implement geo index accelerated UDF queries, hence the intention to introduce geo dependencies into Lance.

I think we mainly do it through checking the actual function name, and it does not necessary mean you need to import the function?

For example in FTS, we have the FTS query parser as:

fn visit_scalar_function(
        &self,
        column: &str,
        data_type: &DataType,
        func: &ScalarUDF,
        args: &[Expr],
    ) -> Option<IndexedExpression> {
        if args.len() != 2 {
            return None;
        }
        let scalar = maybe_scalar(&args[1], data_type)?;
        if let ScalarValue::Utf8(Some(scalar_str)) = scalar {
            if func.name() == "contains_tokens" {
                let query = TokenQuery::TokensContains(scalar_str);
                return Some(IndexedExpression::index_query(
                    column.to_string(),
                    self.index_name.clone(),
                    Arc::new(query),
                ));
            }
        }
        None
    }

but that does not mean the function really exists, since we are just checking function name here, the function registration can happen elsewhere against the DataFusion context.

I think we can do similar things for geo functions to match names like "st_intersects". What would be the challenge of doing this?

I guess that requires us to align with the positional arguments in the function, but I think that seems reasonable as a way to decouple the 2.

@ddupg
Copy link
Contributor Author

ddupg commented Oct 16, 2025

@ddupg I think we can revive this PR now since all the upgrades are in, is this ready for another look?

There are still a few blocking points at present:

@kylebarron
Copy link

I'm pretty much just trying to flesh out documentation now (e.g. datafusion-contrib/geodatafusion#34). geoarrow 0.6 is published.

  • It may be necessary to wait Arrow 57, as the pyo3 conflicts introduced by geoarrow.

You don't seem to be using pyo3-geoarrow, so it doesn't look like your use of geoarrow depends on pyo3 at all

@ddupg ddupg force-pushed the feat/geo-type-expr branch 2 times, most recently from 6f4cb94 to 9d1daa3 Compare October 17, 2025 03:12
@ddupg ddupg force-pushed the feat/geo-type-expr branch 2 times, most recently from 8848e24 to 258f4a3 Compare October 17, 2025 03:39
@ddupg
Copy link
Contributor Author

ddupg commented Oct 17, 2025

You don't seem to be using pyo3-geoarrow, so it doesn't look like your use of geoarrow depends on pyo3 at all

I tested with geoarrow 0.6 and geodatafusion 0.1.0-beta and it passed, pyo3 is not blocking.

@ddupg ddupg force-pushed the feat/geo-type-expr branch from 56e8d51 to 8ace4a0 Compare October 17, 2025 07:31
@ddupg
Copy link
Contributor Author

ddupg commented Oct 22, 2025

The current aim is to implement geo index accelerated UDF queries, hence the intention to introduce geo dependencies into Lance.

I think we mainly do it through checking the actual function name, and it does not necessary mean you need to import the function?

For example in FTS, we have the FTS query parser as:

fn visit_scalar_function(
        &self,
        column: &str,
        data_type: &DataType,
        func: &ScalarUDF,
        args: &[Expr],
    ) -> Option<IndexedExpression> {
        if args.len() != 2 {
            return None;
        }
        let scalar = maybe_scalar(&args[1], data_type)?;
        if let ScalarValue::Utf8(Some(scalar_str)) = scalar {
            if func.name() == "contains_tokens" {
                let query = TokenQuery::TokensContains(scalar_str);
                return Some(IndexedExpression::index_query(
                    column.to_string(),
                    self.index_name.clone(),
                    Arc::new(query),
                ));
            }
        }
        None
    }

but that does not mean the function really exists, since we are just checking function name here, the function registration can happen elsewhere against the DataFusion context.

I think we can do similar things for geo functions to match names like "st_intersects". What would be the challenge of doing this?

I guess that requires us to align with the positional arguments in the function, but I think that seems reasonable as a way to decouple the 2.

Based on offline discussions with @jackye1995 , geospatial index requires parsing geospatial data, which is highly complex. Since geoarrow-rs has already implemented this effectively, we should introduce geoarrow-rs as a dependency into Lance.

geodatafusion does not require direct dependency in Lance, we only need to align with the positional arguments in the function of geodatafusion. Additionally, documentation should be provided explaining how to integrate geodatafusion.

Therefore, I don't think we need a separate pull request for geoarrow. I've already implemented the RTree index #5034 , where geoarrow was introduced.

@ddupg ddupg force-pushed the feat/geo-type-expr branch from 8ace4a0 to 275ba7d Compare November 26, 2025 09:05
@ddupg ddupg changed the title [WIP] feat: support GEO types feat: support GEO types Nov 26, 2025
@github-actions github-actions bot added the enhancement New feature or request label Nov 26, 2025
@ddupg ddupg force-pushed the feat/geo-type-expr branch 5 times, most recently from 0a3de01 to 56cb243 Compare November 26, 2025 10:54
Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pending CI fix to rebase

@jackye1995 jackye1995 force-pushed the feat/geo-type-expr branch 2 times, most recently from 982bcf3 to 553deff Compare November 27, 2025 00:47
@ddupg ddupg force-pushed the feat/geo-type-expr branch 2 times, most recently from 9a3ebbd to 7937133 Compare November 27, 2025 02:21
@jackye1995
Copy link
Contributor

Sorry merged another PR, could you rebase again?

@ddupg ddupg force-pushed the feat/geo-type-expr branch from 7937133 to 40fdf27 Compare November 27, 2025 02:47
@ddupg
Copy link
Contributor Author

ddupg commented Nov 27, 2025

Sorry merged another PR, could you rebase again?

Sure thing, I've manually rebased and resolved the conflicts.

Change-Id: Ie3f0bbee0f4dd89ff42d2e7eaca6597887b2c252
@ddupg ddupg force-pushed the feat/geo-type-expr branch from 40fdf27 to a383992 Compare November 27, 2025 03:05
@jackye1995
Copy link
Contributor

Thanks for the contribution, this is huge!!! 🚀

@jackye1995 jackye1995 merged commit b133306 into lance-format:main Nov 27, 2025
24 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants