Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Representation of spatial types on export to ArrowArrayStream #153

Open
paleolimbot opened this issue Oct 24, 2023 · 13 comments
Open

Representation of spatial types on export to ArrowArrayStream #153

paleolimbot opened this issue Oct 24, 2023 · 13 comments
Labels
enhancement New feature or request

Comments

@paleolimbot
Copy link

Currently when we export a spatial column to Arrow, we get:

import duckdb
duckdb.sql("LOAD spatial;")
duckdb.sql("SELECT ST_GeomFromText('POINT (0 1)') as geom").to_arrow_table()
#> pyarrow.Table
#> geom: binary
#> ----
#> geom: [[000020000000000000000000010000000000000000000000000000000000F03F]]

With geoarrow-python we can register extension types to propagate CRS and type name through pyarrow machinery:

import geoarrow.pyarrow as ga
array = ga.with_crs(ga.as_wkb(["POINT (0 1)"]), '{<some projjson>}')
array
#> GeometryExtensionArray:WkbType(geoarrow.wkb <{<some projjson>}>)[1]
#> <POINT (0 1)>
array.type.extension_name
#> 'geoarrow.wkb'
array.type.__arrow_ext_serialize__()
#> b'{"crs":"{<some projjson>}"}'

Would it be appropriate to export geometry columns to GeoArrow extension arrays (i.e., with ARROW:extension:name and ARROW:extension:type set according to https://github.com/geoarrow/geoarrow/blob/main/extension-types.md ? It looks like what is exported by default is your internal binary representation (perhaps not for the "native" unserialized types), but I think there are other types that are reencoded on export to Arrow (e.g., boolean columns are bitpacked).

@Maxxen
Copy link
Member

Maxxen commented Oct 24, 2023

Of course! Now that GeoArrow is more mature we should definitely make sure that we can export our geometry types properly to it.

I'm not entirely sure how much we can influence the arrow functionality that duckdb provides in its core from an extension when it comes to custom types, or if that would require some changes to upstream duckdb, but I'll look into it for sure.

@paleolimbot
Copy link
Author

Awesome! This may be a little deep into the internals of DuckDB extension land for me to be of much help, but with some scaffold of where to start I would certainly be willing to try!

@kylebarron
Copy link

kylebarron commented Oct 24, 2023

Just chiming in that I would love tighter integration between duckdb spatial and geoarrow! Let me know if I can help; I just don't know C++

@Maxxen
Copy link
Member

Maxxen commented Oct 24, 2023

So I think the main difficulty here is enabling DuckDB's types to carry arbitrary metadata. As of now a "custom" duckdb type is really just a wrapper of another type + an alias. Somehow DuckDB core needs to be able to identify that the custom types we register in spatial should be encoded differently in arrow, and also figure out how to do that encoding.

This is the same problem we have with GeoParquet, we don't want extensions to have dependencies on each other (or even be aware of each other), so all communication has to go through DuckDB itself in a generic way.

I don't think it's too unreasonable to allow custom types to also carry arbitrary key-value data (similar to how parquet/arrow work?) that can be interpreted at different encoding/decoding endpoints. Although for GeoArrow I guess we'll need to be able to communicate how to decode the coordinate data from the binary format, maybe by passing a function pointer as the value for a "meta::arrow::serialization_func" key? I'll have to talk to the rest of the team some more, there might be an easier way to do this at the python level as well.

@paleolimbot
Copy link
Author

Let us know! It seems like it would be a huge win for all extensions that implement types to customize Arrow in/out but I know little about the bigger picture!

@kylebarron
Copy link

I'll have to talk to the rest of the team some more, there might be an easier way to do this at the python level as well.

Python is a big use case, but it would be great to have C-based Arrow interop as well. Personally I'd love to connect duckdb spatial to the geoarrow-native algorithms I'm developing in Rust (wrapping georust), especially if there's no copy out of duckdb

@kylebarron
Copy link

One of the projects I'm working on is lonboard, a geoarrow-based interactive data visualization tool that can handle millions of coordinates. I'd love to have an example of using it with duckdb-spatial, visualizing the result of a query. And since DuckDB attributes are Arrow-like, there should be minimal conversion overhead. But this would be be much easier (and simpler for users) if DuckDB spatial could include GeoArrow metadata on its exports, even if that's just marking the geometry column as geoarrow.wkb.

@paleolimbot
Copy link
Author

@Maxxen would it be possible to point to a few relevant places in the code base where these things are handled (e.g., where it gets decided what Arrow type to output, or where where the extension interface would need to be modified to make that happen)? I may have some time to take a stab at this in the next month or so.

@Maxxen
Copy link
Member

Maxxen commented Feb 26, 2024

@paleolimbot Sure! Here's my thoughts

One way to solve this is to introduce some sort of new ArrowConversionExtension object in the DBConfig that extensions can subclass and provide a callback(s) (I guess both ways) to some sort of arrow conversion routine, similar to how e.g. OptimizerExtension works. The arrow conversion code in arrow_converter.cpp could then first try to check if the type has an alias (e.g, is a custom type) and if that alias has a matching ArrowConversionExtension in the current config, accessible from the ClientContext, which is almost always threaded down everywhere. You'd probably have to do this check higher up anyway, the alias is only available on the LogicalType itself, once you enter the switch on LogicalTypeId it's too late. Im not actually too familiar with all the arrow stuff but it all lives inside DuckDB core. Here's a commit from a recent PR where I added Arrow support (and python object conversion) for the new ARRAY type.

However, this would be a hardcoded solution for just Arrow. What if we want to de/serialize our custom type differently in e.g. JSON or Parquet? Im going to work on GeoParquet soon so it's not entirely a what-if. However special casing this for Arrow might not be such a bad idea regardless and could be a good first step for custom type conversion which could be generalized/unified further in the future, so don't let me dissuade you :).

I guess you would maybe need some way to let the users control how this conversion is done though? I could imagine some users just want DuckDB geometries as an arrow array of WKB blobs in some cases, or want to decide if they want interleaved/split coordinate arrays. You could maybe do this with a database option (e.g. SET use_interleaved_arrow_spatial_conversion = TRUE) but it would be nicer to control per-query. Maybe you could pass on an optional list of arguments (kwargs-style) from the arrow conversion table function into the extension callback to interpret as they please. I think that's how our clients that can output arrow work (node, python), they just call a special table function... but im note sure? You would then need to expose those extra optional arguments in the client APIs too.


The bigger problem is that we would like to enable custom types to have their own custom "metadata". This metadata could be used to communicate attributes of types across extensions and core DuckDB. E.g. DuckDB could inspect this metadata when encountering an unknown custom type to see if it contains a specially labeled entry to a function pointer to a custom arrow conversion function. But this gets complicated as soon as you want to attach stuff to the type itself that's not easily persistable (like a function pointer) as you now need to defer the deserialization/binding of the type to after the required extension is loaded and can provide this ephemeral data again.

I did do some work on general custom metadata in this branch a while back but ran into a dead-end regarding persistence. I did talk to Mark about it and we sketched out a better plan for it but I haven't touched the branch since as I got busy with other stuff.

In that branch I was mostly concerned with using type metadata to enable custom types to have "parameters", e.g. having the geometry type optionally be parameterized over SRID GEOMETRY(EPGS:4326). We'd need to be a somewhat careful when designing how the casting rules should interpret differently parameterized type, but that's another issue.

@paleolimbot
Copy link
Author

Thank you for the thoughts! In the next few weeks I'll do some poking and see if I can come up with anything useful.

@Maxxen
Copy link
Member

Maxxen commented May 6, 2024

Hey @paleolimbot did you end up working on this? Im thinking of revisiting this soon and have some fresh ideas on how to make this happen without blowing up the scope with all the custom type metadata stuff

@paleolimbot
Copy link
Author

I did not end up working on this despite really wanting to! I am happy to help review (and will have more time after nanoarrow 0.5.0 is released in a week or two).

@Maxxen
Copy link
Member

Maxxen commented May 6, 2024

Great! Ill try to give it a shot soon then and ping you once I got something!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants