Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DuckDB integration? #10887

Open
rouault opened this issue Sep 27, 2024 · 15 comments
Open

DuckDB integration? #10887

rouault opened this issue Sep 27, 2024 · 15 comments
Labels
enhancement rumination Verbiage about vague ideas

Comments

@rouault
Copy link
Member

rouault commented Sep 27, 2024

I've been fairly impressed by my testing of DuckDB. The speed of their SQL engine is really impressive. And they have a very impressive Parquet reader too (or at least the interaction with it through SQL shines). I don't know if I did something wrong with the Arrow Parquet reader (I don't think so ..?!?), but it is obvious that querying Parquet through DuckDB is much faster than through OGR SQL + Arrow Parquet, I believe especially if the result set is small compared to the table size, despite my attempts at translating OGR SQL to Arrow filtering capabilities, or that might require even further effort to fully leverage its capabilities. I dunno.

What could be very cool, but I don't know if duckdb allows it through the API, would be to have a "DuckDB SQL" dialect, similar to our current "SQLite SQL" dialect, where we could for example expose a OGR layer as an ArrowStream and make DuckDB process the SQL statement, and get back the result (obviously the DuckDB spatial extension is able to do that, using probably some internal API available to extensions).

That said, it's not all roses:

  • one major point is the packaging story of DuckDB. Which is close to be inexistant at that point, at least for duckdb as a lib. I guess it would be relative easy to improve the conda-forge duckdb package to also provide the lib & header. But I've no hope to see any time soon a libduckdb on Debian official repository given how much DuckDB vendorizes things (Debian doesn't provide libarrow in its official repositories, for similar reasons I suspect, so no jealousy here)
  • another potential pain point is that DuckDB Spatial extension statically links GDAL. so a GDAL linking against libduckdb will at some point link back to another copy of GDAL. In my testing, at my own surprise, nothing terrible happened, but this is far from being mind relieving. I do have terrible memories of ten years ago when GDAL embedded a libtiff v4 copy and also linked against libtiff v3 of the operating system, resulting in random crashes.
  • I did an attempt at prototyping a OGRPyDuckDB driver, ie. a OGR driver embedding python to use (py)duckdb, to just get a PyCapsule __arrow_c_stream__ to retrieve an ArrowStream and use it on the GDAL C++ side . It's there: rouault@422220b . It works ... as a prototype (note: I didn't do anything spatial in it) ... but my findings are that there might an inefficiency in the way DuckDB maps it "native model" to Arrow, as it seems to fully "materialize" the whole result set and lacks an easy way to specify the "page size", and my impression was even with its huge default paging, which can be an issue when their size exceed available RAM. Apologies if my conclusions are wrong (lots of new source code to figure out...) Whereas using duckdb-the-binary, I've not observed that behavior and when playing with ulimit -v, I could easily get result sets largely exceeding the allocated virtual memory. So in the current state of things, to get nice speed, one should probably implement the native DuckDB result set API, which makes things less appetizing than using the ArrowStream one
  • I have been confused by the DuckDB C API where it seems the Arrow interface is deprecated, and I didn't see any replacement... From comments it seemed this was planned, but that let me a bit puzzled.

I don't have further concrete plans for now. Just a memory dump that might raise interest or dismay :-)

Perhaps this is not a good idea after all, from an "overall ecosystem point of view", although having speedy Parquet SQL in OGR itself is very tempting ...

CC FYI @jorisvandenbossche @Maxxen @dbaston

@rouault rouault added the rumination Verbiage about vague ideas label Sep 27, 2024
@rouault
Copy link
Member Author

rouault commented Sep 27, 2024

For good measure CC @ptitjano as a QDuckDB contributor (I unfortunately don't know the github handle of Florent Fougères)

@mdsumner
Copy link
Contributor

mdsumner commented Sep 27, 2024

Including @krlmlr maintainer of duckdb R package 🙏

@cholmes
Copy link
Contributor

cholmes commented Sep 27, 2024

This would be awesome! I suspect it might be hard for OGR to get as fast as DuckDB's parquet stuff, as DuckDB is really good at optimizing to take advantage of all the cores, and they've spent a ton of time making parquet parsing work really, really well.

I was thinking of a much more lightweight integration, which is just supporting DuckDB with a vector driver. I'm usually using DuckDB to go in and out of Parquet, but sometimes it's nice to just leave it as a duckdb database, and it'd be nice to just throw that directly into QGIS. Like don't worry about getting the arrow stream with python to start - just connect to it like you do directly to other databases / formats. Not sure if that helps with any derisking of the core stuff, but it's definitely something that would be useful to me.

@mdsumner
Copy link
Contributor

@nyalldawson because excited folks are wanting QGIS to Quack 😃

@ptitjano
Copy link

ptitjano commented Sep 27, 2024

Thank you @rouault for the ping.

Indeed, we, Oslandia, have been developing a QGIS python plugin (QDuckdb) for one of your clients. It should work well and it offers a straightforward way to load and use a DuckDB database in QGIS.

However, we have some reached some limits from a performance perspective with the python QGIS overhead. We have been able to solve of them but we are limited by the architecture of a QGIS Provider which relies on an iterator and returns the rows of a table one by one. This is really costly on the python side. I think it would be very interesting to add support for the Arrow Stream (or a similar one if this one is really deprecated) which would allow to take full advantage of the DuckDB architecture.

Before making the python plugin, we asked ourselves if we could directly write a C++ provider in QGIS core. However, we would have faced some difficulties, some of them already mentioned by @rouault:

  • DuckDB packaging is still limited at the moment (In the python plugin, we embed the duckdb dependency to solve this issue)
  • the spatial extension statically links some dependencies (GDAL, PROJ, GEOS, etc...) which are dynamically linked by QGIS libraries. This probably would not work in the general case!
  • DuckDB was still in the 0.x days and it would be much easier to make evolve a python provider in case of any API break
  • it would be easier to make a c++ provider with all the experience gained from the python plugin

As a side note, the few interactions we had with the DuckDB community (mostly a few issues on github and some discord chats) have been very positive and everyone has been helpful and positive. Thank you!

I think, that an OGR driver would be a great addition, and we, Oslandia, would be very much interested to help move it forward.

cc @florentfgrs (Florent Fougères)

@Maxxen
Copy link
Contributor

Maxxen commented Sep 27, 2024

Im out drinking beer. Ill respond tomorrow but I have a lot of thoughts about this!

@rouault
Copy link
Member Author

rouault commented Sep 27, 2024

I was thinking of a much more lightweight integration, which is just supporting DuckDB with a vector driver.

Obviously! I jumped right away to the stretch goal of having a DuckDB SQL dialect, but that's clearly the ultimate step of the integration and possibly not easily reachable (or just not delivering stellar performance depending on its requirements on the input source). The traditional OGR driver has for sure to come first, reading both DuckDB native database, and cherry-on-the-cake Parquet files (or CSV etc). For that my mention of using the ArrowStream interface is actually a way of having a very small code base, because I've already coded the generic bidirectionnal OGRFeature <--> ArrowArray convertor in past GDAL versions, and which I've leveraged in my toy/demo OGRPyDuckDB prototype. So if the ArrowStream interface of DuckDB is available through C or C++ (the use of the Python duckdb package was because it provided a """convenient""" way of getting an ArrowStream, ahem pending embedding Python in C++, but there's precedent in GDAL. Anyway that's definitely not the target architecture I'm thinking about) and that getting data through the ArrowStream interface leads to optimal or near optimal performance. My naive thoughts are that DuckDB being columnar by nature, that should in theory be a cheap adaptation layer, but reality might be more complex if DuckDB uses optimized representations (which I believe they do), but only experts on DuckDB core could confirm/infirm that. Otherwise we might have to use its specific API to get the best of the performance, sacrificing a bit the convenience of the implementation.
But for now, all points under "not all roses" section should be considered as blockers to solve for the effort to succeed.

Im out drinking beer.

Enjoy it and your week end. No emergency. If I've labelled this ticket as "rumination", this is a hint that on my side at least there's no identified funding vehicle to make that happen.

@Guts
Copy link

Guts commented Sep 28, 2024

As minor contributor to QDuckDB but really impressed by the DuckDB project, I'm excited to see its support coming in GDAL!

What could be very cool, but I don't know if duckdb allows it through the API, would be to have a "DuckDB SQL" dialect, similar to our current "SQLite SQL" dialect[...]

Exactly what I would expect!

@florentfgrs
Copy link

Glad to see interest in integrating DuckDB into GDAL. Can't wait to see what happens next!

@Maxxen
Copy link
Contributor

Maxxen commented Oct 2, 2024

Sooo... argh, where to begin? I guess ill just start by responding to some of the points brought up:

What could be very cool, but I don't know if duckdb allows it through the API, would be to have a "DuckDB SQL" dialect, similar to our current "SQLite SQL" dialect.

Potentially, we're experimenting with "parser extensions", but there are already some extensions that do add new syntax, like PRQL and duckdb-pgq. Although these integrate very deeply with DuckDB's unstable c++ internals so I don't know how sustainable doing something similar right now would be in the long-term.

one major point is the packaging story of DuckDB.

Yes, iirc we at one point attempted to ship a pre-built static binary of the duckdb library to VCPKG, but it was rejected as they did not like us vendoring our own version of third party dependencies. Im not sure if e.g. Debian have similar qualms but I would expect so.

One thing that could be interesting for GDAL is to use DuckDB's amalgamation build, which gives you a single huge c++ file + header of the entire duckdb source code (sqlite style). This is how we've dealt with distributing DuckDB in a few of the client libraries (e.g. swift, golang). We also include the parquet extension in the amalgamation build by default, and it should be possible to embed others as well with some modification to the generation script. The downside of this is of course that you may end up needing a pretty beefy machine to actually compile it.

I have been confused by the DuckDB C API

We've made a lot of changes to the way result sets are handled In the C-API and the plan is to replace the current arrow api with something better in the future (that is also more performant/doesnt have to materialize everything). We ended up marking things as deprecated even though there are no replacements yet with the idea that it is better to be early with this sort of stuff, but you're not alone in finding that confusing haha.

another potential pain point is that DuckDB Spatial extension statically links GDAL.

I think this is probably the biggest challenge regarding integrating DuckDB in GDAL. There's a couple of different angles to this.

One idea is to consider integrating DuckDB only for its parquet capabilities. But even if we ignore the spatial extension, there is an implicit dependency between the two when reading/writing GeoParquet. Basically, the parquet extension will look through the DuckDB catalog to find the ST_GeomFromWKB, ST_AsWKB , ST_Extent and ST_ZMFlag functions. Although it is possible to do the conversion and metadata reading from outside DuckDB when reading, but it does get a lot more complicated when writing as you would have to perform the extent/geometry type aggregation separately and somehow slap it on to the parquet footer after the file is written...?. Alternatively we could also provide these required functions separately in a sort of statically linked "mini" spatial extension just to enable DuckDB's geoparquet conversion for GDAL.

Another approach, that I've had in the back of my mind as one of the long-term goals for DuckDB spatial is to remove GDAL from the spatial extension itself, and ship a spatial aware DuckDB driver for GDAL, thus "inverting" the dependency. This is a pretty radical move, (perhaps, equally radical as embedding GDAL in the first place) but I do think it makes a lot of sense.

  • Compiling and shipping GDAL statically for all of our supported platforms is a lot of work.
  • Embedding a static GDAL prevent users from using optional/different versions of GDAL drivers.
  • DuckDB really wants to have a lot more control over resource/system usage and provides its own platform abstractions, which don't always cleanly map to GDAL.
    • DuckDB can't enforce GDAL respecting limits for threads/memory.
    • Mapping DuckDB's filesystem abstraction to GDAL's has been the source of a lot of subtle bugs, particularly when it comes to remote/virtual filesystems, and its hard to adapt to the access patterns of different drivers leading to surprising performance regressions.
    • GDAL still has some use of application-wide global state (e.g. filesystem handlers), while everything in DuckDB is scoped to a connection. This has required quite some hacks to work around and I'm honestly not sure if there are more stuff that remains a bit racy.
  • My impression is that 95% of all users use DuckDB's GDAL-based st_read function to import/export only a handful of vector formats. (FGB, SHP, GeoJSON). We already have an experimental native shapefile reader and I don't think FGB or GeoJSON are unfeasible to add as well.
  • While its mostly fine to use C++17 in duckdb extensions (as would be required if we want to keep up with the latest GDAL development) it does add new requirements on our CI and impact portability slightly. Ideally we stick to the same runtime requirements as mainline DuckDB, that is C++11.

For users that still want to use GDAL for some of the more obscure formats, we could create a GDAL extension that is separate from spatial, perhaps using DuckDB's new stable C-extension-API, (which leads to MUCH smaller extension binaries and build times) as well as benefit from future build improvements in GDAL.

The main things that need to happen for this is:

  • Add FGB and GeoJSON import/export for spatial (I have an old FGB implementation prototype somewhere)
  • Finish up DuckDB's own excel extension. A side effect of embedding GDAL is that people now end up depending on spatial for reading/writing excel files... which is a bit messy. I just need to figure out excel timestamp formatting (... famous last words).

One question would be how to deal with PROJ/ the proj database, as that would need most likely still need to be embedded in both spatial and the GDAL extension (and maybe will also be embedded in GDAL itself in the future), and it feels a bit silly to have this big 9mb blob duplicated between all of these binaries. But we are currently thinking of different ways to slim down the PROJ database we ship and/or make it configurable/downloadable at runtime anyway - we're starting to get other DuckDB extensions that want to deal with external resource files at runtime.

@Maxxen
Copy link
Contributor

Maxxen commented Oct 2, 2024

Another approach, that I've had in the back of my mind as one of the long-term goals for DuckDB spatial is to remove GDAL from the spatial extension itself

Maybe it goes without saying, but I don't want this to reflect negatively on GDAL. Embedding GDAL into spatial was a fantastic way to make the spatial extension really useful from the get-go and generated a lot of attention for us early on. It was only really feasible thanks to the amazing work championed in #5830. But as both DuckDB and spatial has matured the last two years it would be awesome if we could not just benefit from GDAL but also be part of what makes GDAL great!

@rouault
Copy link
Member Author

rouault commented Oct 2, 2024

The downside of this is of course that you may end up needing a pretty beefy machine to actually compile it.

I would expect most packagers to actively hate that, even more than the vendored DuckDB build, and just remove the amalgamated DuckDB-within-GDAL, or not enable it, which would defeat the purpose. The packager-friendly solution would be for DuckDB to offer the possibility to link against "system" / external versions of its dependencies, at least for those that are already packaged (there's obvioulsy a fair amount of work to do to be able to link against external libs). Packagers don't like vendored components because that's a nightmare to deal with when there is a security flaw in one of them: they have to identify all software that vendorize it and patch them.

@rouault
Copy link
Member Author

rouault commented Oct 13, 2024

#11003 adds a ADBC driver that can use libduckdb to read Parquet or DuckDB datasets. Before you get too excited, no spatial support for now :-) but should be doable as follow-up. This will be GDAL 3.11 material, as we are too close to the imminent 3.10 feature freeze

@paleolimbot
Copy link
Contributor

and ship a spatial aware DuckDB driver for GDAL, thus "inverting" the dependency.

Just a note that the reverse is also possible (export an ADBC driver init function from GDAL that wraps OGR). I am not sure that would help today because DuckDB I don't believe can federate via ADBC, but in theory it could work quite nicely with one copy of GDAL and one copy of DuckDB (e.g., installed via pip) registered with the Python driver manager (or R, or some application embedding both).

@rouault
Copy link
Member Author

rouault commented Oct 29, 2024

ADBC driver now merged into GDAL master / 3.11.0dev: https://gdal.org/en/latest/drivers/vector/adbc.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement rumination Verbiage about vague ideas
Projects
None yet
Development

No branches or pull requests

8 participants