DuckDB integration? #10887
Including @krlmlr, maintainer of the duckdb R package 🙏
This would be awesome! I suspect it might be hard for OGR to get as fast as DuckDB's parquet stuff, as DuckDB is really good at optimizing to take advantage of all the cores, and they've spent a ton of time making parquet parsing work really, really well. I was thinking of a much more lightweight integration, which is just supporting DuckDB with a vector driver. I'm usually using DuckDB to go in and out of Parquet, but sometimes it's nice to just leave it as a duckdb database, and it'd be nice to just throw that directly into QGIS. Like don't worry about getting the arrow stream with python to start - just connect to it like you do directly to other databases / formats. Not sure if that helps with any derisking of the core stuff, but it's definitely something that would be useful to me.
@nyalldawson because excited folks are wanting QGIS to Quack 😃
Thank you @rouault for the ping. Indeed, we, Oslandia, have been developing a QGIS Python plugin (QDuckdb) for one of your clients. It should work well, and it offers a straightforward way to load and use a DuckDB database in QGIS.

However, we have reached some limits from a performance perspective with the Python QGIS overhead. We have been able to solve some of them, but we are limited by the architecture of a QGIS provider, which relies on an iterator and returns the rows of a table one by one. This is really costly on the Python side. I think it would be very interesting to add support for the Arrow Stream interface (or a similar one if this one is really deprecated), which would allow taking full advantage of columnar batch access.

Before making the Python plugin, we asked ourselves if we could directly write a C++ provider in QGIS core. However, we would have faced some difficulties, some of them already mentioned by @rouault:
As a side note, based on the few interactions we had with the DuckDB team, I think that an OGR driver would be a great addition, and we, Oslandia, would be very much interested in helping move it forward. cc @florentfgrs (Florent Fougères)
I'm out drinking beer. I'll respond tomorrow, but I have a lot of thoughts about this!
Obviously! I jumped right away to the stretch goal of having a DuckDB SQL dialect, but that's clearly the ultimate step of the integration and possibly not easily reachable (or just not delivering stellar performance, depending on its requirements on the input source). The traditional OGR driver certainly has to come first, reading both DuckDB native databases and, as the cherry on the cake, Parquet files (or CSV, etc.).

For that, my mention of using the ArrowStream interface is actually a way of having a very small code base, because I've already coded the generic bidirectional OGRFeature <--> ArrowArray converter in past GDAL versions, and I've leveraged it in my toy/demo OGRPyDuckDB prototype. So the questions are whether the ArrowStream interface of DuckDB is available through C or C++ (the use of the Python duckdb package was because it provided a "convenient" way of getting an ArrowStream, ahem, pending embedding Python in C++, for which there's precedent in GDAL; anyway, that's definitely not the target architecture I'm thinking about), and whether getting data through the ArrowStream interface leads to optimal or near-optimal performance.

My naive thought is that, DuckDB being columnar by nature, this should in theory be a cheap adaptation layer, but reality might be more complex if DuckDB uses optimized representations (which I believe they do); only experts on DuckDB core could confirm or refute that. Otherwise we might have to use its specific API to get the best performance, sacrificing a bit of the convenience of the implementation.
Enjoy it and your weekend. No emergency. If I've labelled this ticket as "rumination", this is a hint that, on my side at least, there's no identified funding vehicle to make that happen.
As a minor contributor to QDuckDB, but really impressed by the DuckDB project, I'm excited to see its support coming in GDAL!
Exactly what I would expect!
Glad to see interest in integrating DuckDB into GDAL. Can't wait to see what happens next!
Sooo... argh, where to begin? I guess I'll just start by responding to some of the points brought up:
Potentially. We're experimenting with "parser extensions", and there are already some extensions that do add new syntax, like PRQL and duckdb-pgq. However, these integrate very deeply with DuckDB's unstable C++ internals, so I don't know how sustainable doing something similar would be in the long term.
Yes, iirc we at one point attempted to ship a pre-built static binary of the duckdb library to VCPKG, but it was rejected as they did not like us vendoring our own versions of third-party dependencies. I'm not sure if e.g. Debian has similar qualms, but I would expect so.

One thing that could be interesting for GDAL is to use DuckDB's amalgamation build, which gives you a single huge C++ file + header of the entire duckdb source code (sqlite style). This is how we've dealt with distributing DuckDB in a few of the client libraries (e.g. swift, golang). We also include the parquet extension in the amalgamation build by default, and it should be possible to embed others as well with some modification to the generation script. The downside of this is of course that you may end up needing a pretty beefy machine to actually compile it.
We've made a lot of changes to the way result sets are handled in the C API, and the plan is to replace the current Arrow API with something better in the future (that is also more performant / doesn't have to materialize everything). We ended up marking things as deprecated even though there are no replacements yet, with the idea that it is better to be early with this sort of stuff, but you're not alone in finding that confusing haha.
I think this is probably the biggest challenge regarding integrating DuckDB in GDAL. There's a couple of different angles to this. One idea is to consider integrating DuckDB only for its parquet capabilities. But even if we ignore the spatial extension, there is an implicit dependency between the two when reading/writing GeoParquet: basically, the parquet extension will look through the DuckDB catalog for the geometry type registered by the spatial extension.

Another approach, which I've had in the back of my mind as one of the long-term goals for DuckDB spatial, is to remove GDAL from the spatial extension itself, and ship a spatially aware DuckDB driver for GDAL, thus "inverting" the dependency. This is a pretty radical move (perhaps equally radical as embedding GDAL in the first place), but I do think it makes a lot of sense.
For users that still want to use GDAL for some of the more obscure formats, we could create a GDAL extension that is separate from spatial, perhaps using DuckDB's new stable C extension API (which leads to MUCH smaller extension binaries and build times), as well as benefiting from future build improvements in GDAL. The main things that need to happen for this are:
One question would be how to deal with PROJ / the proj database, as that would most likely still need to be embedded in both spatial and the GDAL extension (and maybe will also be embedded in GDAL itself in the future), and it feels a bit silly to have this big 9 MB blob duplicated between all of these binaries. But we are currently thinking of different ways to slim down the PROJ database we ship and/or make it configurable/downloadable at runtime anyway - we're starting to get other DuckDB extensions that want to deal with external resource files at runtime.
Maybe it goes without saying, but I don't want this to reflect negatively on GDAL. Embedding GDAL into spatial was a fantastic way to make the spatial extension really useful from the get-go, and it generated a lot of attention for us early on. It was only really feasible thanks to the amazing work championed in #5830. But as both DuckDB and spatial have matured over the last two years, it would be awesome if we could not just benefit from GDAL but also be part of what makes GDAL great!
I would expect most packagers to actively hate that, even more than a vendored DuckDB build, and to just remove the amalgamated DuckDB-within-GDAL, or not enable it, which would defeat the purpose. The packager-friendly solution would be for DuckDB to offer the possibility of linking against "system" / external versions of its dependencies, at least for those that are already packaged (there's obviously a fair amount of work to do to be able to link against external libs). Packagers don't like vendored components because they are a nightmare to deal with when there is a security flaw in one of them: they have to identify all software that vendors it and patch them.
#11003 adds an ADBC driver that can use libduckdb to read Parquet or DuckDB datasets. Before you get too excited: no spatial support for now :-) but it should be doable as a follow-up. This will be GDAL 3.11 material, as we are too close to the imminent 3.10 feature freeze.
Just a note that the reverse is also possible (export an ADBC driver init function from GDAL that wraps OGR). I am not sure that would help today, because I don't believe DuckDB can federate via ADBC, but in theory it could work quite nicely with one copy of GDAL and one copy of DuckDB (e.g., installed via pip) registered with the Python driver manager (or R, or some application embedding both).
ADBC driver now merged into GDAL master / 3.11.0dev: https://gdal.org/en/latest/drivers/vector/adbc.html
I've been fairly impressed by my testing of DuckDB. The speed of their SQL engine is really impressive. And they have a very impressive Parquet reader too (or at least the interaction with it through SQL shines). I don't know if I did something wrong with the Arrow Parquet reader (I don't think so ..?!?), but it is obvious that querying Parquet through DuckDB is much faster than through OGR SQL + Arrow Parquet, especially (I believe) when the result set is small compared to the table size, despite my attempts at translating OGR SQL to Arrow filtering capabilities; or maybe that would require even further effort to fully leverage its capabilities. I dunno.
What could be very cool, but I don't know if duckdb allows it through the API, would be to have a "DuckDB SQL" dialect, similar to our current "SQLite SQL" dialect, where we could for example expose an OGR layer as an ArrowStream, have DuckDB process the SQL statement, and get back the result (obviously the DuckDB spatial extension is able to do that, probably using some internal API available to extensions).
That said, it's not all roses:
I've used the __arrow_c_stream__ interface to retrieve an ArrowStream and use it on the GDAL C++ side. It's there: rouault@422220b. It works ... as a prototype (note: I didn't do anything spatial in it) ... but my findings are that there might be an inefficiency in the way DuckDB maps its "native model" to Arrow: it seems to fully "materialize" the whole result set and lacks an easy way to specify the "page size" (my impression is that this happens even with its huge default paging), which can be an issue when the result size exceeds available RAM. Apologies if my conclusions are wrong (lots of new source code to figure out...). Whereas using duckdb-the-binary, I've not observed that behavior, and when playing with ulimit -v, I could easily get result sets largely exceeding the allocated virtual memory. So in the current state of things, to get nice speed, one should probably implement the native DuckDB result set API, which makes things less appetizing than using the ArrowStream one.

I don't have further concrete plans for now. Just a memory dump that might raise interest or dismay :-)
Perhaps this is not a good idea after all, from an "overall ecosystem point of view", although having speedy Parquet SQL in OGR itself is very tempting ...
CC FYI @jorisvandenbossche @Maxxen @dbaston