-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to read_dataframe to use sql statement #70
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @theroggy ! This is a good addition and may make it eaier for pyogrio to stand in as a faster substitute for read_sql
in GeoPandas (though not a drop-in replacement).
Overall the Cython part looks pretty good, and most of my comments were focused on the tests.
It makes me think we should expand our test fixtures to include a tiny geopackage with a few layers, so that we can test interaction with sql
and where
since there are known issues with GPKG behaving differently on invalid SQL. I added #73 for that, no need to address that in this PR.
Given that this is a good addition in its current form, please go ahead and add this to the documentation. In addition to examples, which can largely be drawn from your test cases, we'll want to make sure to address incompatible parameter combinations, mention that you don't specify the geometry column when subsetting columns using the default dialect, and other SQL-specific gotchas (or a link to the GDAL pages for such).
fgb not possible yet, because the test file contains a combination of polygons and multipolygons
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
This reverts commit e583c00.
It works, so can be allowed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your continued work on this @theroggy , this is shaping up nicely!
Looks like you found it possible to combine sql
with some of the other parameters (bbox, etc) that get applied after the SQL query. Do you know if there are any performance implications lurking there that we should state in docstring / docs? E.g., if you have a massive dataset that leverages a spatial index in its database, that using SQL followed by bbox
may not be able to leverage that (not sure, just speculating).
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
I did some performance tests on how the combination of SQL + bbox behaves, and added the conclusions to the documentation. Also did some extra textual changes in general as well + added some more links to other relevant documentation. |
Hmm... it's possible that spatialite is not available on the "Docker GDAL" test images? |
Indeed. We're using the small image, and it looks like spatialite is only enabled for the full image Pretty big difference in base sizes, though on CI this may only impact the transfer / startup time of the container. A couple thoughts:
|
I found a way to check if spatialite is available in sqlite... so the test is now skipped if it is not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great @theroggy , thanks for the updates.
Please also add to the changelog.
A few minor comments, and suggest splitting a test case for better skip-ability.
assert df.iloc[0].iso_a3 == "CAN" | ||
area_canada = df.iloc[0].geometry.area | ||
|
||
# Use spatialite function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest splitting this into a separate test that can be executed or skipped in whole, and then limit that test to GPKG inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now remembered a way to get this working anyway. GDAL is actually always built with spatialite support as far as I know, but for some odd reason spatialite is not activated in sqlite on those images. So as long as you don't run such queries on non-native-sqlite files, the spatialite used will be the one "inside" GDAL. If you run such a query on a GPKG, by default GDAL will just pass the sql through to sqlite... and if spatialite isn't activated there, it doesn't work.
However, you can force gdal not to pass through the SQLITE query by specifying "INDIRECT_SQLITE" as sql_dialect, and then it works, even though performance might be impacted.
I'll change the documentation accordingly, but regarding the tests: do you still like the tests separated in this case? When it is detected that spatialite is not available within sqlite, the won't be skipped anymore but the sqlite_dialect will be changed to INDIRECT_SQLITE.
PS: I wonder if it would be a good idea to request for the gdal-ubuntu-small image that spatialite would be activated within sqlite as well, as it is a bit odd that the sqlite/geopackage driver is available, but only in a "crippled" version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you still like the tests separated in this case
Yes, I think smaller focused test cases are better.
When it is detected that spatialite is not available within sqlite, the won't be skipped anymore but the sqlite_dialect will be changed to INDIRECT_SQLITE
This is a good solution, thank you for the additional investigation.
request for the gdal-ubuntu-small image that spatialite would be activated within sqlite as well
Certainly could be raised as an issue in the GDAL repo; I'm not sure of the implications for image size / complexity / license (could well be the latter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I kept it in a separate test...
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the updates here @theroggy
A couple very minor comments, then this is good to go.
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
…grio into Feature-execute_sql
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @theroggy 🚀
For the issue about spatialite not available in image osgeo/gdal:ubuntu-small-latest, the following issue was opened: |
I like to use SQL... but I don't know of an elegant/efficient way to get the result from an SQL statement on a geopackage to a GeoDataFrame at the moment. As a workaround, I execute the sql statement with gdal.VectorTranslate() (~python binding to ogr2ogr) to a temp file and then I read the result with geopandas.read_file(). It works, but it feels slightly suboptimal :-).
Because GDAL supports executing sql statements it seemed like an interesting and possibly realistic option to built in support for it in pyogrio.
To explore the idea I had a better look at the pyogrio code... and it seemed easier than I anticipated, so I immediately went for a POC. However, I don't have any experience in/knowledge about writing cython... so it's pure copy/paste from the code around it, so I hope it doesn't look too ugly :-(.
I added a TODO in ogr_read in a spot where I think it would be safer to use a try/finally to be sure everything gets cleaned up... but because it isn't there now I wonder if there is a good reason for doing it the way it is now?
If you can spare the time it would be great if you could have a look...
Remark: as it is a POC, I didn't make any changes to inline documentation yet.