Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to read_dataframe to use sql statement #70

Merged
merged 36 commits into from
Apr 25, 2022

Conversation

theroggy
Copy link
Member

@theroggy theroggy commented Apr 14, 2022

I like to use SQL... but I don't know of an elegant/efficient way to get the result from an SQL statement on a geopackage to a GeoDataFrame at the moment. As a workaround, I execute the sql statement with gdal.VectorTranslate() (~python binding to ogr2ogr) to a temp file and then I read the result with geopandas.read_file(). It works, but it feels slightly suboptimal :-).

Because GDAL supports executing sql statements it seemed like an interesting and possibly realistic option to built in support for it in pyogrio.

To explore the idea I had a better look at the pyogrio code... and it seemed easier than I anticipated, so I immediately went for a POC. However, I don't have any experience in/knowledge about writing cython... so it's pure copy/paste from the code around it, so I hope it doesn't look too ugly :-(.
I added a TODO in ogr_read in a spot where I think it would be safer to use a try/finally to be sure everything gets cleaned up... but because it isn't there now I wonder if there is a good reason for doing it the way it is now?

If you can spare the time it would be great if you could have a look...

Remark: as it is a POC, I didn't make any changes to inline documentation yet.

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @theroggy ! This is a good addition and may make it eaier for pyogrio to stand in as a faster substitute for read_sql in GeoPandas (though not a drop-in replacement).

Overall the Cython part looks pretty good, and most of my comments were focused on the tests.

It makes me think we should expand our test fixtures to include a tiny geopackage with a few layers, so that we can test interaction with sql and where since there are known issues with GPKG behaving differently on invalid SQL. I added #73 for that, no need to address that in this PR.

Given that this is a good addition in its current form, please go ahead and add this to the documentation. In addition to examples, which can largely be drawn from your test cases, we'll want to make sure to address incompatible parameter combinations, mention that you don't specify the geometry column when subsetting columns using the default dialect, and other SQL-specific gotchas (or a link to the GDAL pages for such).

pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
theroggy and others added 6 commits April 17, 2022 23:11
fgb not possible yet, because the test file contains a combination of polygons and multipolygons
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
@theroggy theroggy marked this pull request as draft April 18, 2022 13:27
@theroggy theroggy marked this pull request as ready for review April 20, 2022 16:26
Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your continued work on this @theroggy , this is shaping up nicely!

Looks like you found it possible to combine sql with some of the other parameters (bbox, etc) that get applied after the SQL query. Do you know if there are any performance implications lurking there that we should state in docstring / docs? E.g., if you have a massive dataset that leverages a spatial index in its database, that using SQL followed by bbox may not be able to leverage that (not sure, just speculating).

docs/source/introduction.md Outdated Show resolved Hide resolved
docs/source/introduction.md Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Show resolved Hide resolved
pyogrio/_io.pyx Show resolved Hide resolved
docs/source/introduction.md Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
theroggy and others added 3 commits April 21, 2022 14:47
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
@theroggy theroggy marked this pull request as draft April 21, 2022 17:13
@theroggy theroggy marked this pull request as ready for review April 21, 2022 20:19
@theroggy
Copy link
Member Author

Thanks for your continued work on this @theroggy , this is shaping up nicely!

Looks like you found it possible to combine sql with some of the other parameters (bbox, etc) that get applied after the SQL query. Do you know if there are any performance implications lurking there that we should state in docstring / docs? E.g., if you have a massive dataset that leverages a spatial index in its database, that using SQL followed by bbox may not be able to leverage that (not sure, just speculating).

I did some performance tests on how the combination of SQL + bbox behaves, and added the conclusions to the documentation. Also did some extra textual changes in general as well + added some more links to other relevant documentation.

@theroggy
Copy link
Member Author

Hmm... it's possible that spatialite is not available on the "Docker GDAL" test images?

@brendan-ward
Copy link
Member

spatialite is not available on the "Docker GDAL" test images

Indeed. We're using the small image, and it looks like spatialite is only enabled for the full image

Pretty big difference in base sizes, though on CI this may only impact the transfer / startup time of the container.

A couple thoughts:

  • we could switch to the full container here but would want to see if that negatively impacts test execution time
  • we could add the full container for latest released GDAL, and then toggle that test based on something (not sure: environment variable? feature detection? something else?)

@theroggy theroggy marked this pull request as draft April 22, 2022 22:31
@theroggy
Copy link
Member Author

I found a way to check if spatialite is available in sqlite... so the test is now skipped if it is not.

@theroggy theroggy marked this pull request as ready for review April 22, 2022 22:40
Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great @theroggy , thanks for the updates.

Please also add to the changelog.

A few minor comments, and suggest splitting a test case for better skip-ability.

docs/source/introduction.md Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
assert df.iloc[0].iso_a3 == "CAN"
area_canada = df.iloc[0].geometry.area

# Use spatialite function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest splitting this into a separate test that can be executed or skipped in whole, and then limit that test to GPKG inputs.

Copy link
Member Author

@theroggy theroggy Apr 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now remembered a way to get this working anyway. GDAL is actually always built with spatialite support as far as I know, but for some odd reason spatialite is not activated in sqlite on those images. So as long as you don't run such queries on non-native-sqlite files, the spatialite used will be the one "inside" GDAL. If you run such a query on a GPKG, by default GDAL will just pass the sql through to sqlite... and if spatialite isn't activated there, it doesn't work.

However, you can force gdal not to pass through the SQLITE query by specifying "INDIRECT_SQLITE" as sql_dialect, and then it works, even though performance might be impacted.

I'll change the documentation accordingly, but regarding the tests: do you still like the tests separated in this case? When it is detected that spatialite is not available within sqlite, the won't be skipped anymore but the sqlite_dialect will be changed to INDIRECT_SQLITE.

PS: I wonder if it would be a good idea to request for the gdal-ubuntu-small image that spatialite would be activated within sqlite as well, as it is a bit odd that the sqlite/geopackage driver is available, but only in a "crippled" version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you still like the tests separated in this case

Yes, I think smaller focused test cases are better.

When it is detected that spatialite is not available within sqlite, the won't be skipped anymore but the sqlite_dialect will be changed to INDIRECT_SQLITE

This is a good solution, thank you for the additional investigation.

request for the gdal-ubuntu-small image that spatialite would be activated within sqlite as well

Certainly could be raised as an issue in the GDAL repo; I'm not sure of the implications for image size / complexity / license (could well be the latter).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I kept it in a separate test...

theroggy and others added 6 commits April 23, 2022 10:27
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
@theroggy theroggy marked this pull request as draft April 23, 2022 09:24
@theroggy theroggy marked this pull request as ready for review April 25, 2022 16:38
Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the updates here @theroggy

A couple very minor comments, then this is good to go.

CHANGES.md Outdated Show resolved Hide resolved
docs/source/introduction.md Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Show resolved Hide resolved
Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @theroggy 🚀

@brendan-ward brendan-ward merged commit cd5b464 into geopandas:main Apr 25, 2022
@theroggy theroggy deleted the Feature-execute_sql branch April 25, 2022 19:51
@theroggy
Copy link
Member Author

For the issue about spatialite not available in image osgeo/gdal:ubuntu-small-latest, the following issue was opened:
OSGeo/gdal#5657

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants