Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: refactor handling of reading from in-memory dataset #407

Merged
merged 3 commits into from
May 4, 2024

Conversation

brendan-ward
Copy link
Member

Resolves #401

This moves all handling of creation / destruction of the in-memory dataset used for read operations to the Cython tier, and now passes in either a string or bytes to read_info(), list_layers(), read_bounds(), read(), read_arrow() functions. I think this also sets us to for later refactors where we may pass file-like / filesystem-like objects directly down to Cython in order to use the GDAL virtual filesystem plugin.

To keep things organized, I split VSI related Cython functions into the new _vsi.pyx / _vsi.pxd files.

Because the bytes buffer is passed as a parameter, it remains in scope during the Cython function and we don't need to hold an extra handle on it like before. (the handle was required to prevent Python from deallocating it while GDAL still may use it to represent the in-memory dataset).

This expands the tests to verify that core functions still work correctly when passed bytes or file-like objects. To avoid the various GPKG related errors when working with this type in memory, I instead added new test fixtures to create a GeoJSON file from the first 3 records of the naturalearth_lowres dataset.

This also adds a specific check that the incoming path does not already contain /vsimem/, since we have to handle that internally and cannot allow it to be passed in.

@brendan-ward brendan-ward added this to the 0.8.0 milestone May 3, 2024
@brendan-ward brendan-ward added the enhancement New feature or request label May 3, 2024
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks a lot

pyogrio/_vsi.pyx Outdated Show resolved Hide resolved
pyogrio/_vsi.pyx Outdated Show resolved Hide resolved
pyogrio/tests/conftest.py Outdated Show resolved Hide resolved
pyogrio/tests/test_arrow.py Outdated Show resolved Hide resolved
@@ -817,7 +847,7 @@ def test_write_memory(naturalearth_lowres, driver):

buffer = BytesIO()
write_arrow(
table.slice(0, 1),
table,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I did this was to avoid the filterwarning you added .. ;) Ideally we would add a test dataset without mixed geometries that we can use for most tests .. (or just edit the naturalearth fixture to give all MultiPolygons), but that's for another issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched the places where we were doing .slice(0, 1) to only read 1 feature from naturalearth, and sidestep the issue (and also still allow comparison of first table to the written & reread table).

@brendan-ward brendan-ward merged commit 893f955 into main May 4, 2024
47 checks passed
@brendan-ward brendan-ward deleted the issue401 branch May 4, 2024 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: read_arrow does cannot read from bytes / buffer / file-like
2 participants