Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: support reading from in-memory buffers #25

Merged
merged 20 commits into from
Apr 2, 2022

Conversation

jorisvandenbossche
Copy link
Member

Draft experiment for the reading side of #22

pyogrio/_ogr.pxd Outdated Show resolved Hide resolved
pyogrio/_ogr.pyx Show resolved Hide resolved
pyogrio/raw.py Show resolved Hide resolved
pyogrio/raw.py Show resolved Hide resolved
@@ -84,19 +85,36 @@ def read(
"geometry": "<geometry type>"
}
"""
from_buffer = False
if isinstance(path_or_buffer, bytes):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I am checking for bytes vs strings to determine whether it's a path or in-memory bytes. I don't know if that is robust enough? Or do we want a separate read_buffer or so?
Alternatively (or in addition), we could also support "file-like" objects (objects that have a read() method that will return the bytes). That's eg what fiona does in their open() method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think file-like objects that have a read() method would provide a more general solution? For example, the is_zipped below check is limited to a single zip format, whereas the file-like pattern would let user construct and pass in a ZipFile or GzipFile class instance. I'm not familiar with fsspec, but it seems like the file-like pattern would support that as well.

pyogrio/tests/test_raw_io.py Show resolved Hide resolved
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review January 27, 2022 11:21
Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! A few notes.

I think we should also support io.BytesIO objects here. Your snippet from #22 currently doesn't work and you need to call read() manually before passing it to Pyogrio.

file_bytes = io.BytesIO(open("tst.gpkg", "rb").read())
pyogrio.read_dataframe(file_bytes)

ERROR 4: <_io.BytesIO object at 0x1679997c0>: No such file or directory
---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
~/Git/pyogrio/pyogrio/_io.pyx in pyogrio._io.ogr_open()
    130     try:
--> 131         ogr_dataset = exc_wrap_pointer(
    132             GDALOpenEx(path_c, flags, <const char *const *>ogr_drivers, <const char *const *>open_opts, NULL)

~/Git/pyogrio/pyogrio/_err.pyx in pyogrio._err.exc_wrap_pointer()
    175         if exc:
--> 176             raise exc
    177         else:

CPLE_OpenFailedError: <_io.BytesIO object at 0x1679997c0>: No such file or directory

During handling of the above exception, another exception occurred:

DriverError                               Traceback (most recent call last)
/var/folders/2f/fhks6w_d0k556plcv3rfmshw0000gn/T/ipykernel_86671/3402334228.py in <module>
      1 file_bytes = io.BytesIO(open(momepy.datasets.get_path("bubenec"), "rb").read())
----> 2 pyogrio.read_dataframe(file_bytes)

~/Git/pyogrio/pyogrio/geopandas.py in read_dataframe(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids)
     88                 raise ValueError(f"'{path}' does not exist")
     89 
---> 90     meta, geometry, field_data = read(
     91         path_or_buffer,
     92         layer=layer,

~/Git/pyogrio/pyogrio/raw.py in read(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids)
    108 
    109     try:
--> 110         result = ogr_read(
    111             path,
    112             layer=layer,

~/Git/pyogrio/pyogrio/_io.pyx in pyogrio._io.ogr_read()
    717         fids = np.asarray(fids, dtype=np.intc)
    718 
--> 719     ogr_dataset = ogr_open(path_c, 0, kwargs)
    720     ogr_layer = get_ogr_layer(ogr_dataset, layer)
    721 

~/Git/pyogrio/pyogrio/_io.pyx in pyogrio._io.ogr_open()
    138 
    139     except CPLE_BaseError as exc:
--> 140         raise DriverError(str(exc))
    141 
    142     finally:

DriverError: <_io.BytesIO object at 0x1679997c0>: No such file or directory

While this works.

file_bytes = io.BytesIO(open(momepy.datasets.get_path("bubenec"), "rb").read())
pyogrio.read_dataframe(file_bytes.read())

Also, whenever I read from bytes, I get this kind of a warning from GDAL. I that okay? shall we try to silence/resolve it?

Warning 1: File /vsimem/a325e080edbc40e1b6e3404c586fe6b7 has GPKG application_id, but non conformant file extension

pyogrio/geopandas.py Show resolved Hide resolved
pyogrio/tests/test_raw_io.py Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member Author

I think we should also support io.BytesIO objects here. Your snippet from #22 currently doesn't work and you need to call read() manually before passing it to Pyogrio.

Yes, I also mentioned it above at #25 (comment). That indeed seems logical to add as well (this is just not strictly needed for geopandas, because we already call .read() on file-like objects before passing that to fiona/pyogrio)

Also, whenever I read from bytes, I get this kind of a warning from GDAL. I that okay? shall we try to silence/resolve it?

I just noticed that with fiona as well. In this case, that's indeed something we should try to silence (would need to look into that how this part works)

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @jorisvandenbossche !

Overall this looks good, and I think the only major decision points are around whether the first parameter to read should be position-only, and whether or not to use file-like inputs in addition to or in place of raw byte buffers. My hunch is that in place of would be reasonable, since user can always wrap a raw buffer in io.BytesIO

pyogrio/geopandas.py Show resolved Hide resolved
if not "://" in path:
if not "/vsi" in path.lower() and not os.path.exists(path):
raise ValueError(f"'{path}' does not exist")
if isinstance(path_or_buffer, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will want a similar check for pathlib.Path objects too (which we achieved previously by forcing everything to string).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check about the path existing (if a local file), is there a reason this lives specifically in geopandas.py, or we could also move it to read_raw, and here just pass through the path_or_buffer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this could move to read_raw; no reason it specifically lives here.

@@ -84,19 +85,36 @@ def read(
"geometry": "<geometry type>"
}
"""
from_buffer = False
if isinstance(path_or_buffer, bytes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think file-like objects that have a read() method would provide a more general solution? For example, the is_zipped below check is limited to a single zip format, whereas the file-like pattern would let user construct and pass in a ZipFile or GzipFile class instance. I'm not familiar with fsspec, but it seems like the file-like pattern would support that as well.

pyogrio/_ogr.pxd Outdated Show resolved Hide resolved
pyogrio/raw.py Show resolved Hide resolved
assert_equal_result((meta, geometry, field_data), (meta2, geometry2, field_data2))


filename = os.path.join(str(tmpdir), "test.geojson")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though there is some shared code, I'd suggest simplifying this a little bit and using pytest.mark.parametrize with varying driver

@jorisvandenbossche
Copy link
Member Author

For the "raw buffer vs file-like object with read()" support, there is also the idea at #22 (comment) for having a way to not call read() up-front for such file like objects. For remote files, that could be beneficial (eg if you now have an open file backed by fsspec).

@brendan-ward brendan-ward added this to the Version 0.4.0 milestone Jan 28, 2022
@martinfleis
Copy link
Member

Can we try to catch those GDAL warnings?

Warning 1: File [/vsimem/52fd6590f4fb4d77b5e5a3d84f748abc]() has GPKG application_id, but non conformant file extension

@jorisvandenbossche
Copy link
Member Author

Ah, yes, so I started looking into that a while ago, and it's not that straightforward to "properly" get rid of them (i.e. by setting an appropriate file path extension).
In theory we could get the extension from a drive name (through the GDAL API to get driver metadata), but in that case the user would still need to provide a driver name just for setting the file path extension (while for the actual reading this gets inferred).

Note that fiona has the same issue (so from geopandas' point of view, it wouldn't be a "regression")

The warning it self is not an actual python warning that I can catch and silence, but something that gets printed by GDAL?
We currently check for errors at

pyogrio/pyogrio/_err.pyx

Lines 135 to 165 in d0b202a

cdef inline object exc_check():
"""Checks GDAL error stack for fatal or non-fatal errors
Returns
-------
An Exception, SystemExit, or None
"""
cdef const char *msg_c = NULL
err_type = CPLGetLastErrorType()
err_no = CPLGetLastErrorNo()
err_msg = CPLGetLastErrorMsg()
if err_msg == NULL:
msg = "No error message."
else:
# Reformat messages.
msg_b = err_msg
msg = msg_b.decode('utf-8')
msg = msg.replace("`", "'")
msg = msg.replace("\n", " ")
if err_type == 3:
CPLErrorReset()
return exception_map.get(
err_no, CPLE_BaseError)(err_type, err_no, msg)
if err_type == 4:
return SystemExit("Fatal error: {0}".format((err_type, err_no, msg)))
else:
return

that could maybe be expanded to raise python warnings (or suppress them) as well (err_type 1 is Debug and 2 is Warning, while we currently only handle 3 (Failure) and 4 (Fatal)). Although we also have a logging set up, so that should maybe be used for this instead of python warnings? (@brendan-ward)

@brendan-ward
Copy link
Member

I'm not sure what the best path forward on warnings vs logging is here, but I don't want to hold this PR up. Perhaps we can push dealing with that to another issue, so that this can get merged (after resolving conflicts)?

@martinfleis
Copy link
Member

Perhaps we can push dealing with that to another issue, so that this can get merged

That is okay with me but I'd like to resolve that before then next release if possible.

pyogrio/_ogr.pyx Outdated Show resolved Hide resolved
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing to work on this @jorisvandenbossche

A few suggested changes, and then this looks ready.

pyogrio/_ogr.pyx Outdated Show resolved Hide resolved
pyogrio/_ogr.pyx Outdated

vsi_filename = '/vsimem/{}'.format(uuid4().hex + ext)

vsi_handle = VSIFileFromMemBuffer(vsi_filename.encode("utf8"), <unsigned char *>bytesbuf, len(bytesbuf), 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere we usually handle Python => C strings in multiple steps, I thought in part because not doing so triggers a compilation error. And we've standarded on using "UTF-8" to refer to unicode in Cython.

So this would be

char *filename_c = NULL
filename_b = vsi_filename.encode("UTF-8")
filename_c = filename_b
vsi_handle = VSIFileFromMemBuffer(filename_c, <unsigned char *>bytesbuf, len(bytesbuf), 0)

Though I'm not sure that is strictly necessary. (same for remove_virtual_filename too)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem to be necessary, since it is compiling here?

(but already changed utf8 to UTF-8)

pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Show resolved Hide resolved
pyogrio/raw.py Show resolved Hide resolved
Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


assert np.array_equal(meta1["fields"], meta2["fields"])
assert np.array_equal(index1, index2)
# assert np.array_equal(geometry1, geometry2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: remove commented line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned it into a small explanation why we are using pygeos here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants