-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: support reading from in-memory buffers #25
ENH: support reading from in-memory buffers #25
Conversation
@@ -84,19 +85,36 @@ def read( | |||
"geometry": "<geometry type>" | |||
} | |||
""" | |||
from_buffer = False | |||
if isinstance(path_or_buffer, bytes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I am checking for bytes vs strings to determine whether it's a path or in-memory bytes. I don't know if that is robust enough? Or do we want a separate read_buffer
or so?
Alternatively (or in addition), we could also support "file-like" objects (objects that have a read()
method that will return the bytes). That's eg what fiona does in their open()
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think file-like objects that have a read()
method would provide a more general solution? For example, the is_zipped
below check is limited to a single zip format, whereas the file-like pattern would let user construct and pass in a ZipFile
or GzipFile
class instance. I'm not familiar with fsspec, but it seems like the file-like pattern would support that as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! A few notes.
I think we should also support io.BytesIO
objects here. Your snippet from #22 currently doesn't work and you need to call read()
manually before passing it to Pyogrio.
file_bytes = io.BytesIO(open("tst.gpkg", "rb").read())
pyogrio.read_dataframe(file_bytes)
ERROR 4: <_io.BytesIO object at 0x1679997c0>: No such file or directory
---------------------------------------------------------------------------
CPLE_OpenFailedError Traceback (most recent call last)
~/Git/pyogrio/pyogrio/_io.pyx in pyogrio._io.ogr_open()
130 try:
--> 131 ogr_dataset = exc_wrap_pointer(
132 GDALOpenEx(path_c, flags, <const char *const *>ogr_drivers, <const char *const *>open_opts, NULL)
~/Git/pyogrio/pyogrio/_err.pyx in pyogrio._err.exc_wrap_pointer()
175 if exc:
--> 176 raise exc
177 else:
CPLE_OpenFailedError: <_io.BytesIO object at 0x1679997c0>: No such file or directory
During handling of the above exception, another exception occurred:
DriverError Traceback (most recent call last)
/var/folders/2f/fhks6w_d0k556plcv3rfmshw0000gn/T/ipykernel_86671/3402334228.py in <module>
1 file_bytes = io.BytesIO(open(momepy.datasets.get_path("bubenec"), "rb").read())
----> 2 pyogrio.read_dataframe(file_bytes)
~/Git/pyogrio/pyogrio/geopandas.py in read_dataframe(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids)
88 raise ValueError(f"'{path}' does not exist")
89
---> 90 meta, geometry, field_data = read(
91 path_or_buffer,
92 layer=layer,
~/Git/pyogrio/pyogrio/raw.py in read(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids)
108
109 try:
--> 110 result = ogr_read(
111 path,
112 layer=layer,
~/Git/pyogrio/pyogrio/_io.pyx in pyogrio._io.ogr_read()
717 fids = np.asarray(fids, dtype=np.intc)
718
--> 719 ogr_dataset = ogr_open(path_c, 0, kwargs)
720 ogr_layer = get_ogr_layer(ogr_dataset, layer)
721
~/Git/pyogrio/pyogrio/_io.pyx in pyogrio._io.ogr_open()
138
139 except CPLE_BaseError as exc:
--> 140 raise DriverError(str(exc))
141
142 finally:
DriverError: <_io.BytesIO object at 0x1679997c0>: No such file or directory
While this works.
file_bytes = io.BytesIO(open(momepy.datasets.get_path("bubenec"), "rb").read())
pyogrio.read_dataframe(file_bytes.read())
Also, whenever I read from bytes, I get this kind of a warning from GDAL. I that okay? shall we try to silence/resolve it?
Warning 1: File /vsimem/a325e080edbc40e1b6e3404c586fe6b7 has GPKG application_id, but non conformant file extension
Yes, I also mentioned it above at #25 (comment). That indeed seems logical to add as well (this is just not strictly needed for geopandas, because we already call
I just noticed that with fiona as well. In this case, that's indeed something we should try to silence (would need to look into that how this part works) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @jorisvandenbossche !
Overall this looks good, and I think the only major decision points are around whether the first parameter to read
should be position-only, and whether or not to use file-like inputs in addition to or in place of raw byte buffers. My hunch is that in place of would be reasonable, since user can always wrap a raw buffer in io.BytesIO
pyogrio/geopandas.py
Outdated
if not "://" in path: | ||
if not "/vsi" in path.lower() and not os.path.exists(path): | ||
raise ValueError(f"'{path}' does not exist") | ||
if isinstance(path_or_buffer, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will want a similar check for pathlib.Path
objects too (which we achieved previously by forcing everything to string).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check about the path existing (if a local file), is there a reason this lives specifically in geopandas.py
, or we could also move it to read_raw
, and here just pass through the path_or_buffer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this could move to read_raw
; no reason it specifically lives here.
@@ -84,19 +85,36 @@ def read( | |||
"geometry": "<geometry type>" | |||
} | |||
""" | |||
from_buffer = False | |||
if isinstance(path_or_buffer, bytes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think file-like objects that have a read()
method would provide a more general solution? For example, the is_zipped
below check is limited to a single zip format, whereas the file-like pattern would let user construct and pass in a ZipFile
or GzipFile
class instance. I'm not familiar with fsspec, but it seems like the file-like pattern would support that as well.
pyogrio/tests/test_raw_io.py
Outdated
assert_equal_result((meta, geometry, field_data), (meta2, geometry2, field_data2)) | ||
|
||
|
||
filename = os.path.join(str(tmpdir), "test.geojson") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though there is some shared code, I'd suggest simplifying this a little bit and using pytest.mark.parametrize
with varying driver
For the "raw buffer vs file-like object with read()" support, there is also the idea at #22 (comment) for having a way to not call |
Can we try to catch those GDAL warnings?
|
Ah, yes, so I started looking into that a while ago, and it's not that straightforward to "properly" get rid of them (i.e. by setting an appropriate file path extension). Note that fiona has the same issue (so from geopandas' point of view, it wouldn't be a "regression") The warning it self is not an actual python warning that I can catch and silence, but something that gets printed by GDAL? Lines 135 to 165 in d0b202a
that could maybe be expanded to raise python warnings (or suppress them) as well ( |
I'm not sure what the best path forward on warnings vs logging is here, but I don't want to hold this PR up. Perhaps we can push dealing with that to another issue, so that this can get merged (after resolving conflicts)? |
That is okay with me but I'd like to resolve that before then next release if possible. |
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for continuing to work on this @jorisvandenbossche
A few suggested changes, and then this looks ready.
pyogrio/_ogr.pyx
Outdated
|
||
vsi_filename = '/vsimem/{}'.format(uuid4().hex + ext) | ||
|
||
vsi_handle = VSIFileFromMemBuffer(vsi_filename.encode("utf8"), <unsigned char *>bytesbuf, len(bytesbuf), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elsewhere we usually handle Python => C strings in multiple steps, I thought in part because not doing so triggers a compilation error. And we've standarded on using "UTF-8"
to refer to unicode in Cython.
So this would be
char *filename_c = NULL
filename_b = vsi_filename.encode("UTF-8")
filename_c = filename_b
vsi_handle = VSIFileFromMemBuffer(filename_c, <unsigned char *>bytesbuf, len(bytesbuf), 0)
Though I'm not sure that is strictly necessary. (same for remove_virtual_filename
too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem to be necessary, since it is compiling here?
(but already changed utf8 to UTF-8)
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jorisvandenbossche !
pyogrio/tests/test_raw_io.py
Outdated
|
||
assert np.array_equal(meta1["fields"], meta2["fields"]) | ||
assert np.array_equal(index1, index2) | ||
# assert np.array_equal(geometry1, geometry2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit: remove commented line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turned it into a small explanation why we are using pygeos here
Draft experiment for the reading side of #22