ENH: support reading from in-memory buffers #25

jorisvandenbossche · 2021-11-12T10:56:56Z

Draft experiment for the reading side of #22

pyogrio/_ogr.pxd

pyogrio/_ogr.pyx

pyogrio/raw.py

jorisvandenbossche · 2021-11-12T11:06:12Z

pyogrio/raw.py

@@ -84,19 +85,36 @@ def read(
            "geometry": "<geometry type>"
        }
    """
+    from_buffer = False
+    if isinstance(path_or_buffer, bytes):


Here I am checking for bytes vs. strings to determine whether the input is a path or in-memory bytes. I don't know if that is robust enough. Or do we want a separate read_buffer function or similar?
Alternatively (or in addition), we could also support "file-like" objects (objects that have a read() method that returns the bytes). That is, e.g., what fiona does in its open() method.

I think file-like objects that have a read() method would provide a more general solution? For example, the is_zipped check below is limited to a single zip format, whereas the file-like pattern would let users construct and pass in a ZipFile or GzipFile instance. I'm not familiar with fsspec, but it seems like the file-like pattern would support that as well.
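For reference, the three input kinds discussed here (path, raw bytes, file-like object with read()) could be told apart roughly as follows. This is only a sketch: `classify_source` is a hypothetical helper name, not something in pyogrio, and it reads file-like objects fully up front.

```python
import io
import os


def classify_source(path_or_buffer):
    """Return a (kind, value) pair for a path, raw bytes, or file-like input.

    Hypothetical helper for illustration only; not part of pyogrio.
    """
    if isinstance(path_or_buffer, bytes):
        return "buffer", path_or_buffer
    if hasattr(path_or_buffer, "read"):
        # File-like object: read the whole content up front.
        data = path_or_buffer.read()
        if not isinstance(data, bytes):
            raise TypeError("file-like object must be opened in binary mode")
        return "buffer", data
    if isinstance(path_or_buffer, (str, os.PathLike)):
        # Normalize str and pathlib.Path to a plain string path.
        return "path", os.fspath(path_or_buffer)
    raise TypeError(f"unsupported input type: {type(path_or_buffer).__name__}")


# Usage: raw bytes and a BytesIO wrapper end up as the same buffer.
kind_bytes, data_bytes = classify_source(b"abc")
kind_filelike, data_filelike = classify_source(io.BytesIO(b"abc"))
```

Checking for bytes first keeps the common case cheap; the read() check comes before the path check so that objects which are both path-like and file-like are treated as open files.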

pyogrio/tests/test_raw_io.py

martinfleis

Thanks! A few notes.

I think we should also support io.BytesIO objects here. Your snippet from #22 currently doesn't work and you need to call read() manually before passing it to Pyogrio.

file_bytes = io.BytesIO(open("tst.gpkg", "rb").read())
pyogrio.read_dataframe(file_bytes)

ERROR 4: <_io.BytesIO object at 0x1679997c0>: No such file or directory
---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
~/Git/pyogrio/pyogrio/_io.pyx in pyogrio._io.ogr_open()
    130     try:
--> 131         ogr_dataset = exc_wrap_pointer(
    132             GDALOpenEx(path_c, flags, <const char *const *>ogr_drivers, <const char *const *>open_opts, NULL)

~/Git/pyogrio/pyogrio/_err.pyx in pyogrio._err.exc_wrap_pointer()
    175         if exc:
--> 176             raise exc
    177         else:

CPLE_OpenFailedError: <_io.BytesIO object at 0x1679997c0>: No such file or directory

During handling of the above exception, another exception occurred:

DriverError                               Traceback (most recent call last)
/var/folders/2f/fhks6w_d0k556plcv3rfmshw0000gn/T/ipykernel_86671/3402334228.py in <module>
      1 file_bytes = io.BytesIO(open(momepy.datasets.get_path("bubenec"), "rb").read())
----> 2 pyogrio.read_dataframe(file_bytes)

~/Git/pyogrio/pyogrio/geopandas.py in read_dataframe(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids)
     88                 raise ValueError(f"'{path}' does not exist")
     89 
---> 90     meta, geometry, field_data = read(
     91         path_or_buffer,
     92         layer=layer,

~/Git/pyogrio/pyogrio/raw.py in read(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids)
    108 
    109     try:
--> 110         result = ogr_read(
    111             path,
    112             layer=layer,

~/Git/pyogrio/pyogrio/_io.pyx in pyogrio._io.ogr_read()
    717         fids = np.asarray(fids, dtype=np.intc)
    718 
--> 719     ogr_dataset = ogr_open(path_c, 0, kwargs)
    720     ogr_layer = get_ogr_layer(ogr_dataset, layer)
    721 

~/Git/pyogrio/pyogrio/_io.pyx in pyogrio._io.ogr_open()
    138 
    139     except CPLE_BaseError as exc:
--> 140         raise DriverError(str(exc))
    141 
    142     finally:

DriverError: <_io.BytesIO object at 0x1679997c0>: No such file or directory

While this works.

file_bytes = io.BytesIO(open(momepy.datasets.get_path("bubenec"), "rb").read())
pyogrio.read_dataframe(file_bytes.read())

Also, whenever I read from bytes, I get this kind of warning from GDAL. Is that okay? Should we try to silence/resolve it?

Warning 1: File /vsimem/a325e080edbc40e1b6e3404c586fe6b7 has GPKG application_id, but non conformant file extension

pyogrio/geopandas.py

pyogrio/tests/test_raw_io.py

jorisvandenbossche · 2022-01-27T13:53:17Z

I think we should also support io.BytesIO objects here. Your snippet from #22 currently doesn't work and you need to call read() manually before passing it to Pyogrio.

Yes, I also mentioned it above at #25 (comment). That indeed seems logical to add as well (it is just not strictly needed for geopandas, because we already call .read() on file-like objects before passing the result to fiona/pyogrio).

Also, whenever I read from bytes, I get this kind of warning from GDAL. Is that okay? Should we try to silence/resolve it?

I just noticed that with fiona as well. In this case, that's indeed something we should try to silence (I would need to look into how this part works).

brendan-ward

Thanks for working on this @jorisvandenbossche !

Overall this looks good, and I think the only major decision points are whether the first parameter to read should be positional-only, and whether to accept file-like inputs in addition to or in place of raw byte buffers. My hunch is that in place of would be reasonable, since users can always wrap a raw buffer in io.BytesIO.
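To make the positional-only question concrete, here is a minimal sketch using the `/` marker (Python 3.8+). The function body is a stand-in, not the real reader, and `call_with_keyword` is a hypothetical helper that only shows what happens when a caller passes the first argument by name.

```python
def read(path_or_buffer, /, layer=None):
    """Stand-in for the reader signature under discussion (illustration only)."""
    return path_or_buffer, layer


def call_with_keyword():
    """Show that the first argument can no longer be passed by keyword."""
    try:
        read(path_or_buffer="test.gpkg")
    except TypeError:
        return "rejected"
    return "accepted"


result = read("test.gpkg", layer="a")
```

One likely benefit is that the parameter name (path vs. path_or_buffer) stops being part of the public API, so it can be renamed later without breaking callers.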

pyogrio/geopandas.py

brendan-ward · 2022-01-27T22:30:06Z

pyogrio/geopandas.py

-    if not "://" in path:
-        if not "/vsi" in path.lower() and not os.path.exists(path):
-            raise ValueError(f"'{path}' does not exist")
+    if isinstance(path_or_buffer, str):


We will want a similar check for pathlib.Path objects too (which we achieved previously by forcing everything to string).

Is there a reason this check for whether the path exists (if it is a local file) lives specifically in geopandas.py? Or could we move it to read_raw and just pass path_or_buffer through here?

Yes, this could move to read_raw; no reason it specifically lives here.
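A rough sketch of what a shared existence check could look like once it also accepts pathlib.Path. `check_local_path` is a hypothetical name; the URL and /vsi conditions are taken from the removed lines in the diff above, and everything else is an assumption.

```python
import os
from pathlib import Path


def check_local_path(path_or_buffer):
    """Raise ValueError for a missing local path; return the input unchanged.

    Hypothetical helper for illustration; bytes and file-like inputs pass through.
    """
    if isinstance(path_or_buffer, (str, os.PathLike)):
        path = os.fspath(path_or_buffer)
        # Skip URLs and GDAL virtual file systems, as in the original check.
        is_remote_or_virtual = "://" in path or "/vsi" in path.lower()
        if not is_remote_or_virtual and not os.path.exists(path):
            raise ValueError(f"'{path}' does not exist")
    return path_or_buffer


# Usage: virtual paths and raw bytes are not checked against the local disk.
virtual = check_local_path("/vsimem/example.gpkg")
raw = check_local_path(b"raw bytes")
```

Using os.fspath avoids forcing everything to str up front, so bytes buffers are never mistaken for paths.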

brendan-ward · 2022-01-27T22:41:42Z

pyogrio/raw.py

@@ -84,19 +85,36 @@ def read(
            "geometry": "<geometry type>"
        }
    """
+    from_buffer = False
+    if isinstance(path_or_buffer, bytes):


I think file-like objects that have a read() method would provide a more general solution? For example, the is_zipped check below is limited to a single zip format, whereas the file-like pattern would let users construct and pass in a ZipFile or GzipFile instance. I'm not familiar with fsspec, but it seems like the file-like pattern would support that as well.

pyogrio/_ogr.pxd

pyogrio/raw.py

brendan-ward · 2022-01-27T22:54:24Z

pyogrio/tests/test_raw_io.py

+    assert_equal_result((meta, geometry, field_data), (meta2, geometry2, field_data2))
+
+
+    filename = os.path.join(str(tmpdir), "test.geojson")


Even though there is some shared code, I'd suggest simplifying this a little bit and using pytest.mark.parametrize with varying driver

jorisvandenbossche · 2022-01-28T17:18:49Z

For the "raw buffer vs. file-like object with read()" support, there is also the idea at #22 (comment) of having a way to avoid calling read() up front for such file-like objects. For remote files, that could be beneficial (e.g. if you have an open file backed by fsspec).

martinfleis · 2022-02-25T09:34:08Z

Can we try to catch those GDAL warnings?

Warning 1: File /vsimem/52fd6590f4fb4d77b5e5a3d84f748abc has GPKG application_id, but non conformant file extension

jorisvandenbossche · 2022-02-25T12:36:47Z

Ah, yes, so I started looking into that a while ago, and it's not that straightforward to "properly" get rid of them (i.e. by setting an appropriate file path extension).
In theory we could get the extension from a driver name (through the GDAL API for driver metadata), but in that case the user would still need to provide a driver name just to set the file path extension (while for the actual reading the driver gets inferred).

Note that fiona has the same issue (so from geopandas' point of view, it wouldn't be a "regression")
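To illustrate the extension idea, a sketch of building the /vsimem/ name with an optional extension. The `DRIVER_EXTENSIONS` mapping and `make_vsimem_path` are assumptions for illustration; a real implementation would presumably query GDAL's driver metadata instead of hard-coding a table.

```python
from uuid import uuid4

# Assumed mapping for illustration only; not read from GDAL driver metadata.
DRIVER_EXTENSIONS = {
    "GPKG": ".gpkg",
    "GeoJSON": ".geojson",
    "ESRI Shapefile": ".shp",
}


def make_vsimem_path(driver=None):
    """Build a unique in-memory path, adding an extension if the driver is known."""
    ext = DRIVER_EXTENSIONS.get(driver, "")
    return "/vsimem/{}".format(uuid4().hex + ext)


with_ext = make_vsimem_path("GPKG")
without_ext = make_vsimem_path()
```

With a .gpkg suffix the "non conformant file extension" warning should no longer apply, but as noted, this only helps when the caller passes a driver name.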

The warning itself is not an actual Python warning that I can catch and silence, but something that gets printed by GDAL.
We currently check for errors at

pyogrio/pyogrio/_err.pyx

Lines 135 to 165 in d0b202a

    
    cdef inline object exc_check():
        """Checks GDAL error stack for fatal or non-fatal errors

        Returns
        -------
        An Exception, SystemExit, or None
        """
        cdef const char *msg_c = NULL

        err_type = CPLGetLastErrorType()
        err_no = CPLGetLastErrorNo()
        err_msg = CPLGetLastErrorMsg()

        if err_msg == NULL:
            msg = "No error message."
        else:
            # Reformat messages.
            msg_b = err_msg
            msg = msg_b.decode('utf-8')
            msg = msg.replace("`", "'")
            msg = msg.replace("\n", " ")

        if err_type == 3:
            CPLErrorReset()
            return exception_map.get(
                err_no, CPLE_BaseError)(err_type, err_no, msg)

        if err_type == 4:
            return SystemExit("Fatal error: {0}".format((err_type, err_no, msg)))

        else:
            return

that could maybe be expanded to raise python warnings (or suppress them) as well (err_type 1 is Debug and 2 is Warning, while we currently only handle 3 (Failure) and 4 (Fatal)). Although we also have a logging set up, so that should maybe be used for this instead of python warnings? (@brendan-ward)
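As a pure-Python sketch of what "expanding" the check could mean: route debug messages to logging, turn GDAL warnings into Python warnings, and keep returning exceptions for failures. The names `handle_gdal_message` and the logger name are hypothetical, and RuntimeError stands in for the exception_map lookup; this is not how pyogrio does it.

```python
import logging
import warnings

logger = logging.getLogger("pyogrio_sketch")  # hypothetical logger name

# GDAL error classes as described above: 1 Debug, 2 Warning, 3 Failure, 4 Fatal.
CE_DEBUG, CE_WARNING, CE_FAILURE, CE_FATAL = 1, 2, 3, 4


def handle_gdal_message(err_type, err_no, msg):
    """Route a GDAL message by severity; return an exception for errors, else None."""
    if err_type == CE_DEBUG:
        logger.debug("GDAL debug %s: %s", err_no, msg)
        return None
    if err_type == CE_WARNING:
        # A Python warning can be filtered or silenced by the caller.
        warnings.warn(msg, RuntimeWarning, stacklevel=2)
        return None
    if err_type == CE_FAILURE:
        # Stand-in for the exception_map lookup in exc_check.
        return RuntimeError(f"GDAL error {err_no}: {msg}")
    if err_type == CE_FATAL:
        return SystemExit("Fatal error: {0}".format((err_type, err_no, msg)))
    return None


# Usage: the warning becomes catchable with the standard warnings machinery.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    outcome = handle_gdal_message(CE_WARNING, 1, "non conformant file extension")
```

Whether warnings or logging is the better channel is exactly the open question; the sketch only shows that either is straightforward once the message reaches Python.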

brendan-ward · 2022-03-10T15:05:34Z

I'm not sure what the best path forward on warnings vs logging is here, but I don't want to hold this PR up. Perhaps we can push dealing with that to another issue, so that this can get merged (after resolving conflicts)?

martinfleis · 2022-03-10T19:46:49Z

Perhaps we can push dealing with that to another issue, so that this can get merged

That is okay with me, but I'd like to resolve that before the next release if possible.

pyogrio/_ogr.pyx

pyogrio/tests/test_raw_io.py

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

brendan-ward

Thanks for continuing to work on this @jorisvandenbossche

A few suggested changes, and then this looks ready.

pyogrio/_ogr.pyx

brendan-ward · 2022-03-11T15:22:45Z

pyogrio/_ogr.pyx

+
+    vsi_filename = '/vsimem/{}'.format(uuid4().hex + ext)
+
+    vsi_handle = VSIFileFromMemBuffer(vsi_filename.encode("utf8"), <unsigned char *>bytesbuf, len(bytesbuf), 0)


Elsewhere we usually handle Python => C strings in multiple steps, I thought in part because not doing so triggers a compilation error. And we've standardized on using "UTF-8" to refer to Unicode in Cython.

So this would be

    char *filename_c = NULL
    filename_b = vsi_filename.encode("UTF-8")
    filename_c = filename_b
    vsi_handle = VSIFileFromMemBuffer(filename_c, <unsigned char *>bytesbuf, len(bytesbuf), 0)

Though I'm not sure that is strictly necessary. (same for remove_virtual_filename too)

It doesn't seem to be necessary, since it is compiling here?

(but already changed utf8 to UTF-8)

pyogrio/geopandas.py

pyogrio/raw.py

Co-authored-by: Brendan Ward <bcward@astutespruce.com>

brendan-ward

Thanks @jorisvandenbossche !

brendan-ward · 2022-04-01T17:41:43Z

pyogrio/tests/test_raw_io.py

+
+    assert np.array_equal(meta1["fields"], meta2["fields"])
+    assert np.array_equal(index1, index2)
+    # assert np.array_equal(geometry1, geometry2)


minor nit: remove commented line

Turned it into a small explanation why we are using pygeos here

ENH: support reading from in-memory buffers

0a10bc4

jorisvandenbossche mentioned this pull request Nov 12, 2021

ENH: support using pyogrio in read_file / to_file with engine keyword geopandas/geopandas#2225

Merged

jorisvandenbossche added 5 commits November 12, 2021 21:43

passthrough bytes in read_dataframe

60b8d28

Merge remote-tracking branch 'upstream/main' into read-in-memory

d81a70c

fixup merge

dadea0c

attribute buffer_to_virtual_file to fiona + clean up ogr.pxd

083a32a

clean-up + docstrings

9b3bdf3

jorisvandenbossche marked this pull request as ready for review January 27, 2022 11:21

jorisvandenbossche requested review from brendan-ward and martinfleis January 27, 2022 11:21

martinfleis reviewed Jan 27, 2022

pyogrio/geopandas.py

pyogrio/tests/test_raw_io.py

brendan-ward reviewed Jan 27, 2022

brendan-ward added this to the Version 0.4.0 milestone Jan 28, 2022

jorisvandenbossche added 5 commits February 9, 2022 09:42

parametrize test

fa9808b

Merge remote-tracking branch 'upstream/main' into read-in-memory

6d1aeff

support file-like objects

719399d

Merge remote-tracking branch 'upstream/main' into read-in-memory

1502ab1

make first argument positional-only

b8f7985

jorisvandenbossche added 2 commits March 11, 2022 09:53

skip pygeos if not present

bc1b4f1

Merge remote-tracking branch 'upstream/main' into read-in-memory

792210b

martinfleis reviewed Mar 11, 2022

pyogrio/_ogr.pyx

fixup merge

3e055f3

martinfleis reviewed Mar 11, 2022

pyogrio/tests/test_raw_io.py

Update pyogrio/tests/test_raw_io.py

74f12a9

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

martinfleis approved these changes Mar 11, 2022

brendan-ward reviewed Mar 11, 2022

jorisvandenbossche and others added 4 commits April 1, 2022 10:55

Update pyogrio/_ogr.pyx

41f70d4

Co-authored-by: Brendan Ward <bcward@astutespruce.com>

Merge remote-tracking branch 'upstream/main' into read-in-memory

bd68a50

small edits

dadcc40

add whatsnew

985a1ab

brendan-ward approved these changes Apr 1, 2022

update comment

332680e

jorisvandenbossche merged commit 22f6878 into geopandas:main Apr 2, 2022

jorisvandenbossche deleted the read-in-memory branch April 2, 2022 12:02

jorisvandenbossche mentioned this pull request Apr 30, 2023

ENH: support reading from in-memory (byte) objects #22

Closed
