ENH: add IO methods for reading/writing geospatial Feather datasets #91

Merged · 16 commits · Feb 11, 2022

Conversation

@jorisvandenbossche (Member) commented Aug 5, 2021

First version of a Feather dataset reader and writer. I think it should be mostly working, but it mainly needs more test coverage (plus a bit of clean-up, and some TODOs remain).

This adds a dask_geopandas.read_feather("path/to/dataset") and GeoDataFrame.to_feather("path/to/dataset").
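
As a quick illustration (a minimal sketch; the paths are placeholders):

import dask_geopandas

# read a geospatial Feather dataset lazily as a dask-geopandas GeoDataFrame
ddf = dask_geopandas.read_feather("path/to/dataset")

# write it back out, one Feather file per partition
ddf.to_feather("path/to/output")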

@brendan-ward (Member) left a comment

A few suggestions to consider.

try:
    return _arrow_to_geopandas(arrow_table)
except ValueError as err:
    # when no geometry column is selected, the above will error.
@brendan-ward (Member):

For a more robust fallback when no geometry columns are included in the user-defined columns to read from the file, you need to compare the user-defined columns against the geo columns listed in the metadata prior to calling _arrow_to_geopandas.
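
Roughly something like this (a hypothetical sketch; assumes the "geo" schema metadata key that geopandas writes):

import json

def _table_to_pandas(arrow_table, requested_columns):
    # hypothetical helper: check whether any geometry column survives the
    # user's column selection before converting with _arrow_to_geopandas
    geo_meta = json.loads(arrow_table.schema.metadata[b"geo"])
    geometry_columns = set(geo_meta["columns"])
    if requested_columns is not None and not geometry_columns & set(requested_columns):
        # no geometry column selected -> return a plain pandas DataFrame
        return arrow_table.to_pandas()
    return _arrow_to_geopandas(arrow_table)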

Alternatively, in geopandas we could consider adding an optional keyword to ignore this error gracefully instead of raising an exception (e.g., `missing_geometry="ignore"` or something like that).

But I'm not clear on what the advantage is here of supporting reads of geo feather files and specifically ignoring their geometry columns.

@jorisvandenbossche (Member, Author):

> But I'm not clear on what the advantage is here of supporting reads of geo feather files and specifically ignoring their geometry columns.

This is specifically to support ddf = dask_geopandas.read_feather(...); ddf["attribute"]....compute(), where ddf is a GeoDataFrame but the result will be a normal DataFrame, because the column selection is pushed down.

But as you say, this could also be done by changing geopandas to not error when no geometry column is present. With current geopandas, though, the above is needed.
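
Concretely, the pattern this enables (a sketch; "attribute" is a placeholder column name):

import dask_geopandas

ddf = dask_geopandas.read_feather("path/to/dataset")  # dask-geopandas GeoDataFrame

# the column selection is pushed down to the reader; the partitions then
# contain no geometry column, so the computed result is plain pandas
attribute = ddf["attribute"].compute()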

@jorisvandenbossche changed the title from "ENH: add read_feather for reading geospatial Feather datasets" to "ENH: add IO methods for reading/writing geospatial Feather datasets" on Aug 12, 2021
@martinfleis (Member)

Two points so far:

  1. This seems to break parquet IO. The snippet below runs fine on main.
import geopandas
import dask_geopandas

df = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
df.to_parquet("ne.parquet")
ddf = dask_geopandas.read_parquet("ne.parquet")

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/2f/fhks6w_d0k556plcv3rfmshw0000gn/T/ipykernel_49947/640206072.py in <module>
      1 # ddf = dask_geopandas.read_feather("ne.feather")
----> 2 ddf = dask_geopandas.read_parquet("ne.parquet")

~/Git/dask-geopandas/dask_geopandas/io/parquet.py in read_parquet(*args, **kwargs)
     97 
     98 def read_parquet(*args, **kwargs):
---> 99     result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
    100     # check if spatial partitioning information was stored
    101     spatial_partitions = result._meta.attrs.get("spatial_partitions", None)

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, ignore_metadata_file, metadata_task_size, split_row_groups, chunksize, aggregate_files, **kwargs)
    323         gather_statistics = True
    324 
--> 325     read_metadata_result = engine.read_metadata(
    326         fs,
    327         paths,

~/Git/dask-geopandas/dask_geopandas/io/parquet.py in read_metadata(cls, fs, paths, **kwargs)
     58         # get spatial partitions if available
     59         regions = geopandas.GeoSeries(
---> 60             [_get_partition_bounds(part, fs) for part in parts], crs=meta.crs
     61         )
     62         if regions.notna().all():

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5485         ):
   5486             return self[name]
-> 5487         return object.__getattribute__(self, name)
   5488 
   5489     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'crs'
  2. Spatial partitions are improperly treated. The same snippet works fine with parquet and on the main branch:
import geopandas
import dask_geopandas

df = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
df.to_feather("ne.feather")
ddf = dask_geopandas.read_feather("ne.feather")
ddf = ddf.set_index("name", npartitions=4, shuffle="tasks")

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/2f/fhks6w_d0k556plcv3rfmshw0000gn/T/ipykernel_50120/4252723701.py in <module>
----> 1 ddf = ddf.set_index("name", npartitions=4, shuffle="tasks")

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
   4508             from .shuffle import set_index
   4509 
-> 4510             return set_index(
   4511                 self,
   4512                 other,

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
    201             return result.map_partitions(M.sort_index)
    202 
--> 203     return set_partition(
    204         df, index, divisions, shuffle=shuffle, drop=drop, compute=compute, **kwargs
    205     )

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/shuffle.py in set_partition(df, index, divisions, max_branch, drop, shuffle, compute)
    289             set_partitions_pre, divisions=divisions, meta=meta
    290         )
--> 291         df2 = df.assign(_partitions=partitions)
    292     else:
    293         partitions = index.map_partitions(

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in assign(self, **kwargs)
   4575     @derived_from(pd.DataFrame)
   4576     def assign(self, **kwargs):
-> 4577         data = self.copy()
   4578         for k, v in kwargs.items():
   4579             if not (

~/Git/dask-geopandas/dask_geopandas/core.py in copy(self)
    178         """
    179         self_copy = super().copy()
--> 180         self_copy.spatial_partitions = self.spatial_partitions.copy()
    181         return self_copy
    182 

AttributeError: 'NoneType' object has no attribute 'copy'

@martinfleis added this to the 0.1 milestone Jan 17, 2022
@jorisvandenbossche (Member, Author)

> This seems to break parquet IO. The snippet below runs fine on main.

I suppose this is because this branch was outdated; main includes some fixes for dask compatibility. In any case, on an updated branch that snippet seems to work.

> Spatial partitions are improperly treated. The same snippet works fine with parquet and on the main branch.

That is also a bug that has been fixed in the meantime on main (copy() always tried to copy the spatial partitions, even when they are None).
But it also shows an important TODO: reading in the spatial partitioning information and populating that property.

@jorisvandenbossche (Member, Author)

I updated the reader side to correctly set the CRS on the dask GeoDataFrame, and to read in the spatial partitioning information.
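
A quick way to check both (a sketch; assumes the dataset was written with spatial partitioning information):

import dask_geopandas

ddf = dask_geopandas.read_feather("path/to/dataset")
print(ddf.crs)                 # CRS restored from the geo metadata
print(ddf.spatial_partitions)  # GeoSeries with the bounds of each partition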

@brendan-ward (Member) left a comment

Thanks for working on this @jorisvandenbossche - great to see support for Feather files here.

A few comments to consider, but don't let them hold up merging this in.

    geometry_column_name = geo_meta["primary_column"]
    crs = geo_meta["columns"][geometry_column_name]["crs"]
    geometry_columns = geo_meta["columns"]
else:
@brendan-ward (Member):

If this is a TODO, maybe this should raise a NotImplementedError until that is in place?

@martinfleis (Member):

I wouldn't raise an error. An incorrectly read file can be fixed (like if you read geo-arrow parquet with vanilla dask.dataframe), but if we raise, you'd need to explicitly read it with dask.dataframe. But I would maybe warn at least (though geopandas probably does that already?)

@jorisvandenbossche (Member, Author):

Yes, geopandas will currently actually raise an error in that case. But that error will only happen when computing, while it's probably good to already raise an error upfront.

With the current PR, we actually already run into an error in this case, because loading of the partition bounds fails. But that's not a very useful error.

Short term: will add a descriptive error and test for this.
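
Something along these lines (a hypothetical sketch; the actual message and placement may differ):

import json

def _get_geo_metadata(schema):
    # hypothetical helper: fail early with a descriptive message when the
    # required "geo" metadata is absent from the Arrow schema
    metadata = schema.metadata or {}
    if b"geo" not in metadata:
        raise ValueError(
            "Missing geo metadata in Feather dataset; "
            "read it with pandas.read_feather or pyarrow instead."
        )
    return json.loads(metadata[b"geo"])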

glob character if a single string.
columns: None or list(str)
Columns to load. If None, loads all.
index: str
@brendan-ward (Member):

The filters parameter is missing here.

@jorisvandenbossche (Member, Author):

Yeah, the docs still needed some updates. All parameters should now be documented and working/tested.

index: str
Column name to set as index.
storage_options: None or dict
Further parameters to pass to the bytes backend.
@brendan-ward (Member):

It might be a good idea to add a link or a suggestion about where to go for more information about this parameter.
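
For example, these options are passed through to fsspec, so anonymous S3 access would look like this (a sketch; the bucket path is a placeholder):

import dask_geopandas

# {"anon": True} is the fsspec option for unauthenticated S3 access
ddf = dask_geopandas.read_feather(
    "s3://my-bucket/dataset/", storage_options={"anon": True}
)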

result_part0.index.name = None
assert_geodataframe_equal(result_part0, df.iloc[:45])

# TODO geopandas doesn't actually support this for "feather" format
@brendan-ward (Member):

Unclear if this commented block should be removed; since it is in a test case, it might be better to move this to an issue and remove it from here.

@jorisvandenbossche (Member, Author):

Opened an issue -> geopandas/geopandas#2348

@martinfleis (Member)

Tangential question: why do both to_parquet and to_feather return a tuple of None? With parquet it is not so annoying, as it returns only (None,) no matter the number of partitions, but feather returns a tuple of None with length equal to the number of partitions, as in (None, None, None, None). I see that it is coming from dask.dataframe, but I don't know why.

@martinfleis (Member) left a comment

Is there a reason why we now have arrow.py, which includes functions shared between feather and parquet as well as the feather IO implementation, and a separate parquet.py that includes the parquet IO? Shouldn't it either be all in the arrow.py file, or have arrow.py, parquet.py, and feather.py?

@jorisvandenbossche (Member, Author)

OK, I gave this another good update (ensuring that remote filesystems work, fixing the index keyword, and adding more tests for that).

> Tangential question: why do both to_parquet and to_feather return a tuple of None?

We use compute_as_if_collection from dask for the final return value, so that is coming from there (but I also have no idea why it does that).

> Is there a reason why we now have arrow.py, which includes functions shared between feather and parquet as well as the feather IO implementation, and a separate parquet.py that includes the parquet IO? Shouldn't it either be all in the arrow.py file, or have arrow.py, parquet.py, and feather.py?

The reason I kept the existing parquet.py as a separate file (except for moving a bunch of code that could be shared to arrow.py) is that for the Parquet IO we still subclass the dask engine (and thus get all the detailed functionality from dask), only overriding a few things to ensure a proper GeoDataFrame and to handle the geo metadata. For the Feather IO, on the other hand, I implemented a fully custom (and simpler) engine based on pyarrow.dataset (dask itself doesn't have Feather IO).
I could split the Feather-specific things into a separate feather.py, but IMO that's probably not worth it. I should, however, add some more comments to the engine classes to document this design.

@brendan-ward (Member) left a comment

Looks good to me, thanks for the updates @jorisvandenbossche

The docstring for filters was not immediately clear to me (though I have no obvious edits to make it clearer); I think that is likely due to the complexity of the implementation. This is likely to be a case where examples will be helpful later.

@martinfleis (Member) left a comment

One minor note on the new error message. Feel free to merge afterwards! Thanks!

@jorisvandenbossche (Member, Author)

> The docstring for filters was not immediately clear to me (though I have no obvious edits to make it clearer); I think that is likely due to the complexity of the implementation. This is likely to be a case where examples will be helpful later.

Yeah, it's certainly not the easiest explanation, but it's mostly copied from dask. It would be good to add some more examples.
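
For example, using the list-of-tuples filter syntax from dask/pyarrow (a sketch; the column and value are placeholders):

import dask_geopandas

# keep only rows matching the predicate; tuples in a single list are ANDed
ddf = dask_geopandas.read_feather(
    "path/to/dataset", filters=[("continent", "==", "Africa")]
)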
