ENH: add IO methods for reading/writing geospatial Feather datasets #91

Merged · 16 commits · Feb 11, 2022

Conversation

@jorisvandenbossche (Member) commented Aug 5, 2021

First version of a Feather dataset reader and writer. I think it should be mostly working, but it mainly needs more test coverage (plus a bit of clean-up, and some TODOs remain).

This adds a dask_geopandas.read_feather("path/to/dataset") and GeoDataFrame.to_feather("path/to/dataset").
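
As a quick illustration (a minimal sketch; the paths are placeholders):

import dask_geopandas

# read a geospatial Feather dataset lazily as a dask-geopandas GeoDataFrame
ddf = dask_geopandas.read_feather("path/to/dataset")

# write it back out, one Feather file per partition
ddf.to_feather("path/to/output")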

@brendan-ward (Member) left a comment

A few suggestions to consider.

try:
    return _arrow_to_geopandas(arrow_table)
except ValueError as err:
    # when no geometry column is selected, the above will error.
@brendan-ward (Member):

For a more robust fallback when no geometry columns are included in the user-defined columns to read from the file, you need to compare the user-defined columns against the geo columns listed in the metadata prior to calling _arrow_to_geopandas.
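
Roughly something like this (a hypothetical sketch; assumes the "geo" schema metadata key that geopandas writes):

import json

def _table_to_pandas(arrow_table, requested_columns):
    # hypothetical helper: check whether any geometry column survives the
    # user's column selection before converting with _arrow_to_geopandas
    geo_meta = json.loads(arrow_table.schema.metadata[b"geo"])
    geometry_columns = set(geo_meta["columns"])
    if requested_columns is not None and not geometry_columns & set(requested_columns):
        # no geometry column selected -> return a plain pandas DataFrame
        return arrow_table.to_pandas()
    return _arrow_to_geopandas(arrow_table)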

Alternatively, in geopandas we could consider adding an optional keyword to ignore this error gracefully instead of raising an exception (e.g., `missing_geometry="ignore"` or something like that).

But I'm not clear on what the advantage is here of supporting reads of geo feather files and specifically ignoring their geometry columns.

@jorisvandenbossche (Member, Author):

> But I'm not clear on what the advantage is here of supporting reads of geo feather files and specifically ignoring their geometry columns.

This is specifically to support ddf = dask_geopandas.read_feather(...); ddf["attribute"]....compute(), where ddf is a GeoDataFrame but the result will be a normal DataFrame, because the column selection is pushed down.

But as you say, this could also be done by changing geopandas to not error when no geometry column is present. With current geopandas, though, the above is needed.
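
Concretely, the pattern this enables (a sketch; "attribute" is a placeholder column name):

import dask_geopandas

ddf = dask_geopandas.read_feather("path/to/dataset")  # dask-geopandas GeoDataFrame

# the column selection is pushed down to the reader; the partitions then
# contain no geometry column, so the computed result is plain pandas
attribute = ddf["attribute"].compute()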

@jorisvandenbossche changed the title from "ENH: add read_feather for reading geospatial Feather datasets" to "ENH: add IO methods for reading/writing geospatial Feather datasets" on Aug 12, 2021
@martinfleis (Member)

Two points so far:

  1. This seems to break parquet IO. The snippet below runs fine on main.
import geopandas
import dask_geopandas

df = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
df.to_parquet("ne.parquet")
ddf = dask_geopandas.read_parquet("ne.parquet")

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/2f/fhks6w_d0k556plcv3rfmshw0000gn/T/ipykernel_49947/640206072.py in <module>
      1 # ddf = dask_geopandas.read_feather("ne.feather")
----> 2 ddf = dask_geopandas.read_parquet("ne.parquet")

~/Git/dask-geopandas/dask_geopandas/io/parquet.py in read_parquet(*args, **kwargs)
     97 
     98 def read_parquet(*args, **kwargs):
---> 99     result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
    100     # check if spatial partitioning information was stored
    101     spatial_partitions = result._meta.attrs.get("spatial_partitions", None)

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, ignore_metadata_file, metadata_task_size, split_row_groups, chunksize, aggregate_files, **kwargs)
    323         gather_statistics = True
    324 
--> 325     read_metadata_result = engine.read_metadata(
    326         fs,
    327         paths,

~/Git/dask-geopandas/dask_geopandas/io/parquet.py in read_metadata(cls, fs, paths, **kwargs)
     58         # get spatial partitions if available
     59         regions = geopandas.GeoSeries(
---> 60             [_get_partition_bounds(part, fs) for part in parts], crs=meta.crs
     61         )
     62         if regions.notna().all():

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5485         ):
   5486             return self[name]
-> 5487         return object.__getattribute__(self, name)
   5488 
   5489     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'crs'
  2. Spatial partitions are improperly treated. The same snippet works fine with parquet and on the main branch:
import geopandas
import dask_geopandas

df = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
df.to_feather("ne.feather")
ddf = dask_geopandas.read_feather("ne.feather")
ddf = ddf.set_index("name", npartitions=4, shuffle="tasks")

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/2f/fhks6w_d0k556plcv3rfmshw0000gn/T/ipykernel_50120/4252723701.py in <module>
----> 1 ddf = ddf.set_index("name", npartitions=4, shuffle="tasks")

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
   4508             from .shuffle import set_index
   4509 
-> 4510             return set_index(
   4511                 self,
   4512                 other,

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
    201             return result.map_partitions(M.sort_index)
    202 
--> 203     return set_partition(
    204         df, index, divisions, shuffle=shuffle, drop=drop, compute=compute, **kwargs
    205     )

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/shuffle.py in set_partition(df, index, divisions, max_branch, drop, shuffle, compute)
    289             set_partitions_pre, divisions=divisions, meta=meta
    290         )
--> 291         df2 = df.assign(_partitions=partitions)
    292     else:
    293         partitions = index.map_partitions(

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in assign(self, **kwargs)
   4575     @derived_from(pd.DataFrame)
   4576     def assign(self, **kwargs):
-> 4577         data = self.copy()
   4578         for k, v in kwargs.items():
   4579             if not (

~/Git/dask-geopandas/dask_geopandas/core.py in copy(self)
    178         """
    179         self_copy = super().copy()
--> 180         self_copy.spatial_partitions = self.spatial_partitions.copy()
    181         return self_copy
    182 

AttributeError: 'NoneType' object has no attribute 'copy'

@martinfleis added this to the 0.1 milestone Jan 17, 2022
@jorisvandenbossche (Member, Author)

> This seems to break parquet IO. The snippet below runs fine on main.

I suppose this is because this branch was outdated; main includes some fixes for dask compatibility. In any case, on an updated branch that snippet seems to work.

> Spatial partitions are improperly treated. The same snippet works fine with parquet and on the main branch.

That is also a bug that has been fixed in the meantime on main (copy() always tried to copy the spatial partitions, even when they are None).
But it also shows an important TODO: reading in the spatial partitioning information and populating that property.

@jorisvandenbossche (Member, Author)

I updated the reader side to correctly set the CRS on the dask GeoDataFrame, and to read in the spatial partitioning information.
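
A quick way to check both (a sketch; assumes the dataset was written with spatial partitioning information):

import dask_geopandas

ddf = dask_geopandas.read_feather("path/to/dataset")
print(ddf.crs)                 # CRS restored from the geo metadata
print(ddf.spatial_partitions)  # GeoSeries with the bounds of each partition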

@brendan-ward (Member) left a comment

Thanks for working on this @jorisvandenbossche - great to see support for Feather files here.

A few comments to consider, but don't let them hold up merging this in.

    geometry_column_name = geo_meta["primary_column"]
    crs = geo_meta["columns"][geometry_column_name]["crs"]
    geometry_columns = geo_meta["columns"]
else:
@brendan-ward (Member):

If this is a TODO, maybe this should raise a NotImplementedError until that is in place?

@martinfleis (Member):

I wouldn't raise an error. An incorrectly read file can be fixed (like if you read geo-arrow parquet with vanilla dask.dataframe), but if we raise, you'd need to explicitly read it with dask.dataframe. But I would maybe warn at least (though geopandas probably does that already?)

@jorisvandenbossche (Member, Author):

Yes, geopandas will currently actually raise an error in that case. But that error will only happen when computing, while it's probably good to already raise an error upfront.

With the current PR, we actually already run into an error in this case, because loading of the partition bounds fails. But that's not a very useful error.

Short term: will add a descriptive error and test for this.
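
Something along these lines (a hypothetical sketch; the actual message and placement may differ):

import json

def _get_geo_metadata(schema):
    # hypothetical helper: fail early with a descriptive message when the
    # required "geo" metadata is absent from the Arrow schema
    metadata = schema.metadata or {}
    if b"geo" not in metadata:
        raise ValueError(
            "Missing geo metadata in Feather dataset; "
            "read it with pandas.read_feather or pyarrow instead."
        )
    return json.loads(metadata[b"geo"])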

glob character if a single string.
columns: None or list(str)
Columns to load. If None, loads all.
index: str
@brendan-ward (Member):

The filters parameter is missing here.

@jorisvandenbossche (Member, Author):

Yeah, the docs still needed some updates. All parameters should now be documented and working/tested.

index: str
Column name to set as index.
storage_options: None or dict
Further parameters to pass to the bytes backend.
@brendan-ward (Member):

It might be a good idea to add a link or a suggestion about where to go for more information about this parameter.
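
For example, these options are passed through to fsspec, so anonymous S3 access would look like this (a sketch; the bucket path is a placeholder):

import dask_geopandas

# {"anon": True} is the fsspec option for unauthenticated S3 access
ddf = dask_geopandas.read_feather(
    "s3://my-bucket/dataset/", storage_options={"anon": True}
)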

result_part0.index.name = None
assert_geodataframe_equal(result_part0, df.iloc[:45])

# TODO geopandas doesn't actually support this for "feather" format
@brendan-ward (Member):

Unclear if this commented block should be removed; since it is in a test case, it might be better to move this to an issue and remove it from here.

@jorisvandenbossche (Member, Author):

Opened an issue -> geopandas/geopandas#2348

@martinfleis (Member)

Tangential question: why do both to_parquet and to_feather return a tuple of None? With parquet it is not so annoying, as it returns only (None,) no matter the number of partitions, but feather returns a tuple of None with length equal to the number of partitions, as in (None, None, None, None). I see that it is coming from dask.dataframe, but I don't know why.

@martinfleis (Member) left a comment

Is there a reason why we now have arrow.py, which includes functions shared between feather and parquet as well as the feather IO implementation, and a separate parquet.py that includes the parquet IO? Shouldn't it either be all in the arrow.py file, or have arrow.py, parquet.py, and feather.py?

@jorisvandenbossche (Member, Author)

OK, I gave this another good update (ensuring that remote filesystems work, fixing the index keyword, and adding more tests for that).

> Tangential question: why do both to_parquet and to_feather return a tuple of None?

We use compute_as_if_collection from dask for the final return value, so that is coming from there (but I also have no idea why it does that).

> Is there a reason why we now have arrow.py, which includes functions shared between feather and parquet as well as the feather IO implementation, and a separate parquet.py that includes the parquet IO? Shouldn't it either be all in the arrow.py file, or have arrow.py, parquet.py, and feather.py?

The reason I kept the existing parquet.py as a separate file (except for moving a bunch of code that could be shared to arrow.py) is that for the Parquet IO we still subclass the dask engine (and thus get all the detailed functionality from dask), only overriding a few things to ensure a proper GeoDataFrame and to handle the geo metadata. For the Feather IO, on the other hand, I implemented a fully custom (and simpler) engine based on pyarrow.dataset (dask itself doesn't have Feather IO).
I could split the Feather-specific things into a separate feather.py, but IMO that's probably not worth it. I should, however, add some more comments to the engine classes to document this design.

@brendan-ward (Member) left a comment

Looks good to me, thanks for the updates @jorisvandenbossche

The docstring for filters was not immediately clear to me (though I have no obvious edits to make it clearer); I think that is likely due to the complexity of the implementation. This is likely to be a case where examples will be helpful later.

@martinfleis (Member) left a comment

One minor note on the new error message. Feel free to merge afterwards! Thanks!

@jorisvandenbossche (Member, Author)

> The docstring for filters was not immediately clear to me (though I have no obvious edits to make it clearer); I think that is likely due to the complexity of the implementation. This is likely to be a case where examples will be helpful later.

Yeah, it's certainly not the easiest explanation, but it's mostly copied from dask. It would be good to add some more examples.
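
For example, using the list-of-tuples filter syntax from dask/pyarrow (a sketch; the column and value are placeholders):

import dask_geopandas

# keep only rows matching the predicate; tuples in a single list are ANDed
ddf = dask_geopandas.read_feather(
    "path/to/dataset", filters=[("continent", "==", "Africa")]
)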
