ENH: add IO methods for reading/writing geospatial Feather datasets #91
A few suggestions to consider.
```python
try:
    return _arrow_to_geopandas(arrow_table)
except ValueError as err:
    # when no geometry column is selected, the above will error.
```
For a more robust fallback that reads the file even when no geometry columns are included in the user-defined columns, you need to compare the user-defined columns against the geo columns listed in the metadata prior to `_arrow_to_geopandas`.
Alternatively, in geopandas we could consider adding an optional keyword to ignore this error gracefully instead of raising an exception (e.g., `missing_geometry="ignore"` or something like that).
But I'm not clear on what the advantage is here of supporting reads of geo feather files and specifically ignoring their geometry columns.
> But I'm not clear on what the advantage is here of supporting reads of geo feather files and specifically ignoring their geometry columns.
This is specifically to support doing `ddf = dask_geopandas.read_feather(...); ddf["attribute"].compute()`, where `ddf` is a GeoDataFrame but the result will be a normal DataFrame, because of pushing down the column selection.
But as you say, this could also be done by changing geopandas to not error when no geometry column is present. So with current geopandas, the above is needed.
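For illustration, a minimal sketch of that fallback, assuming the `"geo"` schema-metadata key and the `geo_meta["columns"]` layout that appear later in this thread; the helper name `_read_feather_fallback` is hypothetical, and `_arrow_to_geopandas` is the private geopandas helper referenced in the diff:

```python
import json

import pyarrow.feather as feather
from geopandas.io.arrow import _arrow_to_geopandas  # private geopandas helper


def _read_feather_fallback(path, columns=None):
    # Hypothetical helper: compare the user-selected columns against the
    # geometry columns listed in the geo metadata, and fall back to a plain
    # pandas DataFrame when no geometry column was selected.
    table = feather.read_table(path, columns=columns)
    geo_meta = json.loads(table.schema.metadata[b"geo"])
    geometry_columns = set(geo_meta["columns"])
    if columns is not None and not geometry_columns.intersection(columns):
        return table.to_pandas()
    return _arrow_to_geopandas(table)
```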
(force-pushed from 692070c to 755f2ed)
Two points so far:
```python
import geopandas
import dask_geopandas

df = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
df.to_parquet("ne.parquet")
ddf = dask_geopandas.read_parquet("ne.parquet")
```
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/2f/fhks6w_d0k556plcv3rfmshw0000gn/T/ipykernel_49947/640206072.py in <module>
      1 # ddf = dask_geopandas.read_feather("ne.feather")
----> 2 ddf = dask_geopandas.read_parquet("ne.parquet")

~/Git/dask-geopandas/dask_geopandas/io/parquet.py in read_parquet(*args, **kwargs)
     97
     98 def read_parquet(*args, **kwargs):
---> 99     result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
    100     # check if spatial partitioning information was stored
    101     spatial_partitions = result._meta.attrs.get("spatial_partitions", None)

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, ignore_metadata_file, metadata_task_size, split_row_groups, chunksize, aggregate_files, **kwargs)
    323         gather_statistics = True
    324
--> 325     read_metadata_result = engine.read_metadata(
    326         fs,
    327         paths,

~/Git/dask-geopandas/dask_geopandas/io/parquet.py in read_metadata(cls, fs, paths, **kwargs)
     58         # get spatial partitions if available
     59         regions = geopandas.GeoSeries(
---> 60             [_get_partition_bounds(part, fs) for part in parts], crs=meta.crs
     61         )
     62         if regions.notna().all():

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5485         ):
   5486             return self[name]
-> 5487         return object.__getattribute__(self, name)
   5488
   5489     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'crs'
```
```python
import geopandas
import dask_geopandas

df = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
df.to_feather("ne.feather")
ddf = dask_geopandas.read_feather("ne.feather")
ddf = ddf.set_index("name", npartitions=4, shuffle="tasks")
```
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/2f/fhks6w_d0k556plcv3rfmshw0000gn/T/ipykernel_50120/4252723701.py in <module>
----> 1 ddf = ddf.set_index("name", npartitions=4, shuffle="tasks")

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
   4508         from .shuffle import set_index
   4509
-> 4510         return set_index(
   4511             self,
   4512             other,

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
    201         return result.map_partitions(M.sort_index)
    202
--> 203     return set_partition(
    204         df, index, divisions, shuffle=shuffle, drop=drop, compute=compute, **kwargs
    205     )

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/shuffle.py in set_partition(df, index, divisions, max_branch, drop, shuffle, compute)
    289             set_partitions_pre, divisions=divisions, meta=meta
    290         )
--> 291         df2 = df.assign(_partitions=partitions)
    292     else:
    293         partitions = index.map_partitions(

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in assign(self, **kwargs)
   4575     @derived_from(pd.DataFrame)
   4576     def assign(self, **kwargs):
-> 4577         data = self.copy()
   4578         for k, v in kwargs.items():
   4579             if not (

~/Git/dask-geopandas/dask_geopandas/core.py in copy(self)
    178         """
    179         self_copy = super().copy()
--> 180         self_copy.spatial_partitions = self.spatial_partitions.copy()
    181         return self_copy
    182

AttributeError: 'NoneType' object has no attribute 'copy'
```
I suppose this is because this branch was outdated, and main includes some fixes for dask compat. In any case, on an updated branch, that snippet seems to work.
Also, that is a bug that has been fixed in the meantime on main.
I updated the reader side to correctly set the CRS on the dask GeoDataFrame, and to read in the spatial partitioning information.
Thanks for working on this @jorisvandenbossche - great to see support for Feather files here.
A few comments to consider, but don't let them hold up merging this in.
```python
    geometry_column_name = geo_meta["primary_column"]
    crs = geo_meta["columns"][geometry_column_name]["crs"]
    geometry_columns = geo_meta["columns"]
else:
```
If this is a TODO, maybe this should raise a `NotImplementedError` until that is in place?
I wouldn't raise an error. An incorrectly read file can be fixed (like if you read geo-arrow parquet with vanilla dask.dataframe), but if we raise, you'd need to explicitly read it with dask.dataframe. But I would maybe warn at least (though geopandas probably does that already?)
Yes, geopandas will currently raise an error in that case. But that error will only happen when computing, while it's probably good to already raise an error upfront.
With the current PR, we actually already run into an error in this case, because loading of the partitions bounds fails. But that's not a very useful error.
Short term: will add a descriptive error and test for this.
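As an illustration of what such an upfront check might look like (a sketch only, with a hypothetical helper name and error message, not the exact code added in the PR):

```python
import json


def _get_geo_metadata(schema):
    # Raise a descriptive error as soon as the metadata is read, instead of
    # failing later with an obscure error while loading partition bounds.
    metadata = schema.metadata or {}
    if b"geo" not in metadata:
        raise ValueError(
            "Missing geo metadata in the Feather/Parquet file: it was likely "
            "not written by geopandas. Use dask.dataframe to read it as a "
            "plain DataFrame."
        )
    return json.loads(metadata[b"geo"])
```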
dask_geopandas/io/arrow.py (outdated)
```
    glob character if a single string.
columns: None or list(str)
    Columns to load. If None, loads all.
index: str
```
The `filters` parameter is missing here.
Yeah, the docs still needed some updates. All parameters should now be documented and working/tested.
dask_geopandas/io/arrow.py (outdated)
```
index: str
    Column name to set as index.
storage_options: None or dict
    Further parameters to pass to the bytes backend.
```
It might be a good idea to add a link or suggestion about where to go for more information about this parameter.
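For example, `storage_options` is presumably forwarded to the fsspec filesystem backend, as in the other dask IO functions, so the docstring could link to the fsspec documentation; a hypothetical call against an anonymous S3 bucket:

```python
import dask_geopandas

# storage_options is passed through to the fsspec filesystem backend;
# the bucket path here is hypothetical
ddf = dask_geopandas.read_feather(
    "s3://some-bucket/ne-dataset/",
    storage_options={"anon": True},
)
```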
```python
    result_part0.index.name = None
    assert_geodataframe_equal(result_part0, df.iloc[:45])

    # TODO geopandas doesn't actually support this for "feather" format
```
Unclear if this commented-out block should be removed; since it is in a test case, it might be better to move this to an issue and remove it from here.
Opened an issue -> geopandas/geopandas#2348
Tangential question. Why does both …
Is there a reason why we now have `arrow.py`, which includes functions shared between feather and parquet plus the implementation of feather IO, and a separate `parquet.py` that includes the parquet IO? Shouldn't it be either all in the `arrow.py` file, or split into `arrow.py`, `parquet.py` and `feather.py`?
OK, I gave this another good update (ensuring that remote filesystems work, fixing the index keyword, and adding more tests for that).
The reason I kept the existing `parquet.py` as a separate file (except for moving a bunch of code to `arrow.py` that could be shared) is that for the Parquet IO we still subclass the dask engine (and thus get all the detailed functionality from dask), only overriding a few things to ensure a proper GeoDataFrame and handle the geo metadata. For the Feather IO, on the other hand, I implemented a fully custom (and simpler) engine based on `pyarrow.dataset` (dask itself doesn't have Feather IO).
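The core idea of such a custom engine can be sketched with `pyarrow.dataset` plus `dask.dataframe.from_delayed` (a simplification with hypothetical function names; the actual engine in this PR also handles the geo metadata, index, filters, and column selection):

```python
import dask.dataframe as dd
import pyarrow.dataset as ds
from dask import delayed
from geopandas.io.arrow import _arrow_to_geopandas  # private geopandas helper


def _read_fragment(fragment):
    # read a single Feather file of the dataset and convert it
    return _arrow_to_geopandas(fragment.to_table())


def read_feather_sketch(path):
    # discover the files of the dataset with the pyarrow dataset API
    dataset = ds.dataset(path, format="feather")
    # one lazily-read partition per file
    parts = [delayed(_read_fragment)(frag) for frag in dataset.get_fragments()]
    # an empty table keeps the schema (including the geo metadata) for meta
    meta = _arrow_to_geopandas(dataset.schema.empty_table())
    return dd.from_delayed(parts, meta=meta)
```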
Looks good to me, thanks for the updates @jorisvandenbossche
The docstring for the `filters` keyword was not immediately clear to me (but no obvious edits to make it clearer); I think that is likely due to the complexity of the implementation. This is likely to be a case where examples will be helpful later.
One minor note on the new error message. Feel free to merge afterwards! Thanks!
Yeah, it's certainly not the easiest explanation, but it's mostly copied from dask. It would be good to add some more examples.
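For instance, a docstring example along these lines might help (the dataset path is hypothetical; the filter syntax is the tuple form accepted by pyarrow/dask):

```python
import dask_geopandas

# only read the rows of the dataset matching the predicate
ddf = dask_geopandas.read_feather(
    "ne-dataset/",
    filters=[("continent", "==", "Africa")],
)
```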
First version of a Feather dataset reader and writer. I think it should be mostly working, but it mainly needs more test coverage (and also a bit of clean-up, with some TODOs remaining).
This adds `dask_geopandas.read_feather("path/to/dataset")` and `GeoDataFrame.to_feather("path/to/dataset")`.
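A round-trip usage sketch of the new methods, based on the naturalearth example used earlier in this thread:

```python
import geopandas
import dask_geopandas

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
ddf = dask_geopandas.from_geopandas(df, npartitions=4)

# write a partitioned Feather dataset (a directory of Feather files) ...
ddf.to_feather("path/to/dataset")

# ... and read it back as a dask GeoDataFrame
ddf2 = dask_geopandas.read_feather("path/to/dataset")
```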