
POC (do not merge): Add row-group and column select #222

Draft · wants to merge 1 commit into main

Conversation

martindurant
Contributor

Adds `read_parquet(filename, rgs=None, columns=None)`, where the optional inputs are lists of integers. To be useful, the user first needs to know how many row-groups there are (easy) and how the columns they want map onto schema indices (hard in the nested case).
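To illustrate the "hard for the nested case" point: the column indices you must pass refer to flattened schema leaves, not top-level field names. A minimal pure-Python sketch of that mapping (the `(name, children)` schema representation here is an assumption for illustration, not arro3's actual API):

```python
def leaf_indices(schema, prefix=""):
    """Depth-first walk assigning each leaf field its flat column index."""
    out = {}
    counter = 0

    def walk(fields, prefix):
        nonlocal counter
        for name, children in fields:
            path = f"{prefix}.{name}" if prefix else name
            if children:
                walk(children, path)
            else:
                out[path] = counter
                counter += 1

    walk(schema, prefix)
    return out

# A nested schema: (name, children) pairs; leaves have no children.
schema = [
    ("id", []),
    ("location", [("lat", []), ("lon", [])]),
    ("tags", [("key", []), ("value", [])]),  # MAP: keys and values are separate leaves
]
idx = leaf_indices(schema)
# idx == {"id": 0, "location.lat": 1, "location.lon": 2,
#         "tags.key": 3, "tags.value": 4}
```

Note how the MAP type contributes two leaf indices (`tags.key`, `tags.value`), which is why selecting only one of them is rejected.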

I note that loading correctly refuses in the case where you have a MAP type and specify only the keys or only the values, but not both.

I also found that `filename` must be a real file, not a directory, which surprised me, since parquet datasets are usually multi-file. Of course dask et al. already have code for extracting the potentially many files of a dataset, first constructing an aggregate schema including any partitioning keys. I suppose this paragraph belongs in #195.

@kylebarron
Owner

Thanks for the PR!

> Makes read_parquet(filename, rgs=None, columns=None), where the optional inputs are integer lists. To be useful, the user needs to first know how many row-groups there are (this is easy) and how the columns they want map onto schema indices (this is hard for the nested case).

Right, this is where it's nicer to have a stateful reader like ParquetFile. For reference, I have ParquetFile and ParquetDataset in geoarrow-rs. These should work for non-spatial data, but are spatial-aware.
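The advantage of the stateful-reader shape is that the footer metadata is parsed once, and every subsequent selection query reuses it. A minimal sketch of that shape (class and method names here are illustrative, not the actual `ParquetFile` API in geoarrow-rs; the metadata dict is handed in directly rather than parsed from a real footer):

```python
class ParquetFileSketch:
    """Stateful reader sketch: hold parsed footer metadata once, then
    answer row-group/column queries without re-reading it."""

    def __init__(self, metadata):
        # A real reader would parse the Parquet footer here; for the
        # sketch, metadata is a dict passed in directly (an assumption).
        self._meta = metadata

    @property
    def num_row_groups(self):
        return len(self._meta["row_groups"])

    def select(self, row_groups=None, columns=None):
        """Return the (row_group, column) pairs a read would touch."""
        rgs = row_groups if row_groups is not None else range(self.num_row_groups)
        cols = columns if columns is not None else self._meta["columns"]
        return [(rg, col) for rg in rgs for col in cols]

meta = {"row_groups": [1000, 1000, 500], "columns": ["id", "lat", "lon"]}
f = ParquetFileSketch(meta)
plan = f.select(row_groups=[0, 2], columns=["id"])
# plan == [(0, "id"), (2, "id")]
```

With this shape, the user can inspect `f.num_row_groups` before deciding what to select, which answers the "user needs to first know how many row-groups there are" problem without a second metadata read.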

> I also found that "filename" must be a real file not a directory, which I found surprising, since parquet datasets are usually multi-file. Of course dask et al already has code for extracting the potentially many files of a dataset, first constructing an aggregate schema including any partitioning keys. I suppose this paragraph belongs in #195 .

The Rust parquet crate API only supports single files, so any dataset abstraction needs to be built on top of that. You can refer to ParquetDataset in geoarrow-rs: https://github.com/geoarrow/geoarrow-rs/blob/d4e6957bd73f70f9d00a8b6b8c2d66f7c0622deb/python/geoarrow-io/src/io/parquet/async.rs#L313-L321
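The first job of such a dataset layer is the aggregate schema mentioned above: union the per-file schemas and append any partitioning keys recovered from paths. A minimal sketch, with schemas as plain name-to-type dicts (an assumption for illustration; a real implementation would use Arrow schemas):

```python
def aggregate_schema(file_schemas, partition_keys=()):
    """Union per-file schemas (name -> type) into one dataset schema,
    then append hive-style partitioning keys derived from file paths."""
    merged = {}
    for schema in file_schemas:
        for name, typ in schema.items():
            if name in merged and merged[name] != typ:
                raise ValueError(f"conflicting types for column {name!r}")
            merged.setdefault(name, typ)
    for key in partition_keys:
        # Partition values come from path components, assumed string here.
        merged[key] = "string"
    return merged

s = aggregate_schema(
    [{"id": "int64", "lat": "double"}, {"id": "int64", "lon": "double"}],
    partition_keys=("year",),
)
# s == {"id": "int64", "lat": "double", "lon": "double", "year": "string"}
```

Each file is then read with the single-file API and padded out (with nulls or partition values) to the aggregate schema.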

@martindurant
Contributor Author

martindurant commented Oct 10, 2024

> Right, this is where it's nicer to have a stateful reader like ParquetFile. For reference, I have ParquetFile and ParquetDataset in geoarrow-rs. These should work for non-spatial data, but are spatial-aware.

I suppose I am a little confused about how you see these different packages relating to each other. Clearly geoarrow cares about parquet, but it also uses arro3 internally, so why wouldn't the dataset live here? I see it also supports remote stores...

geoarrow, of course, depends on the whole of pyarrow (all that C++ stuff). Not to mention GEOS and GDAL!!

@kylebarron
Owner

> I suppose I am a little confused about how you see there different packages relating to each other. Clearly geoarrow cares about parquet, but it also uses arro3 internally, so why wouldn't dataset live here? I see it also supports remote stores...

Because I don't know of a good way (yet) to have the main Python binding live here but inject spatial-specific extensions from geoarrow-rs's Python bindings. If we figured that out then I would put the core Python parquet bindings here.

It's the same story on the JS side, where I have WebAssembly bindings to Parquet, but then reimplement a bunch of that stuff for GeoParquet in JS. https://github.com/geoarrow/geoarrow-rs/tree/main/js/src/io/parquet

> geoarrow, of course, depends on the whole of pyarrow (all that C++ stuff). Not to mention GEOS and GDAL!!

No. geoarrow-rs does not depend on pyarrow, nor on GEOS or GDAL. geoarrow-rs is pure Rust, and in particular the GeoParquet reader and all its core dependencies are pure Rust. That's what allows it to go into WebAssembly so easily: https://explore.overturemaps.org (clicking Download Visible uses my Rust GeoParquet reader). geoarrow-rs is separate from geoarrow-c and geoarrow-pyarrow, which do depend on pyarrow.

@kylebarron
Owner

> Clearly geoarrow cares about parquet, but it also uses arro3 internally

To be pedantic, the geoarrow crate does not use arro3 internally. The geoarrow crate and its Python bindings only use pyo3-arrow. arro3-core is listed as a Python dependency only so that users have a friendly API to work with Arrow table data.

```rust
pub fn read_parquet(
    py: Python,
    file: FileReader,
    rgs: Option<Vec<usize>>,
```
martindurant
Contributor Author

Certainly! It really ought to be on the reader (`.select()` or such) anyway, so that you only read the parquet metadata once. Or maybe have a separate "fetch details" function. This was only a minimalist POC to show that you can pick out data portions like this.
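The "fetch details" idea above could be as small as one helper that runs over the already-parsed footer and returns everything needed to build the `rgs`/`columns` arguments. A sketch, assuming a hypothetical footer dict (the function name and dict shape are illustrative, not an existing API):

```python
def parquet_details(footer):
    """One pass over parsed footer metadata, returning what the user
    needs to construct rgs/columns arguments for read_parquet."""
    return {
        "num_row_groups": len(footer["row_groups"]),
        "column_index": {name: i for i, name in enumerate(footer["columns"])},
    }

footer = {"row_groups": [2048, 2048, 1024], "columns": ["id", "lat", "lon"]}
details = parquet_details(footer)
rgs = list(range(details["num_row_groups"]))               # e.g. all row groups
columns = [details["column_index"][c] for c in ["id", "lon"]]
# columns == [0, 2]
```

Hanging the same logic off a stateful reader (rather than a free function) avoids reading the footer twice: once for the details, once for the data.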
