Zarr as a "universal reader" for netCDF etc., via new CF decoding codecs #303

Open
TomNicholas opened this issue Jul 26, 2024 · 2 comments

Comments

@TomNicholas
Member

TomNicholas commented Jul 26, 2024

Idea: Use zarr readers to open and decode netCDF/HDF/etc. data without xarray by lifting xarray's decoding machinery out as new zarr codecs.

This was suggested by @sharkinsspatial in zarr-developers/VirtualiZarr#68 (comment) and requires two components:

  1. The chunk manifest storage transformer proposed in #287 (Manifest storage transformer), which would allow zarr stores to redirect zarr readers to read byte ranges from inside arbitrary files, including legacy formats such as netCDF. We (particularly @abarciauskas-bgse, Sean and I) are already working on making this happen, so that we can open netCDF data via zarr using xarray, effectively upstreaming kerchunk's references format as a zarr extension.
  2. Decoding according to CF conventions via new Zarr codecs. Xarray currently does this automatically and somewhat opaquely when reading a netCDF file directly, and it still does it even when we read a netCDF file via kerchunk/virtualizarr byte range references. This decoding step is well factored out internally inside xarray but not really publicly exposed (at least not without the rest of xarray as a dependency). The suggestion (originally from @rabernat in "How to handle encoding", VirtualiZarr#68 (comment)) is to lift that code out of xarray as a set of CF-specific zarr codecs that get called when a zarr reader opens a store with a manifest pointing to a netCDF file.
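
To make "CF decoding" in point 2 concrete, here is a minimal sketch of one such step, unpacking values with the CF `scale_factor`, `add_offset` and `_FillValue` attributes. The function name `cf_decode` is hypothetical and this is not xarray's actual implementation; it only illustrates the kind of transformation a CF codec would have to perform.

```python
import math

def cf_decode(raw, attrs):
    """Hypothetical helper: mask `_FillValue`, then unpack the rest.

    Follows the CF packing rule: decoded = raw * scale_factor + add_offset.
    """
    fill = attrs.get("_FillValue")
    scale = attrs.get("scale_factor", 1.0)
    offset = attrs.get("add_offset", 0.0)
    return [
        math.nan if (fill is not None and v == fill) else v * scale + offset
        for v in raw
    ]

# Packed integers as they might be stored on disk, plus their CF attributes.
attrs = {"_FillValue": -9999, "scale_factor": 0.5, "add_offset": 10.0}
decoded = cf_decode([0, 4, -9999], attrs)
```

Here the first two values unpack to 10.0 and 12.0, and the fill value becomes NaN.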

To be really useful this probably also requires variable-length chunking in zarr (i.e. ZEP003).

The advantages of this are:
a) a clearer separation of concerns, with fewer "magic" steps hidden inside xarray,
b) applications that can read zarr but don't want to use xarray could also read and fully decode netCDF data (i.e. pure-zarr users see the same data as xarray users),
c) clearer steps towards generalizing to non-CF encoding conventions used in other domains of science,
d) opening the door to zarr becoming a "universal reader" of any file format whose data can be expressed as a manifest of byte ranges and decoding steps can be expressed as zarr codecs.
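
As a rough illustration of the "manifest of byte ranges" idea in (d), the sketch below writes a fake legacy file (an opaque header followed by two raw chunks) and reads one chunk back through a manifest. The manifest layout (chunk key mapped to `path`, `offset`, `length`) and the `read_chunk` helper are assumptions made for this example, not the format defined in #287.

```python
import os
import struct
import tempfile

# Fake "legacy" file: a 16-byte header, then two chunks of four
# little-endian int16 values each.
header = b"H" * 16
chunk0 = struct.pack("<4h", 1, 2, 3, 4)
chunk1 = struct.pack("<4h", 5, 6, 7, 8)
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(header + chunk0 + chunk1)
    path = f.name

# Assumed manifest layout: chunk key -> where that chunk's bytes live.
manifest = {
    "0": {"path": path, "offset": 16, "length": 8},
    "1": {"path": path, "offset": 24, "length": 8},
}

def read_chunk(manifest, key):
    """Fetch the raw bytes of one chunk using only the manifest entry."""
    entry = manifest[key]
    with open(entry["path"], "rb") as fh:
        fh.seek(entry["offset"])
        return fh.read(entry["length"])

values = struct.unpack("<4h", read_chunk(manifest, "1"))
os.remove(path)
```

The reader never parses the header; everything it needs to locate a chunk comes from the manifest, and any further decoding would be left to codecs.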

Most of the work here would be on the xarray end. There is an ancient issue suggesting something similar in pydata/xarray#155, and a nice explanation of how xarray currently does this step in pydata/xarray#8548. Currently it looks essentially like this:

xarray.Dataset < dask chunking < CF decoding (using xarray's VariableCoder) < opening via datastore < file

where one of xarray's options for datastore is for zarr, and another is for netCDF (these are xarray's "backends"). I'm proposing something more like:

xarray.Dataset < dask chunking < zarr.Array < CF decoding (using new zarr codecs) < open via "universal" zarr reader < chunk manifest < file

where non-xarray users can still get all of

zarr.Array < CF decoding (using new zarr codecs) < open via "universal" zarr reader < chunk manifest < file


One question is how well does xarray's internal concept of a VariableCoder map onto a zarr codec?

@d-v-b
Contributor

d-v-b commented Jul 26, 2024

Thanks for the writeup Tom, a big +1 from me on this effort.

> One question is how well does xarray's internal concept of a VariableCoder map onto a zarr codec?

From glancing at the signature and a few implementations, it looks like the VariableCoder is totally compatible with the v3 codecs. If I understand correctly, CF Variables are n-dimensional arrays, so we might be looking at translating these to ArrayArrayCodecs

@TomNicholas
Member Author

> From glancing at the signature and a few implementations, it looks like the VariableCoder is totally compatible with the v3 codecs. If I understand correctly, CF Variables are n-dimensional arrays, so we might be looking at translating these to ArrayArrayCodecs

Does an ArrayArrayCodec know about the names of dimensions? Or metadata attributes (i.e. .zmetadata)? Because the VariableCoder has access to that information, as it is stored on the xarray.Variable object passed in.
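
To illustrate the distinction I'm asking about, here is a toy sketch. `Variable`, `coder_decode` and `ScaleOffsetCodec` are all hypothetical stand-ins written for this comment, not xarray's or zarr's real classes. The coder-style function reads its parameters from the attrs carried by the variable, whereas the codec-style object only ever sees array values, so its parameters have to be supplied when it is constructed.

```python
from dataclasses import dataclass, field

@dataclass
class Variable:
    """Hypothetical stand-in for xarray.Variable: data plus dims and attrs."""
    dims: tuple
    data: list
    attrs: dict = field(default_factory=dict)

def coder_decode(var):
    """Coder-style: parameters are read from the variable's own attrs."""
    attrs = dict(var.attrs)
    scale = attrs.pop("scale_factor", 1.0)
    offset = attrs.pop("add_offset", 0.0)
    return Variable(var.dims, [v * scale + offset for v in var.data], attrs)

@dataclass(frozen=True)
class ScaleOffsetCodec:
    """Codec-style: parameters fixed up front; only array values are seen."""
    scale_factor: float = 1.0
    add_offset: float = 0.0

    def decode(self, chunk):
        return [v * self.scale_factor + self.add_offset for v in chunk]

    def encode(self, chunk):
        return [round((v - self.add_offset) / self.scale_factor) for v in chunk]

packed = Variable(
    ("time",), [0, 2, 4],
    {"scale_factor": 0.5, "add_offset": 10.0, "units": "K"},
)
via_coder = coder_decode(packed)

# In this sketch, something outside the codec must copy the attrs into it.
codec = ScaleOffsetCodec(packed.attrs["scale_factor"], packed.attrs["add_offset"])
via_codec = codec.decode(packed.data)
```

Both routes give the same values here, but only the coder-style route ever touched `dims` or `attrs`. If codecs cannot see that metadata, the attrs would presumably need to be translated into codec configuration when the store is created.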
