Zarr as a "universal reader" for netCDF etc., via new CF decoding codecs #303

TomNicholas · 2024-07-26T01:46:16Z

Idea: Use zarr readers to open and decode netCDF/HDF/etc. data without xarray by lifting xarray's decoding machinery out as new zarr codecs.

This was suggested by @sharkinsspatial in zarr-developers/VirtualiZarr#68 (comment) and requires two components:

1) The chunk manifest storage transformer proposed in "Manifest storage transformer" (#287), which would allow zarr stores to redirect zarr readers to read byte ranges from inside arbitrary files, including legacy formats such as netCDF. We (particularly @abarciauskas-bgse, Sean and I) are already working on making this happen, so that we can open netCDF data via zarr using xarray, effectively upstreaming kerchunk's references format as a zarr extension.
2) Decoding according to CF conventions via new zarr codecs. xarray currently does this automatically and somewhat opaquely when reading a netCDF file directly, and it is still xarray that does it when we read a netCDF file via kerchunk/virtualizarr byte range references. This decoding step is well factored out internally inside xarray but not really publicly exposed (at least not without the rest of xarray as a dependency). The suggestion (originally from @rabernat in "How to handle encoding", VirtualiZarr#68 (comment)) is to lift that code out of xarray as a set of CF-specific zarr codecs that get called when a zarr reader opens a store with a manifest pointing to a netCDF file.
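For readers less familiar with what "CF decoding" involves, here is a minimal sketch of one common case: unpacking integer data using the CF `scale_factor`, `add_offset` and `_FillValue` attributes. The `cf_decode` helper and the sample values are made up for illustration; this is not xarray's implementation, only the arithmetic that a CF-specific codec would need to reproduce.

```python
import numpy as np


def cf_decode(raw, attrs):
    """Hypothetical helper: unpack a packed integer array using CF attributes."""
    data = raw.astype("float64")
    fill = attrs.get("_FillValue")
    if fill is not None:
        # Mask missing values before scaling so the sentinel is never rescaled.
        data[raw == fill] = np.nan
    # CF packing convention: unpacked = packed * scale_factor + add_offset
    return data * attrs.get("scale_factor", 1.0) + attrs.get("add_offset", 0.0)


# Made-up packed temperatures as they might sit on disk in a netCDF variable.
raw = np.array([0, 100, -9999, 250], dtype="int16")
attrs = {"scale_factor": 0.01, "add_offset": 273.15, "_FillValue": -9999}

decoded = cf_decode(raw, attrs)
print(decoded)  # approximately [273.15, 274.15, nan, 275.65]
```

The point is that a reader which skips this step sees the raw int16 values, which is the mismatch between pure-zarr users and xarray users that the proposal is trying to remove.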

To be really useful this probably also requires variable-length chunking in zarr (i.e. ZEP003).

The advantages of this are:
a) a clearer separation of concerns, with fewer "magic" steps hidden inside xarray,
b) applications that can read zarr but don't want to use xarray could also read and fully decode netCDF data (i.e. pure-zarr users see the same data as xarray users),
c) clearer steps towards generalizing to non-CF encoding conventions used in other domains of science,
d) opening the door to zarr becoming a "universal reader" of any file format whose data can be expressed as a manifest of byte ranges and decoding steps can be expressed as zarr codecs.
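As a rough illustration of what "a manifest of byte ranges" means, the sketch below writes a fake "legacy" file with a header followed by two chunks, then reads each chunk back through a manifest of `path`/`offset`/`length` entries. The file layout, the chunk keys and the `read_chunk` helper are all assumptions made for this example; the actual manifest format is whatever #287 ends up specifying.

```python
import os
import struct
import tempfile

# Build a fake "legacy" file: an opaque header followed by two chunks,
# each holding four little-endian int16 values.
header = b"LEGACY-HEADER"
chunk0 = struct.pack("<4h", 1, 2, 3, 4)
chunk1 = struct.pack("<4h", 5, 6, 7, 8)
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(header + chunk0 + chunk1)
    path = f.name

# Illustrative manifest: chunk key -> where that chunk's bytes live.
manifest = {
    "0": {"path": path, "offset": len(header), "length": len(chunk0)},
    "1": {"path": path, "offset": len(header) + len(chunk0), "length": len(chunk1)},
}


def read_chunk(manifest, key):
    """Hypothetical reader: fetch one chunk's bytes via its manifest entry."""
    entry = manifest[key]
    with open(entry["path"], "rb") as fh:
        fh.seek(entry["offset"])
        return fh.read(entry["length"])


values = [struct.unpack("<4h", read_chunk(manifest, key)) for key in ("0", "1")]
os.remove(path)
print(values)
```

The reader never parses the header or knows anything about the file format; everything format-specific has been pushed into the manifest, and everything value-specific would be pushed into codecs.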

Most of the work here would be on the xarray end - there is an ancient issue suggesting something similar in pydata/xarray#155, and a nice explanation of how xarray currently does this step in pydata/xarray#8548. Currently it looks essentially like this:

xarray.Dataset < dask chunking < CF decoding (using xarray's VariableCoder) < opening via datastore < file

where one of xarray's options for datastore is for zarr, and another is for netCDF (these are xarray's "backends"). I'm proposing something more like:

xarray.Dataset < dask chunking < zarr.Array < CF decoding (using new zarr codecs) < open via "universal" zarr reader < chunk manifest < file

where non-xarray users can still get all of:

zarr.Array < CF decoding (using new zarr codecs) < open via "universal" zarr reader < chunk manifest < file

One question is how well does xarray's internal concept of a VariableCoder map onto a zarr codec?

d-v-b · 2024-07-26T07:02:13Z

Thanks for the writeup Tom, a big +1 from me on this effort.

> One question is how well does xarray's internal concept of a VariableCoder map onto a zarr codec?

From glancing at the signature and a few implementations, it looks like the VariableCoder is totally compatible with the v3 codecs. If I understand correctly, CF Variables are n-dimensional arrays, so we might be looking at translating these to ArrayArrayCodecs.
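To make "array-to-array" concrete, here is a minimal sketch of a codec-shaped object that maps one n-dimensional array to another of the same shape and can be inverted. `ScaleOffsetCodec` is a hypothetical plain-Python class written for this comment; it does not subclass or reproduce zarr-python's actual codec base classes, whose method signatures differ.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ScaleOffsetCodec:
    """Hypothetical array-to-array codec: packed integers <-> unpacked floats."""

    scale_factor: float
    add_offset: float
    packed_dtype: str = "int16"

    def decode(self, packed):
        # Read direction: stored integers -> physical values.
        return packed.astype("float64") * self.scale_factor + self.add_offset

    def encode(self, unpacked):
        # Write direction: invert decode and round back to the stored dtype.
        scaled = (unpacked - self.add_offset) / self.scale_factor
        return np.round(scaled).astype(self.packed_dtype)


codec = ScaleOffsetCodec(scale_factor=0.5, add_offset=10.0)
packed = np.array([[0, 2], [4, 6]], dtype="int16")

unpacked = codec.decode(packed)
roundtrip = codec.encode(unpacked)
print(unpacked)
print(roundtrip)
```

Note that all the parameters live in the codec's own configuration, and the array passed in carries no names or attributes.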

TomNicholas · 2024-07-26T14:28:53Z

> From glancing at the signature and a few implementations, it looks like the VariableCoder is totally compatible with the v3 codecs. If I understand correctly, CF Variables are n-dimensional arrays, so we might be looking at translating these to ArrayArrayCodecs

Does an ArrayArrayCodec know about the names of dimensions? Or metadata attributes (i.e. .zmetadata)? Because the VariableCoder has access to that information, as it is stored on the xarray.Variable object passed in.
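A small example of why this matters: CF time decoding cannot be done from the array values alone, because the reference date lives in the `units` attribute. The sketch below uses a made-up `ToyVariable` (a stand-in for illustration, not `xarray.Variable`) and a hypothetical `decode_cf_time` that only handles `days since ...` units.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ToyVariable:
    """Stand-in for a variable that bundles dims, data and attributes."""

    dims: tuple
    data: list
    attrs: dict = field(default_factory=dict)


def decode_cf_time(var):
    """Hypothetical decoder: turn 'days since <date>' offsets into datetimes."""
    unit, _, reference = var.attrs["units"].partition(" since ")
    if unit != "days":
        raise ValueError(f"this sketch only handles 'days', got {unit!r}")
    epoch = datetime.fromisoformat(reference)
    decoded = [epoch + timedelta(days=d) for d in var.data]
    # The units attribute is consumed by decoding; other attributes pass through.
    remaining = {k: v for k, v in var.attrs.items() if k != "units"}
    return ToyVariable(var.dims, decoded, remaining)


raw = ToyVariable(
    dims=("time",),
    data=[0, 1, 31],
    attrs={"units": "days since 2000-01-01", "long_name": "time"},
)
decoded = decode_cf_time(raw)
print(decoded.data)
```

The decoder needs `attrs["units"]` to do anything at all, so a codec that only ever sees the bare array would have to get that information from somewhere else.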

TomNicholas mentioned this issue Aug 27, 2024: Replace this package with a VirtualiZarr reader? MITgcm/xmitgcm#337 (open)

LDeakin mentioned this issue Sep 22, 2024: Experimental chunk manifest support LDeakin/zarrs#79 (draft)
