Description
What is your issue?
tl;dr: Could we factor out all of xarray's lazy indexing + backends as fully-featured virtual lazy zarr arrays?
When you do `xr.open_dataset`, a few main things happen:
1. the data on disk is examined and a lazy representation built (which knows the data's `shape` and `dtype`)
2. decoding steps (following CF conventions) are set up, ready to happen upon materialization of bytes
3. materialization of bytes is delayed by xarray's intermediate lazy indexing classes, which build a representation of successive slicing operations
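To make the "representation of successive slicing operations" concrete, here is a minimal 1-D sketch. It is not xarray's actual lazy indexing code; `LazySlice` and `loader` are hypothetical names, and only plain slices are supported. It relies on the fact that slicing a Python `range` returns another `range`, so successive slices compose without touching any data.

```python
import numpy as np


class LazySlice:
    """Records successive slices of a 1-D array; reads nothing until asked.

    Hypothetical illustration only, not part of xarray or zarr.
    """

    def __init__(self, loader, index, dtype):
        self._loader = loader        # callable returning the full np.ndarray
        self._index = index          # a range describing the selected positions
        self.dtype = np.dtype(dtype)

    @property
    def shape(self):
        # Known from metadata alone, no bytes are read
        return (len(self._index),)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        # Slicing a range yields a range, which composes the two selections
        return LazySlice(self._loader, self._index[key], self.dtype)

    def materialize(self):
        # Only now is the underlying data actually loaded
        positions = np.asarray(self._index, dtype=np.intp)
        return self._loader()[positions]


calls = []


def loader():
    calls.append("read")             # lets us observe when a read happens
    return np.arange(100) * 2


lazy = LazySlice(loader, range(100), "int64")
view = lazy[10:50][::5][1:3]         # three slicing steps, still no read
result = view.materialize()          # one read, selection applied once
```

A real implementation would push the composed selection down to the storage layer instead of loading everything and then indexing, but the bookkeeping is the same idea.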
When you do `virtualizarr.open_virtual_dataset`, then also:
4. a chunk-level metadata-only lazy representation of the data on disk is created (the "chunk manifest" inside the `ManifestArray`), which also knows the `shape` and `dtype`.
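As a rough picture of what a chunk manifest holds, the sketch below maps each chunk index to a path, byte offset and byte length, and reads a chunk only on request. `ChunkRef`, `Manifest` and the file layout are hypothetical and much simpler than VirtualiZarr's real `ManifestArray` (1-D only, uncompressed chunks, every chunk the same size).

```python
import os
import tempfile
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ChunkRef:
    """Where one chunk's bytes live (hypothetical name)."""
    path: str
    offset: int
    length: int


@dataclass
class Manifest:
    """Metadata-only description of a chunked 1-D array (hypothetical name)."""
    chunks: dict       # chunk index -> ChunkRef
    chunk_len: int     # elements per chunk
    dtype: np.dtype

    @property
    def shape(self):
        # Derived from metadata, assuming every chunk is full
        return (len(self.chunks) * self.chunk_len,)

    def read_chunk(self, i):
        # Fetch exactly the referenced byte range, nothing else
        ref = self.chunks[i]
        with open(ref.path, "rb") as f:
            f.seek(ref.offset)
            raw = f.read(ref.length)
        return np.frombuffer(raw, dtype=self.dtype)


# Demo: a file with a 16-byte header followed by two uncompressed chunks.
data = np.arange(8, dtype="<i4")
path = os.path.join(tempfile.mkdtemp(), "legacy.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 16)
    f.write(data.tobytes())

manifest = Manifest(
    chunks={0: ChunkRef(path, 16, 16), 1: ChunkRef(path, 32, 16)},
    chunk_len=4,
    dtype=np.dtype("<i4"),
)
```

The point is that `manifest` alone is enough to go and fetch data, which is why converting it directly into something array-like seems plausible.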
In zarr-developers/zarr-specs#303 we've suggested that, instead of going through the various xarray backends, (1) and (2) could be handled by zarr + chunk manifests + CF-specific zarr codecs.
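For a feel of what a CF-specific codec would do, here is a sketch of the CF packing convention (`unpacked = packed * scale_factor + add_offset`, with `_FillValue` mapped to NaN) written as an encode/decode pair. `ScaleOffsetCodec` is a hypothetical name and this is not the zarr-python codec interface, just the arithmetic such a codec would wrap.

```python
import numpy as np


class ScaleOffsetCodec:
    """CF-style scale/offset packing as an encode/decode pair (hypothetical)."""

    def __init__(self, scale_factor=1.0, add_offset=0.0,
                 fill_value=None, packed_dtype="int16"):
        self.scale_factor = scale_factor
        self.add_offset = add_offset
        self.fill_value = fill_value
        self.packed_dtype = np.dtype(packed_dtype)

    def decode(self, packed):
        packed = np.asarray(packed)
        # CF unpacking: unpacked = packed * scale_factor + add_offset
        out = packed.astype("float64") * self.scale_factor + self.add_offset
        if self.fill_value is not None:
            # Missing values become NaN in the decoded array
            out[packed == self.fill_value] = np.nan
        return out

    def encode(self, values):
        values = np.asarray(values, dtype="float64")
        packed = np.round((values - self.add_offset) / self.scale_factor)
        if self.fill_value is not None:
            # NaN goes back to the fill value before the integer cast
            packed = np.where(np.isnan(values), self.fill_value, packed)
        return packed.astype(self.packed_dtype)


codec = ScaleOffsetCodec(scale_factor=0.5, add_offset=10.0, fill_value=-9999)
packed = np.array([0, 2, -9999], dtype="int16")
decoded = codec.decode(packed)       # bytes -> physical values
roundtrip = codec.encode(decoded)    # physical values -> bytes
```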
For (3), note that currently we have lazy indexing in Xarray but not lazy concatenation, and in VirtualiZarr we kind of have lazy chunk-level concatenation without lazy indexing.
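Chunk-level lazy concatenation can be sketched as pure bookkeeping on chunk-grid keys: the second array's chunk indices are shifted along the concatenation axis and no bytes are read. The function and the `(path, offset, length)` reference tuples below are hypothetical, and the sketch assumes each input's length along `axis` is a whole number of chunks.

```python
def concat_chunk_grids(grids, grid_shapes, axis):
    """Concatenate chunk manifests along `axis` by shifting chunk keys.

    grids:       list of {chunk_index_tuple: (path, offset, length)}
    grid_shapes: list of chunk-grid shapes (number of chunks per axis)
    Hypothetical helper, not the VirtualiZarr API.
    """
    first = grid_shapes[0]
    combined = {}
    shift = 0
    for grid, shape in zip(grids, grid_shapes):
        # All axes other than the concatenation axis must agree
        for ax, (n, m) in enumerate(zip(shape, first)):
            if ax != axis and n != m:
                raise ValueError("chunk grids differ along axis %d" % ax)
        for key, ref in grid.items():
            new_key = key[:axis] + (key[axis] + shift,) + key[axis + 1:]
            combined[new_key] = ref
        shift += shape[axis]
    new_shape = list(first)
    new_shape[axis] = shift
    return combined, tuple(new_shape)


a = {(0, 0): ("a.nc", 0, 100), (0, 1): ("a.nc", 100, 100)}
b = {(0, 0): ("b.nc", 0, 100), (0, 1): ("b.nc", 100, 100)}
combined, grid_shape = concat_chunk_grids([a, b], [(1, 2), (1, 2)], axis=0)
```

Note that nothing here knows how to slice within a chunk, which is the "concatenation without lazy indexing" gap mentioned above.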
(4) is currently implemented separately from zarr-python in virtualizarr, but also notice that a `virtualizarr.ManifestArray` has all the information needed to actually go fetch data - in other words it could be converted directly to an actual `zarr.Array` (mentioned by @ayushnag in zarr-developers/VirtualiZarr#124).
Imagine that we enabled the `zarr.Array` type (or some new `VirtualZarrArray` type) to do both indexing and concatenation lazily (proposed in zarr-developers/zarr-python#1603), and open netCDF / other files via the chunk manifest (see zarr-developers/zarr-specs#287). It could also write out just its metadata to disk via the chunk manifest ZEP. This would then:
- Basically replace the `virtualizarr.ManifestArray`,
- Be wrapped by Xarray to provide both the "universal reader" of zarr-developers/zarr-specs#303 (Zarr as a "universal reader" for netCDF etc., via new CF decoding codecs) and also lazy slicing & concatenation operations (see "Lazy indexing arrays as a stand-alone package", #5081).
The result would be that xarray users would basically open data (netCDF or zarr or otherwise) and see `VirtualZarrArray`s wrapped by Xarray. They could then do lazy operations as they do now, and either load actual values via `.compute` or save only the lazy metadata representation to disk as a virtual zarr store (i.e. what `virtualizarr` does right now). The latter could be created by special serialization functions that understand how to translate a chain of lazy Zarr array operations into a valid metadata-only zarr-compliant format on-disk, or you could even imagine `ds.to_zarr` having a boolean `virtual` kwarg to cover both cases.
The lazy layer could either be implemented inside zarr or live on top of it and be importable from other packages (i.e. #5081, see also data-apis/array-api#777).
All together this would give you:
- Zarr arrays that can open and decode netCDF directly (a la zarr-developers/zarr-specs#303)
- Lazy Zarr arrays even without Xarray
- Ability to save virtual datasets without needing a dedicated `ManifestArray` type (i.e. the lazy concatenation functionality of VirtualiZarr in zarr-python itself)
- Separation of the metadata-reading logic of kerchunk/VirtualiZarr from the lazy concatenation stuff, so VirtualiZarr gets demoted to just being a repository for readers for specific file formats and codecs for them.
- Complete separation of:
  - finding byte ranges from archival formats (VirtualiZarr / kerchunk readers for specific file formats),
  - reading bytes (`zarr.Array`),
  - decoding bytes following CF (new CF zarr codecs mentioned in zarr-developers/zarr-specs#303 and in "Expose a public interface for CF encoding/decoding functions", #155),
  - lazy operations (new lazy operations package),
  - handling of named variables / dimensions (Xarray),
  - serialization to metadata-only virtual Zarr store (`ds.to_zarr(path, virtual=True)` calling `VirtualZarrArray`).
The main subtlety I see here is selection in index-space vs chunk-space - xarray does the former but VirtualiZarr does the latter (see also zarr-developers/VirtualiZarr#183). This is what @d-v-d was getting at in zarr-developers/VirtualiZarr#71.
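To illustrate the index-space vs chunk-space distinction: an element-wise slice can only be expressed as a chunk-level selection if it lands on chunk boundaries. The helper below is a hypothetical 1-D sketch (unit step only), not how either library implements it.

```python
def index_to_chunk_slice(start, stop, chunk_len, length):
    """Translate an index-space slice into a chunk-space slice.

    Only works when the selection is chunk-aligned; otherwise a metadata-only
    representation cannot describe it without splitting chunks.
    Hypothetical helper for illustration.
    """
    if not 0 <= start <= stop <= length:
        raise ValueError("slice out of bounds")
    start_aligned = start % chunk_len == 0
    # The final chunk may be partial, so stopping at the array end is fine
    stop_aligned = stop % chunk_len == 0 or stop == length
    if not (start_aligned and stop_aligned):
        raise ValueError("selection does not fall on chunk boundaries")
    first_chunk = start // chunk_len
    last_chunk = -(-stop // chunk_len)   # ceiling division
    return slice(first_chunk, last_chunk)


aligned = index_to_chunk_slice(0, 20, chunk_len=10, length=100)
to_end = index_to_chunk_slice(10, 95, chunk_len=10, length=95)
```

Something like `index_to_chunk_slice(5, 20, chunk_len=10, length=100)` raises, and that is exactly the case a lazy indexing layer on top of the manifest would need to handle.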
Whilst this is a longer-term roadmap idea, now is the time to think about it because of the malleability of zarr-python right now (e.g. zarr-developers/zarr-python#2052).
cc @dcherian @jhamman @joshmoore @sharkinsspatial @abarciauskas-bgse