Splitting out lazy indexing layer and backends layer as zarr-python features #9281
Labels: enhancement, topic-arrays (related to flexible array support), topic-backends, topic-chunked-arrays (managing different chunked backends, e.g. dask), topic-indexing, topic-internals, topic-lazy array, topic-zarr (related to zarr storage library)
What is your issue?
tl;dr: Could we factor out all of xarray's lazy indexing + backends as fully-featured virtual lazy zarr arrays?
When you do `xr.open_dataset`, a few main things happen: [...] (... `shape` and `dtype`)

When you do `virtualizarr.open_virtual_dataset` then also: [...] (... `ManifestArray`), which also knows the `shape` and `dtype`.

In zarr-developers/zarr-specs#303 we've suggested that, instead of relying on various xarray backends, (1) and (2) could be handled by zarr + chunk manifests + CF-specific zarr codecs.
For (3), note that currently we have lazy indexing in Xarray but not lazy concatenation, and in VirtualiZarr we kind of have lazy chunk-level concatenation without lazy indexing.
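To make the distinction concrete, here is a minimal pure-Python sketch of an array-like object that defers both indexing and concatenation until `compute()` is called. The class names (`LazySource`, `LazyIndexed`, `LazyConcat`) are hypothetical and are not part of xarray, zarr-python, or VirtualiZarr; the sketch is 1-D only and assumes plain lists as the in-memory format.

```python
class _Lazy:
    """Shared behaviour: indexing returns a new lazy node instead of data."""

    def __getitem__(self, key):
        return LazyIndexed(self, key)


class LazySource(_Lazy):
    """Leaf node: shape and dtype are known up front, values load on demand."""

    def __init__(self, loader, length, dtype):
        self.loader = loader  # zero-argument callable returning a list
        self.shape = (length,)
        self.dtype = dtype

    def compute(self):
        return list(self.loader())


class LazyIndexed(_Lazy):
    """Records a slice; nothing is read until compute()."""

    def __init__(self, parent, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        self.parent, self.key = parent, key
        # The resulting shape is computable from metadata alone.
        self.shape = (len(range(*key.indices(parent.shape[0]))),)
        self.dtype = parent.dtype

    def compute(self):
        # Naive: loads the whole parent, then slices. A real implementation
        # would push the slice down so that only the needed chunks are read.
        return self.parent.compute()[self.key]


class LazyConcat(_Lazy):
    """Records a concatenation of lazy parts along the only axis."""

    def __init__(self, parts):
        self.parts = list(parts)
        self.shape = (sum(p.shape[0] for p in self.parts),)
        self.dtype = self.parts[0].dtype

    def compute(self):
        out = []
        for part in self.parts:
            out.extend(part.compute())
        return out


calls = []  # records which sources were actually loaded


def make_loader(name, values):
    def load():
        calls.append(name)
        return values
    return load


a = LazySource(make_loader("a", [0, 1, 2]), 3, "int64")
b = LazySource(make_loader("b", [3, 4]), 2, "int64")

selection = LazyConcat([a, b])[1:4]  # concatenate, then index: still lazy
shape_before = selection.shape       # (3,), known without loading anything
calls_before = list(calls)           # [], no loader has run yet
result = selection.compute()         # [1, 2, 3]
```

The point of the sketch is only that `shape` and `dtype` propagate through both operations without touching the data, which is the property a combined lazy indexing + lazy concatenation layer would need.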
(4) is currently implemented separately from zarr-python in virtualizarr, but also notice that a `virtualizarr.ManifestArray` has all the information needed to actually go fetch data. In other words, it could be converted directly to an actual `zarr.Array` (mentioned by @ayushnag in zarr-developers/VirtualiZarr#124).

Imagine that we enabled the `zarr.Array` type (or some new `VirtualZarrArray` type) to do both indexing and concatenation lazily (proposed in zarr-developers/zarr-python#1603), and to open netCDF / other files via the chunk manifest (see zarr-developers/zarr-specs#287). It could also write out just its metadata to disk via the chunk manifest ZEP. This would then: [...] `virtualizarr.ManifestArray`, [...]

The result would be that xarray users would basically open data (netCDF or zarr or otherwise) and see `VirtualZarrArray`s wrapped by Xarray. They could then do lazy operations as they do now, and either load actual values via `.compute` or save only the lazy metadata representation to disk as a virtual zarr store (i.e. what `virtualizarr` does right now). The latter could be created by special serialization functions that understand how to translate a chain of lazy Zarr array operations into a valid metadata-only zarr-compliant format on-disk, or you could even imagine `ds.to_zarr` having a boolean `virtual` kwarg to cover both cases.

The lazy layer could either be implemented inside zarr or live on top of it and be importable from other packages (i.e. #5081, see also data-apis/array-api#777).
All together this would give you:

- Zarr arrays that can open and decode netCDF directly (a la Zarr as a "universal reader" for netCDF etc., via new CF decoding codecs, zarr-developers/zarr-specs#303)
- Lazy Zarr arrays even without Xarray
- Ability to save virtual datasets without needing a dedicated `ManifestArray` type (i.e. the lazy concatenation functionality of VirtualiZarr in zarr-python itself)
- Separation of the metadata-reading logic of kerchunk/VirtualiZarr from the lazy concatenation stuff, so VirtualiZarr gets demoted to just being a repository for readers for specific file formats and codecs for them.
- Complete separation of: [...] (... `zarr.Array`), [...] `ds.to_zarr(path, virtual=True)` calling (... `VirtualZarrArray`).

The main subtlety I see here is selection in index-space vs chunk-space: xarray does the former but VirtualiZarr does the latter (see also zarr-developers/VirtualiZarr#183). This is what @d-v-d was getting at in zarr-developers/VirtualiZarr#71.
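One way to read the index-space vs chunk-space subtlety: a selection expressed in element indices can only be represented purely as a selection of whole chunks when it lines up with chunk boundaries. The helper below is a hypothetical 1-D sketch of that check, not an existing function in xarray, zarr-python, or VirtualiZarr.

```python
def index_slice_to_chunk_slice(start, stop, chunk_len, array_len):
    """Translate the index-space range [start, stop) into a chunk-space slice.

    Raises ValueError when the range cuts through a chunk, because such a
    selection cannot be expressed by picking whole chunks alone.
    """
    if not (0 <= start <= stop <= array_len):
        raise IndexError("selection out of bounds")
    start_aligned = start % chunk_len == 0
    # The final chunk may be partial, so the array end counts as a boundary.
    stop_aligned = stop % chunk_len == 0 or stop == array_len
    if not (start_aligned and stop_aligned):
        raise ValueError("selection is not aligned to chunk boundaries")
    # Ceiling division for the stop so a trailing partial chunk is included.
    return slice(start // chunk_len, -(-stop // chunk_len))


# A length-10 array with chunks of 4 has chunks covering [0:4], [4:8], [8:10].
aligned = index_slice_to_chunk_slice(4, 10, chunk_len=4, array_len=10)  # slice(1, 3)

try:
    index_slice_to_chunk_slice(2, 8, chunk_len=4, array_len=10)
    misaligned_ok = True
except ValueError:
    misaligned_ok = False  # 2 falls inside chunk 0, so no chunk-space form exists
```

A lazy layer that works in index-space would accept the second selection and defer the partial-chunk read, whereas a purely chunk-space representation has to reject it or carry extra information alongside the chunk references.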
Whilst this is a longer-term roadmap idea, now is the time to think about it because of the malleability of zarr-python right now (e.g. zarr-developers/zarr-python#2052).
cc @dcherian @jhamman @joshmoore @sharkinsspatial @abarciauskas-bgse