Splitting out lazy indexing layer and backends layer as zarr-python features #9281

### What is your issue?

### **tl;dr: Could we factor out all of xarray's lazy indexing + backends as fully-featured virtual lazy zarr arrays?**

When you do `xr.open_dataset`, a few main things happen:

1) the data on disk is examined and a lazy representation is built (which knows the data's `shape` and `dtype`)
2) decoding steps (following CF conventions) are set up, ready to run when bytes are materialized
3) materialization of bytes is delayed by xarray's intermediate lazy indexing classes, which build a representation of successive slicing operations
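
To make (3) concrete, here is a minimal sketch of the idea of recording successive slices and only touching data on materialization. It is not xarray's actual implementation: `LazyArray1D`, `loader`, and `materialize` are hypothetical names, and it only handles 1-D slices (it leans on the fact that slicing a Python `range` composes slices for us).

```python
from typing import Callable, Optional, Sequence


class LazyArray1D:
    """Hypothetical 1-D lazy array: records slices, loads only on request."""

    def __init__(
        self,
        loader: Callable[[range], Sequence],
        length: int,
        dtype: str,
        index: Optional[range] = None,
    ):
        self._loader = loader
        self.dtype = dtype
        # The range holds the composition of every slice applied so far.
        self._index = range(length) if index is None else index

    @property
    def shape(self) -> tuple:
        return (len(self._index),)

    def __getitem__(self, key: slice) -> "LazyArray1D":
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        # Slicing a range yields a new range, so no data is read here.
        return LazyArray1D(self._loader, 0, self.dtype, index=self._index[key])

    def materialize(self) -> list:
        # The loader is called once, with the fully composed index.
        return list(self._loader(self._index))


calls = []


def loader(index: range) -> list:
    """Stand-in for reading bytes from disk; records each call."""
    calls.append(index)
    data = list(range(100, 200))
    return [data[i] for i in index]


arr = LazyArray1D(loader, length=100, dtype="int64")
view = arr[10:50][::2][1:4]   # three slices, still nothing loaded
shape_before_load = view.shape
calls_before_load = list(calls)
values = view.materialize()   # single load of indices 12, 14, 16
```

The `shape` is known after every slice without any read, which is the property the lazy representation in (1) and (3) relies on.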

When you do `virtualizarr.open_virtual_dataset`, the following also happens:

4) a chunk-level metadata-only lazy representation of data on-disk is created (the "chunk Manifest" inside the `ManifestArray`), which also knows the `shape` and `dtype`.
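
As a rough illustration of what such a chunk-level, metadata-only representation holds, the sketch below maps each chunk grid position to a byte range in a file. This is an assumption-laden toy, not VirtualiZarr's real `ManifestArray` class: `ChunkEntry`, `ManifestSketch`, `locate`, and the file name `example.nc` are all hypothetical.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class ChunkEntry:
    """Where one chunk's bytes live (hypothetical structure)."""
    path: str
    offset: int
    length: int


@dataclass
class ManifestSketch:
    """Metadata only: no array values are ever stored here."""
    shape: tuple
    chunks: tuple   # shape of a single chunk
    dtype: str
    entries: dict   # chunk grid index (tuple of ints) -> ChunkEntry

    @property
    def chunk_grid_shape(self) -> tuple:
        # Number of chunks along each dimension.
        return tuple(ceil(s / c) for s, c in zip(self.shape, self.chunks))

    def locate(self, chunk_index: tuple) -> ChunkEntry:
        # Everything needed to go and fetch that chunk's bytes.
        return self.entries[chunk_index]


# A 4x6 float32 array in 2x3 chunks: each chunk is 2 * 3 * 4 = 24 bytes.
manifest = ManifestSketch(
    shape=(4, 6),
    chunks=(2, 3),
    dtype="float32",
    entries={
        (0, 0): ChunkEntry("example.nc", 1000, 24),
        (0, 1): ChunkEntry("example.nc", 1024, 24),
        (1, 0): ChunkEntry("example.nc", 1048, 24),
        (1, 1): ChunkEntry("example.nc", 1072, 24),
    },
)
grid = manifest.chunk_grid_shape
entry = manifest.locate((1, 0))
```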

In https://github.com/zarr-developers/zarr-specs/issues/303 we've suggested that, instead of being handled by various xarray backends, (1) and (2) could be handled by zarr + chunk manifests + CF-specific zarr codecs.

For (3), note that currently we have lazy indexing in Xarray but not lazy concatenation, and in VirtualiZarr we kind of have lazy chunk-level concatenation without lazy indexing.
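
For intuition about what "lazy chunk-level concatenation" means, here is a small sketch under stated assumptions. It treats a manifest as a plain dict of chunk grid index to `(path, offset, length)` and concatenates by shifting chunk indices, without reading any data. The function name `concat_manifests` and the dict layout are hypothetical and not VirtualiZarr's API.

```python
def concat_manifests(a: dict, b: dict, axis: int) -> dict:
    """Concatenate two manifest dicts along `axis` by relabelling chunks."""
    if a["chunks"] != b["chunks"]:
        raise ValueError("chunk shapes must match")
    if a["shape"][axis] % a["chunks"][axis] != 0:
        raise ValueError("first array must end on a chunk boundary")
    for dim, (size_a, size_b) in enumerate(zip(a["shape"], b["shape"])):
        if dim != axis and size_a != size_b:
            raise ValueError("shapes must match off the concatenation axis")

    # Number of chunks `a` occupies along the concatenation axis.
    shift = a["shape"][axis] // a["chunks"][axis]

    entries = dict(a["entries"])
    for idx, ref in b["entries"].items():
        shifted = idx[:axis] + (idx[axis] + shift,) + idx[axis + 1:]
        entries[shifted] = ref  # the byte-range reference is reused untouched

    shape = (
        a["shape"][:axis]
        + (a["shape"][axis] + b["shape"][axis],)
        + a["shape"][axis + 1:]
    )
    return {"shape": shape, "chunks": a["chunks"], "entries": entries}


first = {
    "shape": (4,),
    "chunks": (2,),
    "entries": {(0,): ("a.nc", 0, 16), (1,): ("a.nc", 16, 16)},
}
second = {
    "shape": (2,),
    "chunks": (2,),
    "entries": {(0,): ("b.nc", 0, 16)},
}
combined = concat_manifests(first, second, axis=0)
```

Note that this only works when the inputs line up on chunk boundaries, which is exactly why it operates in chunk-space rather than index-space.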

(4) is currently implemented separately from zarr-python in [virtualizarr](https://github.com/zarr-developers/VirtualiZarr), but also notice that a `virtualizarr.ManifestArray` has all the information needed to actually go fetch data - in other words it could be converted directly to an actual `zarr.Array` (mentioned by @ayushnag in https://github.com/zarr-developers/VirtualiZarr/issues/124).

---

Imagine that we enabled the `zarr.Array` type (or some new `VirtualZarrArray` type) to do both indexing and concatenation lazily (proposed in https://github.com/zarr-developers/zarr-python/discussions/1603), and open netCDF / other files via the chunk manifest (see https://github.com/zarr-developers/zarr-specs/issues/287). It could also write out just its metadata to disk via the chunk manifest ZEP. This would then:

- Basically replace the `virtualizarr.ManifestArray`,
- Be wrapped by Xarray to provide both the "universal reader" of https://github.com/zarr-developers/zarr-specs/issues/303 and also lazy slicing & concatenation operations (see https://github.com/pydata/xarray/issues/5081).

The result would be that xarray users would basically open data (netCDF or zarr or otherwise) and see `VirtualZarrArray`s wrapped by Xarray. They could then do lazy operations as they do now, and either load actual values via `.compute` or save only the lazy metadata representation to disk as a virtual zarr store (i.e. what `virtualizarr` does right now). The latter could be created by special serialization functions that understand how to translate a chain of lazy Zarr array operations into a valid metadata-only zarr-compliant format on-disk, or you could even imagine `ds.to_zarr` having a boolean `virtual` kwarg to cover both cases.

The lazy layer could either be implemented inside zarr or live on top of it and be importable from other packages (i.e. https://github.com/pydata/xarray/issues/5081, see also https://github.com/data-apis/array-api/discussions/777).

All together this would give you:

1) Zarr arrays that can open and decode netCDF directly (a la https://github.com/zarr-developers/zarr-specs/issues/303)

2) Lazy Zarr arrays even without Xarray

3) Ability to save virtual datasets without needing a dedicated `ManifestArray` type (i.e. the lazy concatenation functionality of VirtualiZarr in zarr-python itself)

4) Separation of the metadata-reading logic of kerchunk/VirtualiZarr from the lazy concatenation stuff, so VirtualiZarr gets demoted to just being a repository for readers for specific file formats and codecs for them.

5) Complete separation of:
- finding byte ranges from archival formats (VirtualiZarr / kerchunk readers for specific file formats), 
- reading bytes (`zarr.Array`), 
- decoding bytes following CF (new CF zarr codecs mentioned in https://github.com/zarr-developers/zarr-specs/issues/303 and https://github.com/pydata/xarray/issues/155),
- lazy operations (new lazy operations package),
- handling of named variables / dimensions (Xarray), 
- serialization to metadata-only virtual Zarr store (`ds.to_zarr(path, virtual=True)` calling `VirtualZarrArray`).

The main subtlety I see here is selection in index-space vs chunk-space - xarray does the former but VirtualiZarr does the latter (see also https://github.com/zarr-developers/VirtualiZarr/pull/183). This is what @d-v-d was getting at in https://github.com/zarr-developers/VirtualiZarr/issues/71.
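
A small sketch of that distinction, for one dimension: an index-space slice can be rewritten as a chunk-space slice only when it lines up with chunk boundaries. This is an illustration under simplifying assumptions (unit step, regular chunks), not how either library implements selection, and `index_to_chunk_slice` is a hypothetical helper.

```python
from typing import Optional


def index_to_chunk_slice(
    sel: slice, length: int, chunk_size: int
) -> Optional[slice]:
    """Return the equivalent chunk-space slice, or None if `sel` cuts a chunk."""
    start, stop, step = sel.indices(length)
    if step != 1:
        return None  # strided selections need index-space handling
    starts_on_boundary = start % chunk_size == 0
    # The final chunk may be partial, so ending at `length` also counts.
    stops_on_boundary = stop % chunk_size == 0 or stop == length
    if not (starts_on_boundary and stops_on_boundary):
        return None
    # Ceiling division so a trailing partial chunk is included.
    return slice(start // chunk_size, -(-stop // chunk_size))


# A dimension of length 10 in chunks of 4: chunks cover [0:4], [4:8], [8:10].
whole_chunks = index_to_chunk_slice(slice(0, 8), length=10, chunk_size=4)
tail_chunks = index_to_chunk_slice(slice(4, 10), length=10, chunk_size=4)
cuts_a_chunk = index_to_chunk_slice(slice(2, 8), length=10, chunk_size=4)
```

Selections that return `None` here are the ones a purely chunk-level (manifest) representation cannot express on its own, and where an index-space lazy layer would have to take over.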

Whilst this is a longer-term roadmap idea, now is the time to think about it because of the malleability of zarr-python right now (e.g. https://github.com/zarr-developers/zarr-python/discussions/2052).

cc @dcherian @jhamman @joshmoore @sharkinsspatial @abarciauskas-bgse
