Splitting out lazy indexing layer and backends layer as zarr-python features #9281

Open
@TomNicholas

What is your issue?

tl;dr: Could we factor out all of xarray's lazy indexing + backends as fully-featured virtual lazy zarr arrays?

When you do xr.open_dataset, a few main things happen:

  1. the data on disk is examined and a lazy representation built (which knows the data's shape and dtype)
  2. decoding steps (following CF conventions) are set up ready to happen upon materialization of bytes
  3. materialization of bytes is delayed by xarray's intermediate lazy indexing classes, which build a representation of successive slicing operations
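To make step (3) concrete, here is a toy sketch of what "building a representation of successive slicing operations" can look like. It is not xarray's actual implementation: `LazyIndexedArray` and `fake_loader` are hypothetical names, the array is 1-D, and only slices are supported.

```python
class LazyIndexedArray:
    """Toy 1-D lazy indexing wrapper (hypothetical, not an xarray class)."""

    def __init__(self, loader, length, index=None):
        self._loader = loader  # callable: list of positions -> list of values
        self._index = range(length) if index is None else index

    @property
    def shape(self):
        # Shape is known from the recorded index alone, no data needed.
        return (len(self._index),)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        # Slicing a range gives another range, so successive slices
        # compose without touching the underlying data.
        return LazyIndexedArray(self._loader, 0, self._index[key])

    def compute(self):
        # Only now are the selected positions actually requested.
        return self._loader(list(self._index))


reads = []  # records every request that reaches "storage"


def fake_loader(positions):
    reads.append(positions)
    return [p * 10 for p in positions]


arr = LazyIndexedArray(fake_loader, 100)
sub = arr[10:50][::4][1:3]  # three slices, still nothing loaded
shape_before_load = sub.shape
reads_before_load = list(reads)
values = sub.compute()
```

The three slices collapse into a single request for positions 14 and 18, which is the property the lazy indexing classes provide.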

When you do virtualizarr.open_virtual_dataset, one more thing happens:

  4. a chunk-level, metadata-only lazy representation of the data on disk is created (the "chunk manifest" inside the ManifestArray), which also knows the shape and dtype.

In zarr-developers/zarr-specs#303 we've suggested that, instead of going through the various xarray backends, (1) and (2) could be handled by zarr + chunk manifests + CF-specific zarr codecs.

For (3), note that currently we have lazy indexing in Xarray but not lazy concatenation, and in VirtualiZarr we kind of have lazy chunk-level concatenation without lazy indexing.

(4) is currently implemented separately from zarr-python in virtualizarr, but also notice that a virtualizarr.ManifestArray has all the information needed to actually go fetch data - in other words it could be converted directly to an actual zarr.Array (mentioned by @ayushnag in zarr-developers/VirtualiZarr#124).
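As a rough illustration of chunk-level lazy concatenation, the sketch below treats a manifest as a mapping from chunk grid index to a byte range in some file, and concatenates two manifests by re-keying chunks. `ChunkRef`, `Manifest` and `concat_manifests` are hypothetical names, not VirtualiZarr's API, and the sketch assumes both arrays share the same chunk shape and are whole multiples of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    path: str    # file holding the chunk
    offset: int  # byte offset within the file
    length: int  # number of bytes


@dataclass
class Manifest:
    chunk_grid: tuple  # number of chunks along each axis
    entries: dict      # chunk index tuple -> ChunkRef


def concat_manifests(a, b, axis=0):
    """Concatenate along `axis` by shifting b's chunk keys; no bytes are read."""
    rest_a = a.chunk_grid[:axis] + a.chunk_grid[axis + 1:]
    rest_b = b.chunk_grid[:axis] + b.chunk_grid[axis + 1:]
    if rest_a != rest_b:
        raise ValueError("chunk grids must match on all other axes")
    shift = a.chunk_grid[axis]
    entries = dict(a.entries)
    for key, ref in b.entries.items():
        new_key = key[:axis] + (key[axis] + shift,) + key[axis + 1:]
        entries[new_key] = ref
    grid = (
        a.chunk_grid[:axis]
        + (a.chunk_grid[axis] + b.chunk_grid[axis],)
        + a.chunk_grid[axis + 1:]
    )
    return Manifest(grid, entries)


m1 = Manifest((2, 1), {(0, 0): ChunkRef("a.nc", 0, 100),
                       (1, 0): ChunkRef("a.nc", 100, 100)})
m2 = Manifest((1, 1), {(0, 0): ChunkRef("b.nc", 0, 100)})
combined = concat_manifests(m1, m2, axis=0)
```

The combined manifest still holds everything needed to fetch any chunk later (path, offset, length), which is why converting it to a real array looks feasible.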


Imagine that we enabled the zarr.Array type (or some new VirtualZarrArray type) to do both indexing and concatenation lazily (proposed in zarr-developers/zarr-python#1603), and open netCDF / other files via the chunk manifest (see zarr-developers/zarr-specs#287). It could also write out just its metadata to disk via the chunk manifest ZEP. This would then:

The result would be that xarray users would basically open data (netCDF or zarr or otherwise) and see VirtualZarrArrays wrapped by Xarray. They could then do lazy operations as they do now, and either load actual values via .compute or save only the lazy metadata representation to disk as a virtual zarr store (i.e. what virtualizarr does right now). The latter could be created by special serialization functions that understand how to translate a chain of lazy Zarr array operations into a valid metadata-only zarr-compliant format on-disk, or you could even imagine ds.to_zarr having a boolean virtual kwarg to cover both cases.

The lazy layer could either be implemented inside zarr or live on top of it and be importable from other packages (i.e. #5081, see also data-apis/array-api#777).

All together this would give you:

  1. Zarr arrays that can open and decode netCDF directly (a la Zarr as a "universal reader" for netCDF etc., via new CF decoding codecs zarr-developers/zarr-specs#303)

  2. Lazy Zarr arrays even without Xarray

  3. Ability to save virtual datasets without needing a dedicated ManifestArray type (i.e. the lazy concatenation functionality of VirtualiZarr in zarr-python itself)

  4. Separation of the metadata-reading logic of kerchunk/VirtualiZarr from the lazy concatenation stuff, so VirtualiZarr gets demoted to just being a repository for readers for specific file formats and codecs for them.

  5. Complete separation of:

The main subtlety I see here is selection in index-space vs chunk-space - xarray does the former but VirtualiZarr does the latter (see also zarr-developers/VirtualiZarr#183). This is what @d-v-d was getting at in zarr-developers/VirtualiZarr#71.
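One way to picture the index-space vs chunk-space distinction, as a sketch under my own assumptions rather than how either library does it: a selection expressed in array indices can be rewritten as a selection of whole chunks only when it lines up with chunk boundaries. The function name `to_chunk_slice` is hypothetical and the example is 1-D.

```python
def to_chunk_slice(start, stop, chunk_size, length):
    """Map the index-space range [start, stop) to a chunk-space slice.

    Returns None when the range cuts through a chunk, i.e. when it cannot
    be expressed by picking whole chunks.
    """
    if not 0 <= start <= stop <= length:
        raise ValueError("range must lie within the array")
    start_aligned = start % chunk_size == 0
    # The final chunk may be partial, so stopping at the array end is fine.
    stop_aligned = stop % chunk_size == 0 or stop == length
    if not (start_aligned and stop_aligned):
        return None
    first_chunk = start // chunk_size
    end_chunk = -(-stop // chunk_size)  # ceiling division
    return slice(first_chunk, end_chunk)


aligned = to_chunk_slice(10, 30, chunk_size=10, length=100)
unaligned = to_chunk_slice(5, 30, chunk_size=10, length=100)
tail = to_chunk_slice(20, 25, chunk_size=10, length=25)
```

Here `aligned` selects chunks 1 and 2, `unaligned` comes back as None because index 5 falls inside chunk 0, and `tail` selects the partial last chunk. A lazy layer working in index-space would have to remember the leftover within-chunk offset in the unaligned case.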

Whilst this is a longer-term roadmap idea, now is the time to think about it because of the malleability of zarr-python right now (e.g. zarr-developers/zarr-python#2052).

cc @dcherian @jhamman @joshmoore @sharkinsspatial @abarciauskas-bgse
