Wrapping a `kerchunk.Array` object directly with xarray #8699
Closed
In fsspec/kerchunk#377 the idea came up of using the xarray API to concatenate arrays which represent parts of a zarr store - i.e. using xarray to kerchunk a large set of netCDF files instead of using `kerchunk.combine.MultiZarrToZarr`.
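For context, a kerchunk reference set is just a JSON-serializable dict mapping Zarr store keys either to inline metadata or to `[url, offset, length]` byte ranges, so "combining" files is purely a metadata rewrite. A minimal hand-built sketch of that idea (all paths, offsets, and the `concat_refs` helper are illustrative, not the real kerchunk API):

```python
import json

# A hypothetical single-file reference set, shaped like the output of a
# per-file kerchunk scan. All URLs, offsets and lengths here are made up.
refs_a = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/.zarray": json.dumps({
            "shape": [1, 4], "chunks": [1, 4], "dtype": "<f8",
            "compressor": None, "filters": None, "fill_value": None,
            "order": "C", "zarr_format": 2,
        }),
        # chunk key -> [url, byte offset, byte length]: no data is copied
        "temp/0.0": ["s3://bucket/file_a.nc", 8000, 32],
    },
}

refs_b = {
    "version": 1,
    "refs": {"temp/0.0": ["s3://bucket/file_b.nc", 8000, 32]},
}

def concat_refs(parts, var="temp"):
    """Naively concatenate single-chunk reference sets along axis 0.

    Stands in for what MultiZarrToZarr does: rewrite chunk keys and the
    .zarray shape, never touching the actual bytes.
    """
    out = {"version": 1, "refs": dict(parts[0]["refs"])}
    meta = json.loads(parts[0]["refs"][f"{var}/.zarray"])
    for i, part in enumerate(parts):
        out["refs"][f"{var}/{i}.0"] = part["refs"][f"{var}/0.0"]
    meta["shape"][0] = len(parts)
    out["refs"][f"{var}/.zarray"] = json.dumps(meta)
    return out

combined = concat_refs([refs_a, refs_b])
```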
The idea is to make something like this work for kerchunking sets of netCDF files into zarr stores:

```python
ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='kerchunk',  # kerchunk registers an xarray IO backend that returns zarr.Array objects
    combine='nested',   # 'by_coords' would require actually reading coordinate data
    parallel=True,      # would use dask.delayed to generate reference dicts for each file in parallel
)

ds  # now wraps a bunch of zarr.Array / kerchunk.Array objects, no need for dask arrays

# kerchunk defines an xarray accessor that extracts the zarr arrays and serializes them
# (which could also be done in parallel if writing to parquet)
ds.kerchunk.to_zarr(store='out.zarr')
```
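The `parallel=True` step can be prototyped without dask at all, since generating reference dicts is embarrassingly parallel per file. A hedged sketch using `concurrent.futures` in place of `dask.delayed`, where `make_references` is a placeholder for a real per-file kerchunk scan:

```python
from concurrent.futures import ThreadPoolExecutor

def make_references(path):
    # Placeholder for a real per-file scan that extracts byte ranges;
    # returns a small reference dict instead of reading any array data.
    return {"version": 1, "refs": {"source": path}}

paths = [f"/my/files{i}.nc" for i in range(4)]  # illustrative file list

# Generate one reference dict per file concurrently, preserving input
# order - analogous to what parallel=True would do with dask.delayed.
with ThreadPoolExecutor() as pool:
    reference_sets = list(pool.map(make_references, paths))
```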
I had a go at doing this in this notebook, and in doing so discovered a few potential issues with xarray's internals.
For this to work xarray has to:
- Wrap a `kerchunk.Array` object which barely defines any array API methods, including basically not supporting indexing at all,
- Store all the information present in a kerchunked Zarr store but without ever loading any data,
- Not create any indexes by default during dataset construction or during `xr.concat`,
- Not try to do anything else that can't be defined for a `kerchunk.Array`,
- Possibly we also need the lazy indexing classes to support concatenation (Lazy concatenation of arrays #4628).
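To get a feel for the first two constraints, here is a hypothetical minimal lazy array: it carries only shape, dtype, and chunk references, refuses any attempt to index into real values, and supports concatenation purely at the metadata level (all names here are illustrative, not the real `kerchunk.Array` API):

```python
class LazyChunkedArray:
    """Hypothetical stand-in for kerchunk.Array: metadata plus byte-range refs only."""

    def __init__(self, shape, dtype, chunk_refs):
        self.shape = tuple(shape)
        self.dtype = dtype            # stored as a string like "<f8"
        self.chunk_refs = chunk_refs  # list of [url, offset, length] entries

    @property
    def ndim(self):
        return len(self.shape)

    def __getitem__(self, key):
        # Indexing would require reading real bytes, which this class never does.
        raise NotImplementedError("lazy reference array does not support indexing")

    def concat(self, other, axis=0):
        # Concatenation only touches metadata: dtypes and off-axis sizes must match.
        assert self.dtype == other.dtype
        assert all(
            s == o
            for i, (s, o) in enumerate(zip(self.shape, other.shape))
            if i != axis
        )
        shape = list(self.shape)
        shape[axis] += other.shape[axis]
        return LazyChunkedArray(shape, self.dtype, self.chunk_refs + other.chunk_refs)

a = LazyChunkedArray((1, 4), "<f8", [["s3://bucket/a.nc", 8000, 32]])
b = LazyChunkedArray((1, 4), "<f8", [["s3://bucket/b.nc", 8000, 32]])
c = a.concat(b)  # shape (2, 4), still no data loaded
```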
It's an interesting exercise in using xarray as an abstraction, with no access to real numerical values at all.