Closed
Labels
array API standard (Support for the Python array API standard), bug, contrib-help-wanted, topic-combine (combine/concat/merge), topic-indexing, topic-internals, topic-lazy array, topic-zarr (Related to zarr storage library)
Description
What is your issue?
In fsspec/kerchunk#377 the idea came up of using the xarray API to concatenate arrays which represent parts of a zarr store, i.e. using xarray to kerchunk a large set of netCDF files instead of using kerchunk.combine.MultiZarrToZarr.
The idea is to make something like this work for kerchunking sets of netCDF files into zarr stores:
import xarray as xr

ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='kerchunk', # kerchunk registers an xarray IO backend that returns zarr.Array objects
    combine='nested', # 'by_coords' would require actually reading coordinate data
    parallel=True, # would use dask.delayed to generate reference dicts for each file in parallel
)
ds # now wraps a bunch of zarr.Array / kerchunk.Array objects, no need for dask arrays
ds.kerchunk.to_zarr(store='out.zarr') # kerchunk defines an xarray accessor that extracts the zarr arrays and serializes them (which could also be done in parallel if writing to parquet)
I had a go at doing this in this notebook, and in doing so discovered a few potential issues with xarray's internals.
For this to work xarray has to:
- Wrap a kerchunk.Array object which barely defines any array API methods, including basically not supporting indexing at all,
- Store all the information present in a kerchunked Zarr store but without ever loading any data,
- Not create any indexes by default during dataset construction or during xr.concat,
- Not try to do anything else that can't be defined for a kerchunk.Array.
- Possibly we need the Lazy Indexing classes to support concatenation (Lazy concatenation of arrays #4628).
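To make the requirements above concrete, here is a minimal sketch of what such a data-free array and its concatenation might look like. Everything in it is an assumption for illustration: `RefArray`, `ChunkRef`, and `concat_refs` are hypothetical names, not part of kerchunk or xarray, and the (path, offset, length) reference layout is only loosely modelled on kerchunk-style references. The point is that concatenation only needs to re-key chunk references, never to read values.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

# Hypothetical reference to one stored chunk: (file path, byte offset, byte length).
ChunkRef = Tuple[str, int, int]


@dataclass
class RefArray:
    """Hypothetical stand-in for a kerchunk.Array: metadata and chunk references, no data."""

    shape: Tuple[int, ...]
    chunks: Tuple[int, ...]
    dtype: str
    refs: Dict[Tuple[int, ...], ChunkRef]  # chunk grid index -> chunk reference

    @property
    def ndim(self) -> int:
        return len(self.shape)

    def __getitem__(self, key):
        # Indexing would need real values, which this object never holds.
        raise NotImplementedError("RefArray does not support indexing")


def concat_refs(arrays: Sequence[RefArray], axis: int = 0) -> RefArray:
    """Concatenate along `axis` by shifting chunk grid indices; no data is read."""
    if not arrays:
        raise ValueError("need at least one array")
    first = arrays[0]
    new_refs: Dict[Tuple[int, ...], ChunkRef] = {}
    chunk_offset = 0  # number of chunks already placed along `axis`
    length = 0        # number of elements already placed along `axis`
    for i, arr in enumerate(arrays):
        if arr.dtype != first.dtype or arr.chunks != first.chunks or arr.ndim != first.ndim:
            raise ValueError("dtype, chunk shape and ndim must match")
        for dim in range(first.ndim):
            if dim != axis and arr.shape[dim] != first.shape[dim]:
                raise ValueError("shapes must match on all axes except the concat axis")
        # Every array but the last must end on a chunk boundary along `axis`,
        # otherwise the combined chunk grid would not be regular.
        if i < len(arrays) - 1 and arr.shape[axis] % arr.chunks[axis] != 0:
            raise ValueError("array does not end on a chunk boundary")
        for idx, ref in arr.refs.items():
            shifted = idx[:axis] + (idx[axis] + chunk_offset,) + idx[axis + 1:]
            new_refs[shifted] = ref
        chunk_offset += -(-arr.shape[axis] // arr.chunks[axis])  # ceiling division
        length += arr.shape[axis]
    new_shape = first.shape[:axis] + (length,) + first.shape[axis + 1:]
    return RefArray(new_shape, first.chunks, first.dtype, new_refs)


# Two "files", each contributing references but no loaded values.
a = RefArray((2, 4), (1, 4), "<f8", {(0, 0): ("a.nc", 100, 32), (1, 0): ("a.nc", 132, 32)})
b = RefArray((1, 4), (1, 4), "<f8", {(0, 0): ("b.nc", 100, 32)})
combined = concat_refs([a, b], axis=0)
```

In this sketch `combined` has shape `(3, 4)` and the chunk from `b.nc` ends up at grid index `(2, 0)`. An xarray wrapper around such an object would still have to avoid indexing and index creation, which is exactly where the issues listed above come in.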
It's an interesting exercise in using xarray as an abstraction, with no access to real numerical values at all.