Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrapping a kerchunk.Array object directly with xarray #8699

Closed
TomNicholas opened this issue Feb 3, 2024 · 4 comments
Closed

Wrapping a kerchunk.Array object directly with xarray #8699

TomNicholas opened this issue Feb 3, 2024 · 4 comments
Labels
array API standard Support for the Python array API standard bug contrib-help-wanted topic-combine combine/concat/merge topic-indexing topic-internals topic-lazy array topic-zarr Related to zarr storage library

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Feb 3, 2024

What is your issue?

In fsspec/kerchunk#377 the idea came up of using the xarray API to concatenate arrays which represent parts of a zarr store - i.e. using xarray to kerchunk a large set of netCDF files instead of using kerchunk.combine.MultiZarrToZarr.

The idea is to make something like this work for kerchunking sets of netCDF files into zarr stores

ds = xr.open_mfdataset(
    '/my/files*.nc'
    engine='kerchunk',  # kerchunk registers an xarray IO backend that returns zarr.Array objects
    combine='nested',  # 'by_coords' would require actually reading coordinate data
    parallel=True,  # would use dask.delayed to generate reference dicts for each file in parallel
)

ds  # now wraps a bunch of zarr.Array / kerchunk.Array objects, no need for dask arrays

ds.kerchunk.to_zarr(store='out.zarr')  # kerchunk defines an xarray accessor that extracts the zarr arrays and serializes them (which could also be done in parallel if writing to parquet)

I had a go at doing this in this notebook, and in doing so discovered a few potential issues with xarray's internals.

For this to work xarray has to:

  • Wrap a kerchunk.Array object which barely defines any array API methods, including basically not supporting indexing at all,
  • Store all the information present in a kerchunked Zarr store but without ever loading any data,
  • Not create any indexes by default during dataset construction or during xr.concat,
  • Not try to do anything else that can't be defined for a kerchunk.Array.
  • Possibly we need the Lazy Indexing classes to support concatenation Lazy concatenation of arrays #4628

It's an interesting exercise in using xarray as an abstraction, with no access to real numerical values at all.

@TomNicholas TomNicholas added bug contrib-help-wanted topic-internals topic-indexing topic-zarr Related to zarr storage library topic-combine combine/concat/merge topic-lazy array array API standard Support for the Python array API standard labels Feb 3, 2024
@TomNicholas
Copy link
Member Author

TomNicholas commented Feb 3, 2024

One issue that came up around not being able to avoid creating indexes for 1D coordinates (@benbovy): #8107 (comment)

@TomNicholas
Copy link
Member Author

TomNicholas commented Feb 3, 2024

Another one is why does opening a dataset require indexing into it?

If I add a print(indexer) inside Variable.__getitem__ here

data = as_indexable(self._data)[indexer]

this happens:

In [3]: xr.tutorial.open_dataset('air_temperature')
Out[3]: BasicIndexer((slice(None, 13, None),))
BasicIndexer((slice(-12, None, None),))
BasicIndexer((slice(None, 14, None),))
BasicIndexer((slice(-13, None, None),))
BasicIndexer((slice(None, 12, None),))
BasicIndexer((slice(-11, None, None),))

why is any of that indexing happening?

EDIT: Oh it's happening when the ds gets printed...

EDIT2: False alarm about this one 😅

@TomNicholas
Copy link
Member Author

It's interesting to define a minimal ConcatenatableArray and see what is required to get that to work inside xarray. Again we get problems with indexes being built unexpectedly.

Notebook for ConcatenatableArray here

@TomNicholas
Copy link
Member Author

Closing this as successfully implemented in the virtualizarr.ManifestArray class. It requires defining some methods that are no-ops, which xarray could in theory recognize are no-ops and avoid calling at all, but it works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
array API standard Support for the Python array API standard bug contrib-help-wanted topic-combine combine/concat/merge topic-indexing topic-internals topic-lazy array topic-zarr Related to zarr storage library
Projects
Development

No branches or pull requests

1 participant