Description
"Xarray has some secret private classes for lazily indexing / wrapping arrays that are so useful I think they should be broken out into a standalone package. https://github.com/pydata/xarray/blob/master/xarray/core/indexing.py#L516"
The idea here is to create a first-class "duck array" library for lazy indexing that could replace xarray's internal classes for lazy indexing. This would be in some ways similar to dask.array, but much simpler, because it doesn't have to worry about parallel computing.
Desired features:
- Lazy indexing
- Lazy transposes
- Lazy concatenation (Lazy concatenation of arrays #4628) and stacking
- Lazy vectorized operations (e.g., unary and binary arithmetic)
  - needed for decoding variables from disk (xarray.encoding) and
  - building lazy multi-dimensional coordinate arrays corresponding to map projections (Idea: functionally-derived non-dimensional coordinates #3620)
- Maybe: lazy reshapes (xarray.DataArray.stack load data into memory #4113)
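To make the first two items concrete, here is a minimal sketch of what lazy indexing and lazy transposes could look like. It is not xarray's internal implementation: the class name `LazyIndexedArray` and its attributes are hypothetical, it assumes the wrapped object supports NumPy-style indexing with `np.ix_`, and it only handles slices and integer sequences (no scalar integer keys).

```python
import numpy as np


class LazyIndexedArray:
    """Hypothetical wrapper that records indexing and transposes lazily."""

    def __init__(self, array, indices=None, order=None):
        self.array = array
        # One integer index vector per axis of the wrapped array.
        if indices is None:
            indices = tuple(np.arange(n) for n in array.shape)
        self.indices = indices
        # Permutation of the wrapped array's axes, applied on materialization.
        self.order = order if order is not None else tuple(range(array.ndim))

    @property
    def shape(self):
        return tuple(len(self.indices[ax]) for ax in self.order)

    def __getitem__(self, key):
        if not isinstance(key, tuple):
            key = (key,)
        key = key + (slice(None),) * (len(self.order) - len(key))
        new = list(self.indices)
        for k, ax in zip(key, self.order):
            if isinstance(k, (int, np.integer)):
                raise NotImplementedError("scalar keys are not sketched here")
            # Compose the new key with the index vector already recorded.
            new[ax] = self.indices[ax][k]
        return LazyIndexedArray(self.array, tuple(new), self.order)

    def transpose(self, *axes):
        # Only the axis permutation changes; no data is touched.
        order = tuple(self.order[a] for a in axes)
        return LazyIndexedArray(self.array, self.indices, order)

    def __array__(self, dtype=None, copy=None):
        # The only place where the wrapped array is actually read.
        out = np.asarray(self.array[np.ix_(*self.indices)]).transpose(self.order)
        return out if dtype is None else out.astype(dtype)


base = np.arange(24).reshape(4, 6)
lazy = LazyIndexedArray(base)
# Nothing is read from `base` until np.asarray() is called.
view = lazy[1:3, ::2].transpose(1, 0)[[0, 2]]
result = np.asarray(view)
```

Each indexing or transpose call returns a new wrapper that shares the underlying array, so chains of operations stay cheap until the data is materialized.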
A common feature of these operations is that they can (and almost always should) be fused with indexing: if N elements are selected via indexing, only O(N) compute and memory is required to produce them, regardless of the size of the original arrays, as long as the number of applied operations can be treated as a constant. Memory access is significantly slower than compute on modern hardware, so recomputing these operations on the fly is almost always a good idea.
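A rough sketch of this fusion, assuming a hypothetical `LazyElementwise` wrapper and a made-up scale/offset `decode` function standing in for a cheap decoding step: indexing happens first, so the function only ever sees the selected elements.

```python
import numpy as np


class LazyElementwise:
    """Hypothetical wrapper that applies `func` only to indexed elements."""

    def __init__(self, func, array):
        self.func = func
        self.array = array
        self.shape = array.shape

    def __getitem__(self, key):
        # Index first, then compute, so the work scales with the selection.
        return self.func(np.asarray(self.array[key]))

    def __array__(self, dtype=None, copy=None):
        out = self[...]
        return out if dtype is None else out.astype(dtype)


elements_decoded = []


def decode(raw):
    # Made-up scale and offset; records how many elements were processed.
    elements_decoded.append(raw.size)
    return raw * 0.01 + 273.15


raw = np.arange(1_000_000, dtype=np.int32).reshape(1000, 1000)
lazy = LazyElementwise(decode, raw)
preview = lazy[:2, :3]  # decodes 6 elements, not 1,000,000
```

Nothing is cached here: indexing the same region twice decodes it twice, which is the recompute-on-the-fly trade-off described above.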
Out of scope: lazy computation when indexing could require access to many more elements to compute the desired value than are returned. For example, mean() probably should not be lazy, because that could involve computation of a very large number of elements that one might want to cache.
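As a small illustration of the difference, with an arbitrary array shape chosen only for this example: one output element of an elementwise operation depends on one input element, while one output element of a mean along an axis depends on every input element along that axis.

```python
import numpy as np

data = np.arange(12_000, dtype=np.float64).reshape(1000, 12)

# Elementwise operation: output element [3, 5] needs one input element.
elementwise_inputs = data[3:4, 5:6].size

# Reduction: output element [5] of mean(axis=0) needs a whole column.
reduction_inputs = data[:, 5:6].size

# Computing that single mean value still reads all 1000 rows of column 5.
one_mean_value = data[:, 5].mean()
```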
This is valuable functionality for Xarray for two reasons:
- It allows for "previewing" small bits of data loaded from disk or remote storage, even if that data needs some form of cheap "decoding" from its form on disk.
- It allows for xarray to decode data in a lazy fashion that is compatible with full-featured systems for lazy computation (e.g., Dask), without requiring the user to choose dask when reading the data.
Related issues:
- [Proposal] Expose Variable without Pandas dependency #3981
- Lazy concatenation of arrays #4628