Description
Is your feature request related to a problem?
In some situations, it is important to be able to select multiple (disconnected) contiguous regions along a dimension, even if the dimension itself is very large.
If there's enough client memory, it is possible to emulate this by materializing the slices into an array of integers, but this becomes infeasible if that array is too large (and while supposedly we should be able to index by a dask integer array, I'm not sure how efficient that would be).
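For illustration, the emulation by materializing the slices could look roughly like the sketch below. The helper name `slices_to_indices` and the toy dataset are made up for this example; only `Dataset.isel`, `slice.indices`, `np.arange` and `np.concatenate` are existing APIs.

```python
import numpy as np
import xarray as xr


def slices_to_indices(slices, size):
    """Materialize a list of slices into one integer position array.

    Hypothetical helper, not part of xarray.
    """
    # slice.indices(size) normalizes None / negative bounds to (start, stop, step)
    return np.concatenate([np.arange(*s.indices(size)) for s in slices])


# toy dataset standing in for a (much larger) real one
ds = xr.Dataset({"value": ("cells", np.arange(100))})
slices = [slice(2, 5), slice(90, 93)]

positions = slices_to_indices(slices, ds.sizes["cells"])
subset = ds.isel(cells=positions)
```

The memory problem is visible here: every selected element costs one 64-bit integer, so a single slice covering about 1.8 billion cells would already need roughly 14 to 15 GB for the indexer alone.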
Examples of where this would be useful include:
- The healpix MOC index in xarray-contrib/xdggs#151, where cell ids are represented as a set of disconnected ranges at the smallest possible refinement level. To be able to support Index.sel, I'd need to return an IndexSelResult with either a list of slices, or materialize these into an integer array and error out if that wouldn't fit into memory (or try to use dask as an indexer).
- @tomwhite's use-case of selecting disconnected regions in a genome (see Add bcftools-style filtering sgkit-dev/sgkit#1330 (comment)). I'll let him provide further details.
cc @benbovy, @shoyer, @TomNicholas, @dcherian
Describe the solution you'd like
I'd love to be able to specify this as another kind of indexer:
indexer = SliceSet([slice(20, 5000), slice(12078432, 1850372894)])
ds.isel(cells=indexer)
but that will obviously further increase the complexity of the indexing machinery.
Describe alternatives you've considered
Manually iterating over the slices and then concatenating the results is possible, but it adds overhead if done using the xarray API. However, I don't see a way to make that work as part of IndexSelResult.
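As a rough sketch of that manual alternative (the helper name `isel_slices` and the toy dataset are made up for this example; `Dataset.isel` and `xr.concat` are the existing xarray functions):

```python
import numpy as np
import xarray as xr


def isel_slices(ds, dim, slices):
    """Select each slice separately, then stitch the pieces back together.

    Hypothetical helper, not part of xarray.
    """
    # one isel call per slice: cheap basic indexing, but one new object each
    parts = [ds.isel({dim: s}) for s in slices]
    # concatenating along the same dimension restores a single dataset
    return xr.concat(parts, dim=dim)


# toy dataset standing in for a (much larger) real one
ds = xr.Dataset({"value": ("cells", np.arange(100))})
subset = isel_slices(ds, "cells", [slice(2, 5), slice(90, 93)])
```

This avoids building a huge integer indexer, but it only works from user code: every slice creates an intermediate object, and the concatenation happens outside of the indexing machinery.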