indexing by a list of slices #10479

Open
@keewis

Description

Is your feature request related to a problem?

In some situations, it is important to be able to select multiple (disconnected) contiguous regions along a dimension, even if the dimension itself is very large.

If there's enough client memory, this can be emulated by materializing the slices into an array of integers, but that becomes infeasible once the array gets too large. (Supposedly we should be able to index by a dask integer array, but I'm not sure how efficient that would be.)
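To get a feel for when materializing becomes infeasible, here is a rough back-of-the-envelope sketch using the two slices from the indexer example in this issue. The helper name `materialized_size`, the dimension size, and the 8-byte (int64) item size are assumptions for illustration, and the slices are assumed not to overlap.

```python
def materialized_size(slices, dim_size, itemsize=8):
    """Return (n_elements, n_bytes) of the integer array equivalent to `slices`.

    Hypothetical helper, not part of xarray. Assumes non-overlapping slices.
    """
    # slice.indices() clips the slice to the dimension size; range() gives its length
    n_elements = sum(len(range(*s.indices(dim_size))) for s in slices)
    return n_elements, n_elements * itemsize


slices = [slice(20, 5000), slice(12078432, 1850372894)]
dim_size = 2_000_000_000  # assumed size of the "cells" dimension

n_elements, n_bytes = materialized_size(slices, dim_size)
print(f"{n_elements} indices, about {n_bytes / 1e9:.1f} GB as int64")
```

Under these assumptions the integer indexer alone needs roughly 14.7 GB, while the two slice objects that describe the same selection take a negligible amount of memory.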

Examples of where this would be useful include:

  • The healpix MOC index (xarray-contrib/xdggs#151), where cell ids are represented as a set of disconnected ranges at the smallest possible refinement level. To support Index.sel, I'd need to return an IndexSelResult with a list of slices, or else materialize the slices into an integer array and error out if that wouldn't fit into memory (or try to use dask as an indexer).
  • @tomwhite's use case of selecting disconnected regions in a genome (see the comment on "Add bcftools-style filtering", sgkit-dev/sgkit#1330). I'll let him provide further details.

cc @benbovy, @shoyer, @TomNicholas, @dcherian

Describe the solution you'd like

I'd love to be able to specify this as another kind of indexer:

indexer = SliceSet([slice(20, 5000), slice(12078432, 1850372894)])
ds.isel(cells=indexer)

but that would obviously increase the complexity of the indexing machinery even further.

Describe alternatives you've considered

Manually iterating over the slices and then concatenating the results is possible, but it adds overhead if done through the xarray API. I also don't see a way for that to work as part of IndexSelResult.
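For reference, a minimal sketch of that manual workaround, using only `Dataset.isel` and `xr.concat`. The toy dataset, the variable name `value`, and the small slices are made up for illustration; the dimension name `cells` follows the example above.

```python
import numpy as np
import xarray as xr

# Small stand-in dataset; a real one would be far larger (and likely dask-backed).
ds = xr.Dataset({"value": ("cells", np.arange(100))})

# Hypothetical disconnected contiguous regions along "cells".
slices = [slice(2, 5), slice(40, 43)]

# Select each contiguous region separately, then stitch the pieces back together.
parts = [ds.isel(cells=s) for s in slices]
result = xr.concat(parts, dim="cells")

values = result["value"].values.tolist()
```

Each `isel` call and the final `concat` go through the full xarray machinery, which is where the extra overhead comes from, and the result is produced outside of the index, so it cannot be returned from `Index.sel`.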
