Automatically create xindex? #9703

Open
@max-sixty

Is your feature request related to a problem?

I'm trying to use xindex more. Currently, selecting values with a coordinate that hasn't been explicitly indexed via set_xindex() raises a KeyError:

ds = xr.tutorial.open_dataset("air_temperature").assign_coords(lat2=lambda x: x.lat)

ds
# Output:
<xarray.Dataset> Size: 31MB
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
    lat2     (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
Data variables:
    air      (time, lat, lon) float64 31MB ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

# Attempting to select using the unindexed coordinate raises an error:
ds.sel(lat2=75)
# Output:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[20], line 1
----> 1 ds.sel(lat2=75)

File ~/workspace/xarray/xarray/core/dataset.py:3223, in Dataset.sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   3155 """Returns a new dataset with each array indexed by tick labels
   3156 along the specified dimension(s).
   3157
   (...)
   3220
   3221 """
   3222 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 3223 query_results = map_index_queries(
   3224     self, indexers=indexers, method=method, tolerance=tolerance
   3225 )
   3227 if drop:
   3228     no_scalar_variables = {}

File ~/workspace/xarray/xarray/core/indexing.py:186, in map_index_queries(obj, indexers, method, tolerance, **indexers_kwargs)
    183     options = {"method": method, "tolerance": tolerance}
    185 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "map_index_queries")
--> 186 grouped_indexers = group_indexers_by_index(obj, indexers, options)
    188 results = []
    189 for index, labels in grouped_indexers:

File ~/workspace/xarray/xarray/core/indexing.py:145, in group_indexers_by_index(obj, indexers, options)
    143     grouped_indexers[index_id][key] = label
    144 elif key in obj.coords:
--> 145     raise KeyError(f"no index found for coordinate {key!r}")
    146 elif key not in obj.dims:
    147     raise KeyError(
    148         f"{key!r} is not a valid dimension or coordinate for "
    149         f"{obj.__class__.__name__} with dimensions {obj.dims!r}"
    150     )

KeyError: "no index found for coordinate 'lat2'"

After explicitly setting the index, it works as expected:

ds.set_xindex('lat2').sel(lat2=75)
# Output:
<xarray.Dataset> Size: 1MB
Dimensions:  (time: 2920, lon: 53)
Coordinates:
    lat      float32 4B 75.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
    lat2     float32 4B 75.0
Data variables:
    air      (time, lon) float64 1MB ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

It's a bit annoying. Frequently I attempt to select something, realize it doesn't have an index, add the .set_xindex call, and then try to remember to add each one at object creation. It feels like xarray isn't being as helpful as it could be.

Describe the solution you'd like

Could we instead set the xindex automatically when calling .sel?

Possibly we'd want to force the user to create the index once, rather than paying the cost of creating a new index on each call? On the other hand, creating one seems relatively cheap:

%timeit ds.assign_coords(lat2=ds.lat + 2).set_xindex('lat2')

349 µs ± 6.97 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(I guess it could be possible to update a cache in place, after which creating a new index from the cache would be very cheap. But that could also be a source of quite confusing behavior if our implementation is in any way wrong, or if people are sharing objects across threads, etc. That is, the principle of "don't update in place" is useful.)
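
For illustration, the behavior requested in this section can be approximated today with a small wrapper around .sel. This is only a sketch: sel_auto_index is a hypothetical helper name, not an xarray API, and the dataset is a small synthetic stand-in for the tutorial data. It assumes every unindexed coordinate being selected on is 1-D, so the default index created by set_xindex applies, and it rebuilds the index on every call rather than caching it.

```python
import numpy as np
import xarray as xr


def sel_auto_index(obj, **indexers):
    """Hypothetical helper: index unindexed 1-D coordinates, then select.

    Returns a new object; the input is never modified in place.
    """
    missing = [
        name
        for name in indexers
        if name in obj.coords
        and name not in obj.xindexes
        and obj.coords[name].ndim == 1
    ]
    for name in missing:
        # set_xindex returns a new object with a default index on `name`
        obj = obj.set_xindex(name)
    return obj.sel(**indexers)


# Small synthetic dataset: `lat2` is a non-dimension coordinate without an index
ds = xr.Dataset(
    {"air": (("time", "lat"), np.arange(6.0).reshape(2, 3))},
    coords={
        "time": [0, 1],
        "lat": [75.0, 72.5, 70.0],
        "lat2": ("lat", [75.0, 72.5, 70.0]),
    },
)

result = sel_auto_index(ds, lat2=75.0)
```

Because the helper never mutates ds, the original object still has no index on lat2 afterwards, which is the trade-off discussed above: no in-place surprises, but the index creation cost is paid on each call.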

Describe alternatives you've considered

A set_xindex(...) option (i.e. literally passing an ellipsis, ...) that creates all the indexes it can, which folks could call once after creating an object?
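
A rough sketch of what this alternative might do, written as a standalone function since set_xindex does not accept an ellipsis today. The name set_all_xindexes is hypothetical, and "all the indexes that it can" is interpreted here as "every 1-D coordinate that has no index yet", which is an assumption about the intended semantics.

```python
import numpy as np
import xarray as xr


def set_all_xindexes(obj):
    """Hypothetical helper: add a default index to every unindexed 1-D coordinate."""
    candidates = [
        name
        for name, coord in obj.coords.items()
        if name not in obj.xindexes and coord.ndim == 1
    ]
    for name in candidates:
        obj = obj.set_xindex(name)
    return obj


# Synthetic dataset with two unindexed 1-D coordinates and one 2-D coordinate
ds = xr.Dataset(
    {"air": (("time", "lat"), np.arange(6.0).reshape(2, 3))},
    coords={
        "time": [0, 1],
        "lat": [75.0, 72.5, 70.0],
        "lat2": ("lat", [75.0, 72.5, 70.0]),
        "station": ("lat", ["a", "b", "c"]),
        "area": (("time", "lat"), np.ones((2, 3))),  # 2-D, left unindexed
    },
)

indexed = set_all_xindexes(ds)
by_station = indexed.sel(station="b")
```

The 2-D coordinate is skipped in this sketch; coordinates that need a custom index type would still have to be handled explicitly.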

Additional context

No response
