Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding vectorized indexing docs #4711

Merged
merged 10 commits into from
Feb 16, 2021
16 changes: 16 additions & 0 deletions doc/indexing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -395,6 +395,22 @@ These methods may also be applied to ``Dataset`` objects
ds = da.to_dataset(name="bar")
ds.isel(x=xr.DataArray([0, 1, 2], dims=["points"]))

Vectorized indexing may be used to extract information from the nearest
grid cells of interest, for example, the nearest climate model grid
cells to a collection specified weather station latitudes and longitudes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did tell you to modify the vectorized indexing section, but while reading #3768 I noticed that we already document that in the More advanced indexing section. Given that it uses pointwise indexing as its only example, maybe we should rename it to Pointwise indexing to make this easier to find?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like that section should be merged with "Vectorized Indexing" which could then be renamed to "Pointwise (or vectorized) Indexing)"?

Copy link
Collaborator

@keewis keewis Dec 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. Also, the term vectorized indexing is not really explained (unless I'm missing something, there's only a comment hinting at it's meaning), so it might be good to explicitly do that somewhere. This sounds like a lot of work, though, so it might be better to do that in a different PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting discussion. At this point, I don't think I have much to add here. So I concur that this should be addressed in a different PR.


.. ipython:: python

ds = xr.tutorial.open_dataset("air_temperature")

# Define target latitude and longitude (where weather stations might be)
tgt_lat = xr.DataArray(np.linspace(40, 45, num=6), dims="points")
dcherian marked this conversation as resolved.
Show resolved Hide resolved
tgt_lon = xr.DataArray(np.linspace(200, 205, num=6), dims="points")

# Retrieve data at the grid cells nearest to the target latitudes and longitudes
da = ds['air'].sel(lon=tgt_lon, lat=tgt_lat, method='nearest')
mathause marked this conversation as resolved.
Show resolved Hide resolved
da

.. tip::

If you are lazily loading your data from disk, not every form of vectorized
Expand Down
49 changes: 49 additions & 0 deletions xarray/core/dataarray.py
Original file line number Diff line number Diff line change
Expand Up @@ -1195,6 +1195,55 @@ def sel(
Dataset.sel
DataArray.isel

Examples
--------
>>> ds = xr.tutorial.open_dataset("air_temperature")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually prefer having example data generated by np.arange or np.linspace (or by hand) because that makes it much easier to figure out what the example does. Also, this is the DataArray docstring so explicitly creating a DataArray might be better?

Suggested change
>>> ds = xr.tutorial.open_dataset("air_temperature")
>>> da = xr.DataArray(
... np.arange(20).reshape(5, 10),
... coords={"x": np.arange(5), "y": np.arange(10)},
... dims=("x", "y"),
... )

The doctests / documentation fail because of the repr of the tutorial dataset (the "description" attribute contains a \n (newline) character), so with synthetic data we would avoid that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip. I've implemented your suggestion in DataArray.sel. For DataArray.isel method='nearest doesn't appear to be an option? So I tried this, but it doesn't seem to work. I'm not sure why?

da = xr.DataArray(
                np.arange(25).reshape(5, 5),
                coords={"x": np.arange(5), "y": np.arange(5)},
                dims=("x", "y")
            )
tgt_x = xr.DataArray(np.linspace(0, 2, num=3), dims="points")
tgt_y = xr.DataArray(np.linspace(0, 2, num=3), dims="points")

da.isel(x=tgt_x, y=tgt_y) 

Copy link
Collaborator

@keewis keewis Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isel expects int, but linspace returns float. Since you're using it to generate indices, I would use arange instead (you could also use the dtype parameter of linspace).

Edit: yes, it seems isel does not support method (I guess that's to stay close to numpy indexing). Also, since this is isel, you don't need to pass coords.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on building an example for Dataset.sel using method='nearest'

import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
	data_vars=dict(
			temperature=(["x", "y", "time"], np.arange(125).reshape(5, 5, 5))
		),
		coords=dict(
			lon=(["x", "y"], np.arange(25).reshape(5, 5)),
			lat=(["x", "y"], np.arange(25).reshape(5, 5)),
			time=pd.date_range("2014-09-06", periods=5),
		),
		attrs=dict(description="Weather related data.")
)

# Define target latitude and longitude
tgt_x = xr.DataArray(np.arange(0, 5), dims="points")
tgt_y = xr.DataArray(np.arange(0, 5), dims="points")

# Dataset.sel
ds_sel = ds.sel(x=tgt_x, y=tgt_y, method='nearest')

However I get the following error, which I don't understand. I assume my coordinate definitions are inconsistent?

Traceback (most recent call last):
  File "/Users/erke2265/Documents/xarray/xarray/core/indexing.py", line 257, in remap_label_indexers
    index = data_obj.indexes[dim]
  File "/Users/erke2265/Documents/xarray/xarray/core/indexes.py", line 64, in __getitem__
    return self._indexes[key]
KeyError: 'x'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tmp.py", line 24, in <module>
    ds_sel = ds.sel(x=tgt_x, y=tgt_y, method='nearest')
  File "/Users/erke2265/Documents/xarray/xarray/core/dataset.py", line 2275, in sel
    self, indexers=indexers, method=method, tolerance=tolerance
  File "/Users/erke2265/Documents/xarray/xarray/core/coordinates.py", line 398, in remap_label_indexers
    obj, v_indexers, method=method, tolerance=tolerance
  File "/Users/erke2265/Documents/xarray/xarray/core/indexing.py", line 262, in remap_label_indexers
    "cannot supply ``method`` or ``tolerance`` "
ValueError: cannot supply ``method`` or ``tolerance`` when the indexed dimension does not have an associated coordinate.

Copy link
Collaborator

@keewis keewis Jan 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason is that you define lat and lon as 2D non-dimension coordinates and x and y as dimensions without coordinates. As the error says, sel with method or tolerance only works with dimensions with coordinates, so you need to add coordinates to x and y (actually, I would have expected sel to fail on dimensions without coordinates even without method or tolerance, so I'm a bit surprised the error message implies that could work).

This also means that you don't have to define lat and lon since you won't be using them in this example (it's fine to keep it if you want to use them in later examples, though).

>>> ds
<xarray.Dataset>
Dimensions: (lat: 25, lon: 53, time: 2920)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

>>> tgt_lat = xr.DataArray(np.linspace(40, 45, num=6), dims="points")
>>> tgt_lon = xr.DataArray(np.linspace(200, 205, num=6), dims="points")
>>> da = ds['air'].sel(lon=tgt_lon, lat=tgt_lat, method='nearest')
>>> da
<xarray.DataArray 'air' (time: 2920, points: 6)>
array([[284.6 , 284.6 , 283.19998, 283.19998, 280.19998, 280.19998],
[283.29 , 283.29 , 282.79 , 282.79 , 280.79 , 280.79 ],
[282. , 282. , 280.79 , 280.79 , 280. , 280. ],
...,
[282.49 , 282.49 , 281.29 , 281.29 , 280.49 , 280.49 ],
[282.09 , 282.09 , 280.38998, 280.38998, 279.49 , 279.49 ],
[282.09 , 282.09 , 280.59 , 280.59 , 279.19 , 279.19 ]],
dtype=float32)
Coordinates:
lat (points) float32 40.0 40.0 42.5 42.5 45.0 45.0
lon (points) float32 200.0 200.0 202.5 202.5 205.0 205.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Dimensions without coordinates: points
Attributes:
long_name: 4xDaily Air temperature at sigma level 995
units: degK
precision: 2
GRIB_id: 11
GRIB_name: TMP
var_desc: Air temperature
dataset: NMC Reanalysis
level_desc: Surface
statistic: Individual Obs
parent_stat: Other
actual_range: [185.16 322.1 ]
"""
ds = self._to_temp_dataset().sel(
indexers=indexers,
Expand Down