Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding vectorized indexing docs #4711

Merged
merged 10 commits into from
Feb 16, 2021

Conversation

EricKeenan
Copy link
Contributor

@EricKeenan EricKeenan commented Dec 18, 2020

#4630: Adds a new vectorized indexing example to sel docstring and narrative docs.

Thanks to @dcherian for introducing me to vectorized indexing and @keewis for providing some information to get started. Also thanks to the community for the excellent contribution guide. http://xarray.pydata.org/en/stable/contributing.html

Am I missing anything here? Or is there anything that can be improved? I'm happy to see this through - thanks in advance for any feedback/tips!

Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks and welcome to xarray, @EricKeenan, this is definitely a valuable addition.

Could you also add these examples to the docstrings of DataArray.isel, Dataset.sel, and Dataset.isel?

@max-sixty, I think some time ago we were wondering if it was possible to link to the narrative documentation from docstrings. The documentation of the indexers parameter of sel contains such a link:

carried out. See :ref:`indexing` for the details.

but I guess this shouldn't be used too much because it looks strange with pydoc.

@@ -1195,6 +1195,55 @@ def sel(
Dataset.sel
DataArray.isel

Examples
--------
>>> ds = xr.tutorial.open_dataset("air_temperature")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually prefer having example data generated by np.arange or np.linspace (or by hand) because that makes it much easier to figure out what the example does. Also, this is the DataArray docstring so explicitly creating a DataArray might be better?

Suggested change
>>> ds = xr.tutorial.open_dataset("air_temperature")
>>> da = xr.DataArray(
... np.arange(20).reshape(5, 10),
... coords={"x": np.arange(5), "y": np.arange(10)},
... dims=("x", "y"),
... )

The doctests / documentation fail because of the repr of the tutorial dataset (the "description" attribute contains a \n (newline) character), so with synthetic data we would avoid that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip. I've implemented your suggestion in DataArray.sel. For DataArray.isel method='nearest doesn't appear to be an option? So I tried this, but it doesn't seem to work. I'm not sure why?

da = xr.DataArray(
                np.arange(25).reshape(5, 5),
                coords={"x": np.arange(5), "y": np.arange(5)},
                dims=("x", "y")
            )
tgt_x = xr.DataArray(np.linspace(0, 2, num=3), dims="points")
tgt_y = xr.DataArray(np.linspace(0, 2, num=3), dims="points")

da.isel(x=tgt_x, y=tgt_y) 

Copy link
Collaborator

@keewis keewis Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isel expects int, but linspace returns float. Since you're using it to generate indices, I would use arange instead (you could also use the dtype parameter of linspace).

Edit: yes, it seems isel does not support method (I guess that's to stay close to numpy indexing). Also, since this is isel, you don't need to pass coords.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on building an example for Dataset.sel using method='nearest'

import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
	data_vars=dict(
			temperature=(["x", "y", "time"], np.arange(125).reshape(5, 5, 5))
		),
		coords=dict(
			lon=(["x", "y"], np.arange(25).reshape(5, 5)),
			lat=(["x", "y"], np.arange(25).reshape(5, 5)),
			time=pd.date_range("2014-09-06", periods=5),
		),
		attrs=dict(description="Weather related data.")
)

# Define target latitude and longitude
tgt_x = xr.DataArray(np.arange(0, 5), dims="points")
tgt_y = xr.DataArray(np.arange(0, 5), dims="points")

# Dataset.sel
ds_sel = ds.sel(x=tgt_x, y=tgt_y, method='nearest')

However I get the following error, which I don't understand. I assume my coordinate definitions are inconsistent?

Traceback (most recent call last):
  File "/Users/erke2265/Documents/xarray/xarray/core/indexing.py", line 257, in remap_label_indexers
    index = data_obj.indexes[dim]
  File "/Users/erke2265/Documents/xarray/xarray/core/indexes.py", line 64, in __getitem__
    return self._indexes[key]
KeyError: 'x'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tmp.py", line 24, in <module>
    ds_sel = ds.sel(x=tgt_x, y=tgt_y, method='nearest')
  File "/Users/erke2265/Documents/xarray/xarray/core/dataset.py", line 2275, in sel
    self, indexers=indexers, method=method, tolerance=tolerance
  File "/Users/erke2265/Documents/xarray/xarray/core/coordinates.py", line 398, in remap_label_indexers
    obj, v_indexers, method=method, tolerance=tolerance
  File "/Users/erke2265/Documents/xarray/xarray/core/indexing.py", line 262, in remap_label_indexers
    "cannot supply ``method`` or ``tolerance`` "
ValueError: cannot supply ``method`` or ``tolerance`` when the indexed dimension does not have an associated coordinate.

Copy link
Collaborator

@keewis keewis Jan 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason is that you define lat and lon as 2D non-dimension coordinates and x and y as dimensions without coordinates. As the error says, sel with method or tolerance only works with dimensions with coordinates, so you need to add coordinates to x and y (actually, I would have expected sel to fail on dimensions without coordinates even without method or tolerance, so I'm a bit surprised the error message implies that could work).

This also means that you don't have to define lat and lon since you won't be using them in this example (it's fine to keep it if you want to use them in later examples, though).

doc/indexing.rst Outdated
Comment on lines 398 to 400
Vectorized indexing may be used to extract information from the nearest
grid cells of interest, for example, the nearest climate model grid
cells to a collection specified weather station latitudes and longitudes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did tell you to modify the vectorized indexing section, but while reading #3768 I noticed that we already document that in the More advanced indexing section. Given that it uses pointwise indexing as its only example, maybe we should rename it to Pointwise indexing to make this easier to find?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like that section should be merged with "Vectorized Indexing" which could then be renamed to "Pointwise (or vectorized) Indexing)"?

Copy link
Collaborator

@keewis keewis Dec 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. Also, the term vectorized indexing is not really explained (unless I'm missing something, there's only a comment hinting at it's meaning), so it might be good to explicitly do that somewhere. This sounds like a lot of work, though, so it might be better to do that in a different PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting discussion. At this point, I don't think I have much to add here. So I concur that this should be addressed in a different PR.

doc/indexing.rst Outdated Show resolved Hide resolved
@pep8speaks
Copy link

pep8speaks commented Jan 5, 2021

Hello @EricKeenan! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-16 22:28:41 UTC

xarray/core/dataarray.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some suggestions that should hopefully resolve the failures.

@dcherian @keewis you still had some open comments. Do you require more changes or could we merge this on green?

xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved
doc/indexing.rst Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved
@keewis
Copy link
Collaborator

keewis commented Jan 22, 2021

I would update the example in indexing.rst to use the dataset from the previous examples and maybe copy / adapt the examples to Dataset.sel / Dataset.isel (can be a new PR, though), but otherwise this looks good to me.

@dcherian
Copy link
Contributor

Looks great. Thanks @EricKeenan . I see this is your first PR, welcome to xarray!

@dcherian dcherian merged commit 2ab0666 into pydata:master Feb 16, 2021
dcherian added a commit to dcherian/xarray that referenced this pull request Feb 17, 2021
* upstream/master:
  FIX: h5py>=3 string decoding (pydata#4893)
  Update matplotlib's canonical (pydata#4919)
  Adding vectorized indexing docs (pydata#4711)
  Allow fsspec URLs in open_(mf)dataset (pydata#4823)
  Fix typos in example notebooks (pydata#4908)
  pre-commit autoupdate CI (pydata#4906)
  replace the ci-trigger action with a external one (pydata#4905)
  Update area_weighted_temperature.ipynb (pydata#4903)
  hide the decorator from the test traceback (pydata#4900)
  Sort backends (pydata#4886)
  Compatibility with dask 2021.02.0 (pydata#4884)
@keewis keewis mentioned this pull request Feb 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.sel(...., method='nearest') fails for large requests. Pointwise indexing
6 participants