DataArray.sel can silently pick up the nearest point, even if it is far away and the query is out of bounds #8335

Open
@jerabaul29

What is your issue?

Credit to @paulina-t, who found a bug in a codebase caused by the behavior reported here, where it was badly messing things up.

See the example notebook at https://github.com/jerabaul29/public_bug_reports/blob/main/xarray/2023_10_18/interp.ipynb .


Problem

Interpolating, or looking up the nearest neighbor of a query point, is always a bit risky: bad things can happen when the query point lies outside the area covered by the data. Fortunately, xarray returns NaN when interp is asked for a point outside the bounds of a dataset:

import xarray as xr
import numpy as np

xr.__version__

'2023.9.0'

data = np.array([[1, 2, 3], [4, 5, 6]])
lat = [10, 20]
lon = [120, 130, 140]

data_xr = xr.DataArray(data, coords={'lat':lat, 'lon':lon}, dims=['lat', 'lon'])

data_xr

<xarray.DataArray (lat: 2, lon: 3)>
array([[1, 2, 3],
       [4, 5, 6]])
Coordinates:
  * lat      (lat) int64 10 20
  * lon      (lon) int64 120 130 140

# interp is civilized: rather than wildly extrapolating, it returns NaN

data_xr.interp(lat=15, lon=125)

<xarray.DataArray ()>
array(3.)
Coordinates:
    lat      int64 15
    lon      int64 125

data_xr.interp(lat=5, lon=125)

<xarray.DataArray ()>
array(nan)
Coordinates:
    lat      int64 5
    lon      int64 125

Unfortunately, .sel with method='nearest' will happily return the nearest neighbor of a point, even if the input point is outside of the dataset range:

# sel is not as civilized: it happily finds the nearest neighbor, even if the query point lies entirely "on one side" of the example data

data_xr.sel(lat=5, lon=125, method='nearest')

<xarray.DataArray ()>
array(2)
Coordinates:
    lat      int64 10
    lon      int64 130

This can easily cause tricky bugs.
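
For reference, a partial mitigation that already exists is the tolerance argument of .sel. The sketch below rebuilds the example array and shows that a query too far from any label raises a KeyError instead of silently matching. The value tolerance=2 is an arbitrary choice for illustration. Note that this limits the distance to the matched label; it is not a true bounds check, so a query slightly outside the range but within the tolerance would still match.

```python
import numpy as np
import xarray as xr

data_xr = xr.DataArray(
    np.array([[1, 2, 3], [4, 5, 6]]),
    coords={"lat": [10, 20], "lon": [120, 130, 140]},
    dims=["lat", "lon"],
)

# lat=5 is 5 units away from the nearest label (10), which exceeds
# tolerance=2, so the lookup raises instead of silently returning lat=10.
try:
    data_xr.sel(lat=5, method="nearest", tolerance=2)
    out_of_reach_raised = False
except KeyError:
    out_of_reach_raised = True

# lat=11 is within tolerance of the label 10, so the lookup succeeds.
close_match = data_xr.sel(lat=11, method="nearest", tolerance=2)
```

The same tolerance applies to every dimension queried in one call, which is why only lat is queried here.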


Discussion

Would it be possible for .sel to have a behavior that makes the user aware of such issues? I.e. either:

  • print a warning on stderr
  • return NaN
  • raise an exception

when performing a .sel query that is outside of the dataset range / not between two dataset points?

I understand that finding the nearest neighbor may still be useful / wanted in some cases, even outside of the bounds of the dataset, but the fact that this happens silently by default has been causing bugs for us. Could this default behavior either be changed, or be controlled by a flag (for example allow_extrapolate=False by default, so that users have to consciously opt in)?
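
To make the request concrete, here is a minimal user-side sketch of the "raise an exception" option. The helper name sel_nearest_in_bounds is hypothetical, and allow_extrapolate is the flag proposed above, not an existing xarray parameter. The sketch assumes numeric coordinates and simply compares each query against the minimum and maximum of the corresponding coordinate before delegating to .sel.

```python
import numpy as np
import xarray as xr


def sel_nearest_in_bounds(da, allow_extrapolate=False, **queries):
    """Hypothetical wrapper around da.sel(..., method="nearest").

    Raises ValueError if any query value lies outside the range of its
    coordinate, unless allow_extrapolate is True.
    """
    if not allow_extrapolate:
        for dim, value in queries.items():
            coord = da[dim].values
            lo, hi = coord.min(), coord.max()
            value_arr = np.asarray(value)
            if np.any(value_arr < lo) or np.any(value_arr > hi):
                raise ValueError(
                    f"{dim}={value!r} is outside the coordinate range [{lo}, {hi}]"
                )
    return da.sel(method="nearest", **queries)


data_xr = xr.DataArray(
    np.array([[1, 2, 3], [4, 5, 6]]),
    coords={"lat": [10, 20], "lon": [120, 130, 140]},
    dims=["lat", "lon"],
)

# In-bounds query: behaves like a plain nearest-neighbor .sel
inside = sel_nearest_in_bounds(data_xr, lat=12, lon=131)

# Out-of-bounds query: raises unless the caller opts in explicitly
try:
    sel_nearest_in_bounds(data_xr, lat=5, lon=131)
    out_of_bounds_raised = False
except ValueError:
    out_of_bounds_raised = True

opted_in = sel_nearest_in_bounds(data_xr, allow_extrapolate=True, lat=5, lon=131)
```

With this wrapper the silent behavior is still available, but only when the caller passes allow_extrapolate=True.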
