Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERFORMANCE]: isin on CFTimeIndex-backed Coordinate slow #6230

Open
aaronspring opened this issue Feb 1, 2022 · 5 comments
Open

[PERFORMANCE]: isin on CFTimeIndex-backed Coordinate slow #6230

aaronspring opened this issue Feb 1, 2022 · 5 comments

Comments

@aaronspring
Copy link
Contributor

aaronspring commented Feb 1, 2022

Is your feature request related to a problem?

I want to do coord1.isin.coord2 and it is quite slow when coords are large and of object type CFTimeIndex.

import xarray as xr
import numpy as np

n=1000
coord1 = xr.cftime_range(start='2000', freq='MS', periods=n)
coord2 = xr.cftime_range(start='2000', freq='3MS', periods=n)

# cftimeindex: very fast
%timeit coord1.isin(coord2) # 743 µs ± 1.33 µs

# np.isin on index.asi8
%timeit np.isin(coord1.asi8,coord2.asi8) # 7.83 ms ± 14.1 µs

da = xr.DataArray(np.random.random((n,n)),dims=['a','b'],coords={'a':coord1,'b':coord2})

# when xr.DataArray coordinate slow
%timeit da.a.isin(da.b) # 94.9 ms ± 959 µs

# when converting xr.DataArray coordinate back to index slow
%timeit np.isin(da.a.to_index(), da.b.to_index()) # 97.4 ms ± 819 µs

# when converting xr.DataArray coordinate back to index asi
%timeit np.isin(da.a.to_index().asi8, da.b.to_index().asi8) # 7.89 ms ± 15.2 µs

Describe the solution you'd like

faster coord1.isin.coord2 by default. could we re-route here, e.g. to the alternative?

conversion from coordinate to_index() is costly I guess

Describe alternatives you've considered

np.isin(coord1.to_index().asi8, coord2.to_index().asi8 brings me nice speedups in pangeo-data/climpred#724

Additional context

unsure whether this issue should go here on in cftime

@dcherian
Copy link
Contributor

dcherian commented Feb 7, 2022

Do all pd.Index objects define .isin?

If so we could special case Index objects here (note duck_array_ops.isin is numpy.isin):

xarray/xarray/core/common.py

Lines 1423 to 1428 in d47cf0c

return apply_ufunc(
duck_array_ops.isin,
self,
kwargs=dict(test_elements=test_elements),
dask="allowed",
)

cc @shoyer

@shoyer
Copy link
Member

shoyer commented Feb 7, 2022

Yes, I think replacing this with something like lambda x, y: x.isin(y) if isinstance(x, pd.Index) else np.isin(x, y) could work

@shoyer
Copy link
Member

shoyer commented Feb 7, 2022

Oh, I guess the challenge is that apply_ufunc operates on arrays, not indexes. I'm not entirely sure how to deal with this easily....

@dcherian
Copy link
Contributor

dcherian commented Feb 7, 2022

Are we still removing IndexVariable? If not, we could define isin on Variable and IndexVariable and it would be quite clean.

@shoyer
Copy link
Member

shoyer commented Feb 7, 2022

In the long term (cc @benbovy) I think we would ideally split IndexVariable into two classes:

  1. FrozenVariable which is just an immutable Variable, and thus that can be safely used for coordinates that have indexes.
  2. PandasIndexArray which wraps pandas.Index objects to satisfy the np.ndarray interface. This is the object which could allow duck_array_ops.isin to use the pandas.Index.isin method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants