xarray rolling does not match pandas when using min_periods and reduce #3066

Open
mrezak opened this issue Jun 30, 2019 · 2 comments


mrezak commented Jun 30, 2019

MCVE Code Sample

import numpy as np
import pandas as pd
import xarray

def custom(x, axis=0):
    return np.mean(x, axis)

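d = pd.DataFrame(np.random.rand(100,3))
# pandas: the first 4 rows are NaN, then values appear once 5 observations are available
r = d.rolling(10, min_periods=5).apply(custom)
print(r.iloc[0:10,:])

# xarray: same window and min_periods, but the first 9 positions come out as NaN
xd = d.to_xarray().to_array()
r = xd.rolling(index=10, min_periods=5).reduce(custom)
print(r[:,0:10])
# lowering min_periods to 1 gives the same result
r = xd.rolling(index=10, min_periods=1).reduce(custom)
print(r[:,0:10])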

Problem Description

I am applying a custom function on rolling windows with a specific min_periods. The output of pandas' rolling.apply matches what I expect; however, the output of xarray's rolling.reduce doesn't seem to take min_periods into account.

Expected Output and Actual Output

          0         1         2
0       NaN       NaN       NaN
1       NaN       NaN       NaN
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4  0.632168  0.523669  0.543643
5  0.558694  0.565781  0.481204
6  0.559343  0.541787  0.415490
7  0.613457  0.554888  0.398999
8  0.579552  0.496799  0.397681
9  0.562591  0.525096  0.416461
<xarray.DataArray (variable: 3, index: 10)>
array([[     nan,      nan,      nan,      nan,      nan,      nan,      nan,
             nan,      nan, 0.562591],
       [     nan,      nan,      nan,      nan,      nan,      nan,      nan,
             nan,      nan, 0.525096],
       [     nan,      nan,      nan,      nan,      nan,      nan,      nan,
             nan,      nan, 0.416461]])
Coordinates:
  * index     (index) int64 0 1 2 3 4 5 6 7 8 9
  * variable  (variable) int64 0 1 2
<xarray.DataArray (variable: 3, index: 10)>
array([[     nan,      nan,      nan,      nan,      nan,      nan,      nan,
             nan,      nan, 0.562591],
       [     nan,      nan,      nan,      nan,      nan,      nan,      nan,
             nan,      nan, 0.525096],
       [     nan,      nan,      nan,      nan,      nan,      nan,      nan,
             nan,      nan, 0.416461]])
Coordinates:
  * index     (index) int64 0 1 2 3 4 5 6 7 8 9
  * variable  (variable) int64 0 1 2

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 18.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.4
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.0.0
distributed: 2.0.1
matplotlib: 3.1.0
cartopy: None
seaborn: 0.9.0
setuptools: 41.0.1
pip: 19.1.1
conda: None
pytest: None
IPython: 7.5.0
sphinx: None

Member

shoyer commented Jul 3, 2019

@mrezak Thanks for the report and the clear example!

Certainly this is an annoying inconsistency. I'm trying to figure out whether this is also a bug or not.

I think the difference comes down to how pandas and xarray pass data into the custom function. Pandas passes individual slices, trimming out values outside the window. Xarray passes an N+1 dimensional view of the array data, with an extra dimension added for the "window offset" and with values outside the window filled with NaN:

import numpy as np
import pandas as pd
import xarray

def custom(x, axis=0):
    print(x)
    return np.mean(x, axis)

print('pandas example')
d = pd.DataFrame(np.random.rand(11,3))
r = d.rolling(10, min_periods=5).apply(custom)
print(r.iloc[0:10,:])

print('\nxarray example')
xd = d.to_xarray().to_array()
r = xd.rolling(index=10, min_periods=5).reduce(custom)
print(r[:,0:10])

Output:

pandas example
[0.06130714 0.86751339 0.06688379 0.45866121 0.88848511]
[0.06130714 0.86751339 0.06688379 0.45866121 0.88848511 0.22369799]
[0.06130714 0.86751339 0.06688379 0.45866121 0.88848511 0.22369799
 0.23970828]
[0.06130714 0.86751339 0.06688379 0.45866121 0.88848511 0.22369799
 0.23970828 0.94317625]
[0.06130714 0.86751339 0.06688379 0.45866121 0.88848511 0.22369799
 0.23970828 0.94317625 0.22736209]
[0.06130714 0.86751339 0.06688379 0.45866121 0.88848511 0.22369799
 0.23970828 0.94317625 0.22736209 0.08384912]
[0.86751339 0.06688379 0.45866121 0.88848511 0.22369799 0.23970828
 0.94317625 0.22736209 0.08384912 0.23068875]
[0.87929068 0.81303738 0.62778023 0.34381748 0.55361603]
[0.87929068 0.81303738 0.62778023 0.34381748 0.55361603 0.39705802]
[0.87929068 0.81303738 0.62778023 0.34381748 0.55361603 0.39705802
 0.2023665 ]
[0.87929068 0.81303738 0.62778023 0.34381748 0.55361603 0.39705802
 0.2023665  0.20541754]
[0.87929068 0.81303738 0.62778023 0.34381748 0.55361603 0.39705802
 0.2023665  0.20541754 0.37710566]
[0.87929068 0.81303738 0.62778023 0.34381748 0.55361603 0.39705802
 0.2023665  0.20541754 0.37710566 0.18844817]
[0.81303738 0.62778023 0.34381748 0.55361603 0.39705802 0.2023665
 0.20541754 0.37710566 0.18844817 0.51895952]
[0.33501081 0.67972562 0.08622488 0.89673242 0.94532091]
[0.33501081 0.67972562 0.08622488 0.89673242 0.94532091 0.84144888]
[0.33501081 0.67972562 0.08622488 0.89673242 0.94532091 0.84144888
 0.43766841]
[0.33501081 0.67972562 0.08622488 0.89673242 0.94532091 0.84144888
 0.43766841 0.88536995]
[0.33501081 0.67972562 0.08622488 0.89673242 0.94532091 0.84144888
 0.43766841 0.88536995 0.7662462 ]
[0.33501081 0.67972562 0.08622488 0.89673242 0.94532091 0.84144888
 0.43766841 0.88536995 0.7662462  0.4677236 ]
[0.67972562 0.08622488 0.89673242 0.94532091 0.84144888 0.43766841
 0.88536995 0.7662462  0.4677236  0.7083373 ]
          0         1         2
0       NaN       NaN       NaN
1       NaN       NaN       NaN
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4  0.468570  0.643508  0.588603
5  0.427758  0.602433  0.630744
6  0.400894  0.545281  0.603162
7  0.468679  0.502798  0.638438
8  0.441866  0.488832  0.652639
9  0.406064  0.458794  0.634147

xarray example
[[[       nan        nan        nan        nan        nan        nan
          nan        nan        nan 0.06130714]
  [       nan        nan        nan        nan        nan        nan
          nan        nan 0.06130714 0.86751339]
  [       nan        nan        nan        nan        nan        nan
          nan 0.06130714 0.86751339 0.06688379]
  [       nan        nan        nan        nan        nan        nan
   0.06130714 0.86751339 0.06688379 0.45866121]
  [       nan        nan        nan        nan        nan 0.06130714
   0.86751339 0.06688379 0.45866121 0.88848511]
  [       nan        nan        nan        nan 0.06130714 0.86751339
   0.06688379 0.45866121 0.88848511 0.22369799]
  [       nan        nan        nan 0.06130714 0.86751339 0.06688379
   0.45866121 0.88848511 0.22369799 0.23970828]
  [       nan        nan 0.06130714 0.86751339 0.06688379 0.45866121
   0.88848511 0.22369799 0.23970828 0.94317625]
  [       nan 0.06130714 0.86751339 0.06688379 0.45866121 0.88848511
   0.22369799 0.23970828 0.94317625 0.22736209]
  [0.06130714 0.86751339 0.06688379 0.45866121 0.88848511 0.22369799
   0.23970828 0.94317625 0.22736209 0.08384912]
  [0.86751339 0.06688379 0.45866121 0.88848511 0.22369799 0.23970828
   0.94317625 0.22736209 0.08384912 0.23068875]]

 [[       nan        nan        nan        nan        nan        nan
          nan        nan        nan 0.87929068]
  [       nan        nan        nan        nan        nan        nan
          nan        nan 0.87929068 0.81303738]
  [       nan        nan        nan        nan        nan        nan
          nan 0.87929068 0.81303738 0.62778023]
  [       nan        nan        nan        nan        nan        nan
   0.87929068 0.81303738 0.62778023 0.34381748]
  [       nan        nan        nan        nan        nan 0.87929068
   0.81303738 0.62778023 0.34381748 0.55361603]
  [       nan        nan        nan        nan 0.87929068 0.81303738
   0.62778023 0.34381748 0.55361603 0.39705802]
  [       nan        nan        nan 0.87929068 0.81303738 0.62778023
   0.34381748 0.55361603 0.39705802 0.2023665 ]
  [       nan        nan 0.87929068 0.81303738 0.62778023 0.34381748
   0.55361603 0.39705802 0.2023665  0.20541754]
  [       nan 0.87929068 0.81303738 0.62778023 0.34381748 0.55361603
   0.39705802 0.2023665  0.20541754 0.37710566]
  [0.87929068 0.81303738 0.62778023 0.34381748 0.55361603 0.39705802
   0.2023665  0.20541754 0.37710566 0.18844817]
  [0.81303738 0.62778023 0.34381748 0.55361603 0.39705802 0.2023665
   0.20541754 0.37710566 0.18844817 0.51895952]]

 [[       nan        nan        nan        nan        nan        nan
          nan        nan        nan 0.33501081]
  [       nan        nan        nan        nan        nan        nan
          nan        nan 0.33501081 0.67972562]
  [       nan        nan        nan        nan        nan        nan
          nan 0.33501081 0.67972562 0.08622488]
  [       nan        nan        nan        nan        nan        nan
   0.33501081 0.67972562 0.08622488 0.89673242]
  [       nan        nan        nan        nan        nan 0.33501081
   0.67972562 0.08622488 0.89673242 0.94532091]
  [       nan        nan        nan        nan 0.33501081 0.67972562
   0.08622488 0.89673242 0.94532091 0.84144888]
  [       nan        nan        nan 0.33501081 0.67972562 0.08622488
   0.89673242 0.94532091 0.84144888 0.43766841]
  [       nan        nan 0.33501081 0.67972562 0.08622488 0.89673242
   0.94532091 0.84144888 0.43766841 0.88536995]
  [       nan 0.33501081 0.67972562 0.08622488 0.89673242 0.94532091
   0.84144888 0.43766841 0.88536995 0.7662462 ]
  [0.33501081 0.67972562 0.08622488 0.89673242 0.94532091 0.84144888
   0.43766841 0.88536995 0.7662462  0.4677236 ]
  [0.67972562 0.08622488 0.89673242 0.94532091 0.84144888 0.43766841
   0.88536995 0.7662462  0.4677236  0.7083373 ]]]
<xarray.DataArray (variable: 3, index: 10)>
array([[     nan,      nan,      nan,      nan,      nan,      nan,      nan,
             nan,      nan, 0.406064],
       [     nan,      nan,      nan,      nan,      nan,      nan,      nan,
             nan,      nan, 0.458794],
       [     nan,      nan,      nan,      nan,      nan,      nan,      nan,
             nan,      nan, 0.634147]])
Coordinates:
  * index     (index) int64 0 1 2 3 4 5 6 7 8 9
  * variable  (variable) int64 0 1 2

Xarray's version is certainly going to be way faster, but it has the downside of treating windows differently. One way to work around this would be to use np.nanmean inside custom instead of np.mean.
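
To illustrate the workaround, here is a minimal sketch that passes np.nanmean directly to reduce and compares the result with pandas' built-in rolling mean. The array size, the fixed seed, and building the DataArray directly with xr.DataArray (rather than via to_xarray) are assumptions made for the example, not part of the original report.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small reproducible input (size and seed are arbitrary choices for this sketch).
rng = np.random.RandomState(0)
values = rng.rand(20, 3)

# pandas reference: rolling mean with a window of 10 and min_periods=5.
d = pd.DataFrame(values)
expected = d.rolling(10, min_periods=5).mean()

# xarray: the NaN-padded windows are reduced with np.nanmean, so the padding
# is ignored; positions with fewer than min_periods valid values stay NaN.
xd = xr.DataArray(values, dims=("index", "variable"))
result = xd.rolling(index=10, min_periods=5).reduce(np.nanmean)

# Compare the two results, treating NaN == NaN as equal.
print(np.allclose(result.values, expected.values, equal_nan=True))
```

With np.mean in place of np.nanmean, any window that still contains NaN padding would reduce to NaN, which is the behaviour reported above.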

cc @jhamman @fujiisoup who worked on this and may have ideas

Author

mrezak commented Jul 4, 2019

@shoyer thanks for looking into this.

I also figured out later that I can just use np.nanmean (or np.nanmedian), but those turn out to be much slower than np.mean (or np.median). Since NaNs only occur at the beginning and end of the sequence, is there an efficient way to use nanmean only for those segments and mean for the rest of the processing? My own thought is to check for NaN in the custom function and apply mean or nanmean depending on the result of that check, but I am not sure whether this can be done more efficiently.
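
One possible shape for that check, as a rough sketch rather than a tested recommendation: run np.mean over all windows first, then recompute only the windows whose result is NaN with np.nanmean. The function name mean_with_nan_fallback is hypothetical, and the sketch assumes the input has at least two dimensions (as in the stacked windows xarray passes to reduce). Whether it is actually faster depends on how many windows contain NaN, so it is worth timing on real data.

```python
import numpy as np

def mean_with_nan_fallback(x, axis=-1):
    """Hypothetical helper: np.mean everywhere, np.nanmean only where needed.

    Assumes x has at least 2 dimensions, so the reduced result is an array.
    """
    # Fast path: plain mean. Any window containing NaN yields NaN here.
    out = np.mean(x, axis=axis)
    bad = np.isnan(out)
    if bad.any():
        # Move the window axis last so the boolean mask selects whole windows.
        moved = np.moveaxis(x, axis, -1)
        # Slow path, restricted to the windows that contained NaN.
        out[bad] = np.nanmean(moved[bad], axis=-1)
    return out

# Two windows of length 4: the first is NaN-padded, the second is complete.
windows = np.array([[np.nan, np.nan, 1.0, 2.0],
                    [1.0, 2.0, 3.0, 4.0]])
print(mean_with_nan_fallback(windows))
```

Since reduce calls the function with an axis keyword, this could be passed as xd.rolling(index=10, min_periods=5).reduce(mean_with_nan_fallback). Note that genuine NaN values in the data are skipped in the same way as the padding, and a window made up entirely of NaN still yields NaN (with a NumPy warning).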
