Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bottleneck and dask objects ignore min_periods on rolling #4922

Open
bradyrx opened this issue Feb 18, 2021 · 5 comments
Open

Bottleneck and dask objects ignore min_periods on rolling #4922

bradyrx opened this issue Feb 18, 2021 · 5 comments

Comments

@bradyrx
Copy link
Contributor

bradyrx commented Feb 18, 2021

What happened:

When bottleneck is installed in an environment, it seems to ignore the min_periods kwarg on ds.rolling(...).

What you expected to happen:

When using ds.rolling(..., min_periods=1), it should be able to handle an array of length 1. Without bottleneck installed, it returns the original value of a length 1 array. With bottleneck installed, the error is:

ValueError: Moving window (=2) must between 1 and 1, inclusive

Minimal Complete Verifiable Example:

With bottleneck installed to environment:

import xarray as xr
ds = xr.DataArray([1], dims='time')
ds.rolling(time=2, center=True, min_periods=1).mean()
ValueError: Moving window (=2) must between 1 and 1, inclusive

Without bottleneck installed to environment:

import xarray as xr
ds = xr.DataArray([1], dims='time')
ds.rolling(time=2, center=True, min_periods=1).mean()

>>> <xarray.DataArray (time: 1)>
>>> array([1.])
>>> Dimensions without coordinates: time

Anything else we need to know?:

In an applied case, this came up while working on .groupby('time.dayofyear').map(_rolling), where we map a rolling mean function over a defined N days with min_periods=1. Some climatological days (like leap years) will not have the N day requirement, so the min_period catch handles that, but with bottleneck installed it breaks due to the above issue.

Environment:

Output of xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:22:12) [Clang 11.0.1 ] python-bits: 64 OS: Darwin OS-release: 19.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4

xarray: 0.16.2
pandas: 1.2.1
numpy: 1.19.5
scipy: 1.6.0
netCDF4: 1.5.5.1
pydap: None
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: 2.6.1
cftime: 1.3.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.8
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.01.1
distributed: 2021.01.1
matplotlib: 3.3.3
cartopy: 0.18.0
seaborn: 0.11.1
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20210108
pip: 21.0
conda: None
pytest: 6.2.2
IPython: 7.18.1
sphinx: 3.4.3

@dcherian
Copy link
Contributor

Maybe the padding is breaking down for length-1 arrays?

I would look at padded in this function

values = func(
padded.data, window=self.window[0], min_count=min_count, axis=axis
)

@dcherian dcherian added the bug label Feb 18, 2021
@bradyrx
Copy link
Contributor Author

bradyrx commented Mar 4, 2021

@dcherian, to add to the complexity here, it's even weirder than originally reported. See my test cases below. This might alter how this bug is approached.

import xarray as xr

def _rolling(ds):
    return ds.rolling(time=6, center=False, min_periods=1).mean()

# Length 3 array to test that min_periods is called in, despite asking
# for 6 time-steps of smoothing
ds = xr.DataArray([1, 2, 3], dims='time')
ds['time'] = xr.cftime_range(start='2021-01-01', freq='D', periods=3)

1. With bottleneck installed, min_periods is ignored as a kwarg with in-memory arrays.

(bottleneck installed)

# Just apply rolling to the base array.
ds.rolling(time=6, center=False, min_periods=1).mean()

>>> ValueError: Moving window (=6) must between 1 and 3, inclusive

# Group into single day climatology groups and apply
ds.groupby('time.dayofyear').map(_rolling)

>>> ValueError: Moving window (=6) must between 1 and 1, inclusive

2. With bottleneck uninstalled, min_periods works with in-memory arrays.

(bottleneck uninstalled)

# Just apply rolling to the base array.
ds.rolling(time=6, center=False, min_periods=1).mean()

>>> <xarray.DataArray (time: 3)>
>>> array([1. , 1.5, 2. ])
>>> Coordinates:
>>> * time     (time) object 2021-01-01 00:00:00 ... 2021-01-03 00:00:00

# Group into single day climatology groups and apply
ds.groupby('time.dayofyear').map(_rolling)

>>> <xarray.DataArray (time: 3)>
>>> array([1., 2., 3.])
>>> Coordinates:
>>>  * time     (time) object 2021-01-01 00:00:00 ... 2021-01-03 00:00:00

3. Regardless of bottleneck, dask objects ignore min_period when a groupby object.

This specifically seems like an issue with .map()

(independent of bottleneck installation)

# Just apply rolling to the base array.
ds.chunk().rolling(time=6, center=False, min_periods=1).mean().compute()

>>> <xarray.DataArray (time: 3)>
>>> array([1. , 1.5, 2. ])
>>> Coordinates:
>>>   * time     (time) object 2021-01-01 00:00:00 ... 2021-01-03 00:00:00

# Group into single day climatology groups and apply
ds.chunk().groupby('time.dayofyear').map(_rolling)

>>> ValueError: For window size 6, every chunk should be larger than 3, but the smallest chunk size is 1.
>>> Rechunk your array
>>> with a larger chunk size or a chunk size that
>>> more evenly divides the shape of your array.

@dcherian
Copy link
Contributor

dcherian commented Mar 4, 2021

# Just apply rolling to the base array.
ds.rolling(time=6, center=False, min_periods=1).mean()

I feel like this should not work i.e. rolling window length (6) < size along axis (3). So the bottleneck error seems right.

The chunk size error in the last example should go away with #4977

@bradyrx
Copy link
Contributor Author

bradyrx commented Mar 5, 2021

I feel like this should not work i.e. rolling window length (6) < size along axis (3). So the bottleneck error seems right.

This is normally the case, but with min_periods=1 it should just return the given value so long as there's at least one observation (as in case #2, where the boundaries return as normal and the middle number is smoothed).

Thanks for the pointer on #4977!

@bradyrx bradyrx changed the title Bottleneck ignores min_periods on rolling Bottleneck and dask objects ignore min_periods on rolling Mar 9, 2021
@schild
Copy link

schild commented Dec 19, 2021

encountered the same problem by Bottleneck.move_rank() ;
i have to judge the length of dataframe in advance

 vol_list =[3,5,10,20,30,60,90]
    for i in vol_list:
        if len(last_df) >= i:
            last_df['vol_big_than_today_count_'+str(i)] = move_rank(last_df['vol'], i)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants