Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: rolling.apply will pass Series to function #20584

Merged
merged 4 commits into from
Apr 16, 2018

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Apr 2, 2018

closes #5071

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Apr 2, 2018
@jreback jreback added this to the 0.23.0 milestone Apr 2, 2018

self._check_expanding(expanding_mean, np.mean)
# TODO(jreback), needed to add preserve_nan=False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if anyone is interested to see why this doesn't work, would be hepful, @WillAyd (you have some xp in this area)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like rows [20:39] of the series in use here contains NA values. preserve_nan is assuming that those will still be NA after the function application, but isn't it by design with expanding functions that the NA values do not get preserved? Ex:

>>> ser = pd.Series(range(5))
>>> ser[2] = np.nan
>>> ser.expanding().sum()
0    0.0
1    1.0
2    1.0
3    4.0
4    8.0
dtype: float64

So I think this keyword is necessary by design of the expanding function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah exactly - so wondering if the current impl (eg master) is actually wrong

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the implementation in detail but I'm assuming this ties back to the default argument of min_periods=1 that the Expanding class uses - I think that would make it almost impossible for NA data to appear while expanding.

I'd buy the argument that the default here should match the Rolling impl (i.e. 0) and Expanding should deal with "expected" NAs (i.e. forward looking values in the window that just haven't been reached yet) vs "unexpected" NA (data from a previous entry in the window) in a more robust fashion.

It would be a breaking change; happy to open the ticket if you agree

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this might be a bit of an unexplored corner

yeah if i can show a simple example would be great to open an issue

after all expanding is just rolling with the window len of the len of the obj

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it shouldn't be though? I.e. should the window size of expanding be i for i in range(N) instead of N for i in range(N)? The difference can be illustrated looking at the third element below (elements 4 and 5 are expected to differ from window size)

>>> ser = pd.Series([0, np.nan, 2, 3, 4])
In [127]: ser.rolling(window=2, min_periods=2).sum()
Out[127]: 
0    NaN
1    NaN
2    NaN  # Doesn't populate here; there aren't 2 non-NA entries at this point
3    5.0
4    7.0
dtype: float64

In [128]: ser.rolling(window=len(ser), min_periods=2).sum()
Out[128]: 
0    NaN
1    NaN
2    2.0  # Populates here, even though this is the first non-NA record and min_periods is 2
3    5.0
4    9.0
dtype: float64

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 2, 2018

Thoughts on the default being ndarray with a warning that the default will change to raw=False in the future? Then this isn't a breaking change.

@jreback
Copy link
Contributor Author

jreback commented Apr 2, 2018

yeah i could make the keyword default (of None) and then warn that it’s going to change

that’s back compat, though would have to flip it at some point anyhow

@codecov
Copy link

codecov bot commented Apr 2, 2018

Codecov Report

Merging #20584 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20584      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%     
==========================================
  Files         153      153              
  Lines       49279    49295      +16     
==========================================
+ Hits        45259    45272      +13     
- Misses       4020     4023       +3
Flag Coverage Δ
#multiple 90.23% <100%> (-0.01%) ⬇️
#single 41.89% <23.07%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 95.94% <ø> (ø) ⬆️
pandas/core/window.py 96.29% <100%> (+0.02%) ⬆️
pandas/core/arrays/categorical.py 95.78% <0%> (-0.41%) ⬇️
pandas/util/testing.py 84.38% <0%> (-0.21%) ⬇️
pandas/io/pytables.py 92.41% <0%> (-0.05%) ⬇️
pandas/core/internals.py 95.53% <0%> (-0.01%) ⬇️
pandas/core/groupby/groupby.py 92.55% <0%> (ø) ⬆️
pandas/core/frame.py 97.16% <0%> (ø) ⬆️
pandas/core/indexing.py 93.08% <0%> (+0.05%) ⬆️
pandas/core/dtypes/missing.py 92.85% <0%> (+1.78%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d04b746...ab46058. Read the comment docs.


The :func`Series.rolling`, :func:`DataFrame.rolling`, :func`Series.expanding`, :func:`DataFrame.expanding` methods when used with ``.apply()`` have gained a ``raw=False`` parameter.
This is similar to :func:`DataFame.apply`. This parameter, ``False`` by default allows one to send a ``np.ndarray`` to the applied function, rather than the default of a ``Series``.
This is a change from prior versions, when the applied function would *always* received an ndarray. This allow one to use pandas operations generically. (:issue:`5071`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor typo on this line - either "would always receive" or "always received"

@jreback
Copy link
Contributor Author

jreback commented Apr 10, 2018

this is all fixed up. now back-compat (with a warning). if anyone wants to look

cc @WillAyd
@TomAugspurger

@WillAyd
Copy link
Member

WillAyd commented Apr 11, 2018

lgtm

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice to get it more consistent with DataFrame.apply!

Rolling/Expanding.apply() accepts a ``raw`` keyword to pass a ``Series`` to the function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :func`Series.rolling`, :func:`DataFrame.rolling`, :func`Series.expanding`, :func:`DataFrame.expanding` methods when used with ``.apply()`` have gained a ``raw=None`` parameter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could actually link directly to the apply method here (which is more correct, as it is not rolling that gained a keyword). I think something like :func:`Series.rolling().apply() <pandas.core.window.Rolling.apply>` might do it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a one-line notice to the Deprecations section?

objects instead.
If you are just applying a NumPy reduction function this will
achieve much better performance.
.. versionadded:: 0.23.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line above this one

* ``False`` : passes each row or column as a Series to the
function.
* ``True`` or ``None`` : the passed function will receive ndarray
objects instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you indicate something about the deprecation / changing default?

"for .apply().\nIn the future, this will default to "
"False, meaning a Series will be passed to the "
"applied function. Not passing raw, defaults "
"raw=True, meaning a ndarray is passed to the "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this a bit confusing, as in the future this will not be the default. I would try to rephrase a bit like "Now does this, in future will change to this. Pass this to keep current behaviour, pass this to silence the warning"

@@ -314,7 +314,7 @@ def _center_window(self, result, window):
def aggregate(self, arg, *args, **kwargs):
result, how = self._aggregate(arg, *args, **kwargs)
if result is None:
return self.apply(arg, args=args, kwargs=kwargs)
return self.apply(arg, raw=False, args=args, kwargs=kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this have any consequences for the user?
(but, on the other hand, I don't think it is worth to add a raw keyword just to be able to deprecate the behaviour and change it more slowly)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this is a very special case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that a special case, it happens when the user passes a custom function to r.agg(..), which eg is done in the documentation about it: http://pandas.pydata.org/pandas-docs/stable/computation.html#aggregation

So just as with apply, if the user has a function that behaves differently for a Series than an array, this will break.

@@ -955,22 +955,48 @@ def count(self):
----------
func : function
Must produce a single value from an ndarray input
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the "ndarray input" here?

# #1850
@pytest.mark.parametrize(
'method', [lambda x: x.rolling(window=2), lambda x: x.expanding()])
def test_apply_future_warning(self, method):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parametrize for Series/DataFrame ? (although they maybe take fully the same code path?)


result = df.rolling(2).apply(f, raw=False)
expected = df[1:].reindex_like(df)
tm.assert_frame_equal(result, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the case with raw=True and assert that an AttributeError is raised ? (or pass a function that asserts it gets a numpy array)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And can you do this test (with such a function that requires a pandas object) also with a freq window?

@jorisvandenbossche
Copy link
Member

While trying this out, I also noticed that rolling apply only seems to work on numeric (int/float) data. Is this a known limitation/bug? (didn't directly find an issue about it):

In [69]: s = pd.Series(list('abcd'))

In [71]: s.rolling(2).apply(lambda x: x+1, raw=True)   
Out[71]: 
0    a         #<-------- for string data, just returns the original series, whathever the function is
1    b
2    c
3    d
dtype: object

In [72]: s = pd.Series(pd.Categorical(list('abcd')))

In [73]: s.rolling(2).apply(lambda x: x+1, raw=True)
Out[73]: 
[a, b, c, d]     #<-------- for categorical data, just returns the original Categorical (not even series)
Categories (4, object): [a, b, c, d]

In [74]: s = pd.Series(pd.timedelta_range("1 days", periods=4))

In [75]: s.rolling(2).apply(lambda x: x+1, raw=True)   #<----- for timedelta/datetime, raises informative error
...
NotImplementedError: ops for Rolling for this dtype timedelta64[ns] are not implemented

(this is a separate issue as this PR, it's already like this in released version)

@jreback
Copy link
Contributor Author

jreback commented Apr 15, 2018

any comments.

@jorisvandenbossche
Copy link
Member

@jreback can you please work with additional commits, that makes a second review round really a lot easier for me

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to decide about the rolling().agg case (https://github.com/pandas-dev/pandas/pull/20584/files#r180718011). It got (in certain cases of passing a custom function) a ndarray before, which is now changed to a Series.
I don't think it is worth adding a raw keyword here as well (people who need it can use apply), and I am fine with deciding that it is also not worth to deprecate, but then we should do that explicitly (thus also document this API change).

Apart from that, looks good to me, added only some doc comments

@@ -843,6 +872,9 @@ Deprecations
- ``Index.summary()`` is deprecated and will be removed in a future version (:issue:`18217`)
- ``NDFrame.get_ftype_counts()`` is deprecated and will be removed in a future version (:issue:`18243`)
- The ``convert_datetime64`` parameter in :func:`DataFrame.to_records` has been deprecated and will be removed in a future version. The NumPy bug motivating this parameter has been resolved. The default value for this parameter has also changed from ``True`` to ``None`` (:issue:`18160`).
- :func:`Series.rolling().apply() <pandas.core.window.Rolling.apply>`, :func:`DataFrame.rolling().apply() <pandas.core.window.Rolling.apply>`,
:func:`Series.expanding().apply() <pandas.core.window.Expanding.apply>`, and :func:`DataFrame.expanding().apply() <pandas.core.window.Expanding.apply>` will show a ``FutureWarning`` if the new
``raw`` parameter is not explicity passed (:issue:`20584`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather say what has been deprecated: passing an array by default (and then indeed raw can be used to keep this behaviour or not)

achieve much better performance.

The raw parameter is required and will show a FutureWarning if
not passed. In the futures raw will default to False.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

futures -> future, raw -> `raw`

@@ -954,23 +954,53 @@ def count(self):
Parameters
----------
func : function
Must produce a single value from an ndarray input
\*args and \*\*kwargs are passed to the function""")
Must produce a single value from an ndarray input if raw=True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quote raw=True as a code snippet (``raw=True``), same on line below for raw=False

# change to False in the future
if raw is None:
warnings.warn(
"raw defaults to False "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put single quotes around raw? (like "'raw' defaults to False ...") And two lines below as well. I think that will make it read better (clearer that it is a keyword)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think it is the other way around? The warning says "default is False, in future will change to True", but isn't it the other way around?

@jreback
Copy link
Contributor Author

jreback commented Apr 16, 2018

updated, I adjusted the .aggregate doc-string & put in an api-note. lmk if any further comments.

@jreback
Copy link
Contributor Author

jreback commented Apr 16, 2018

bombs away

@jreback jreback merged commit 4a34497 into pandas-dev:master Apr 16, 2018
@jorisvandenbossche
Copy link
Member

Thanks!

@H0bbitBaron
Copy link

@jreback hello! I googled insanely, but now I'm even more confused. Is there a way to pass 2D with few columns instead of 1D series to apply(function)? There was a function rolling_apply, but it's depricated. Ended up writing my own function with cycle inside, which emulates rolling window with some cross-column calculations, but it's rather slow. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH/API: rolling_apply to pass frames to the rolled function (rather than ndarrays)
5 participants