API: rolling.apply will pass Series to function #20584

jreback · 2018-04-02T14:09:07Z

jreback · 2018-04-02T14:09:50Z

pandas/tests/test_window.py


-        self._check_expanding(expanding_mean, np.mean)
+        # TODO(jreback), needed to add preserve_nan=False


if anyone is interested to see why this doesn't work, would be hepful, @WillAyd (you have some xp in this area)

It looks like rows [20:39] of the series in use here contains NA values. preserve_nan is assuming that those will still be NA after the function application, but isn't it by design with expanding functions that the NA values do not get preserved? Ex:

>>> ser = pd.Series(range(5)) >>> ser[2] = np.nan >>> ser.expanding().sum() 0 0.0 1 1.0 2 1.0 3 4.0 4 8.0 dtype: float64

So I think this keyword is necessary by design of the expanding function

yeah exactly - so wondering if the current impl (eg master) is actually wrong

I haven't looked at the implementation in detail but I'm assuming this ties back to the default argument of min_periods=1 that the Expanding class uses - I think that would make it almost impossible for NA data to appear while expanding.

I'd buy the argument that the default here should match the Rolling impl (i.e. 0) and Expanding should deal with "expected" NAs (i.e. forward looking values in the window that just haven't been reached yet) vs "unexpected" NA (data from a previous entry in the window) in a more robust fashion.

It would be a breaking change; happy to open the ticket if you agree

yeah this might be a bit of an unexplored corner

yeah if i can show a simple example would be great to open an issue

after all expanding is just rolling with the window len of the len of the obj

Maybe it shouldn't be though? I.e. should the window size of expanding be i for i in range(N) instead of N for i in range(N)? The difference can be illustrated looking at the third element below (elements 4 and 5 are expected to differ from window size)

>>> ser = pd.Series([0, np.nan, 2, 3, 4]) In [127]: ser.rolling(window=2, min_periods=2).sum() Out[127]: 0 NaN 1 NaN 2 NaN # Doesn't populate here; there aren't 2 non-NA entries at this point 3 5.0 4 7.0 dtype: float64 In [128]: ser.rolling(window=len(ser), min_periods=2).sum() Out[128]: 0 NaN 1 NaN 2 2.0 # Populates here, even though this is the first non-NA record and min_periods is 2 3 5.0 4 9.0 dtype: float64

TomAugspurger · 2018-04-02T14:44:58Z

Thoughts on the default being ndarray with a warning that the default will change to raw=False in the future? Then this isn't a breaking change.

jreback · 2018-04-02T22:12:08Z

yeah i could make the keyword default (of None) and then warn that it’s going to change

that’s back compat, though would have to flip it at some point anyhow

codecov · 2018-04-02T22:33:13Z

Codecov Report

Merging #20584 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20584      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%     
==========================================
  Files         153      153              
  Lines       49279    49295      +16     
==========================================
+ Hits        45259    45272      +13     
- Misses       4020     4023       +3

Flag	Coverage Δ
#multiple	`90.23% <100%> (-0.01%)`	⬇️
#single	`41.89% <23.07%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/generic.py	`95.94% <ø> (ø)`	⬆️
pandas/core/window.py	`96.29% <100%> (+0.02%)`	⬆️
pandas/core/arrays/categorical.py	`95.78% <0%> (-0.41%)`	⬇️
pandas/util/testing.py	`84.38% <0%> (-0.21%)`	⬇️
pandas/io/pytables.py	`92.41% <0%> (-0.05%)`	⬇️
pandas/core/internals.py	`95.53% <0%> (-0.01%)`	⬇️
pandas/core/groupby/groupby.py	`92.55% <0%> (ø)`	⬆️
pandas/core/frame.py	`97.16% <0%> (ø)`	⬆️
pandas/core/indexing.py	`93.08% <0%> (+0.05%)`	⬆️
pandas/core/dtypes/missing.py	`92.85% <0%> (+1.78%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d04b746...ab46058. Read the comment docs.

WillAyd · 2018-04-03T06:21:45Z

doc/source/whatsnew/v0.23.0.txt

+
+The :func`Series.rolling`, :func:`DataFrame.rolling`, :func`Series.expanding`, :func:`DataFrame.expanding` methods when used with ``.apply()`` have gained a ``raw=False`` parameter.
+This is similar to :func:`DataFame.apply`. This parameter, ``False`` by default allows one to send a ``np.ndarray`` to the applied function, rather than the default of a ``Series``.
+This is a change from prior versions, when the applied function would *always* received an ndarray. This allow one to use pandas operations generically. (:issue:`5071`)


Minor typo on this line - either "would always receive" or "always received"

jreback · 2018-04-10T12:05:04Z

this is all fixed up. now back-compat (with a warning). if anyone wants to look

cc @WillAyd
@TomAugspurger

WillAyd · 2018-04-11T03:45:46Z

lgtm

jorisvandenbossche

This is really nice to get it more consistent with DataFrame.apply!

jorisvandenbossche · 2018-04-10T16:35:29Z

doc/source/whatsnew/v0.23.0.txt

+Rolling/Expanding.apply() accepts a ``raw`` keyword to pass a ``Series`` to the function
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The :func`Series.rolling`, :func:`DataFrame.rolling`, :func`Series.expanding`, :func:`DataFrame.expanding` methods when used with ``.apply()`` have gained a ``raw=None`` parameter.


You could actually link directly to the apply method here (which is more correct, as it is not rolling that gained a keyword). I think something like :func:`Series.rolling().apply() <pandas.core.window.Rolling.apply>` might do it

Can you also add a one-line notice to the Deprecations section?

jorisvandenbossche · 2018-04-10T18:53:40Z

pandas/core/window.py

+          objects instead.
+          If you are just applying a NumPy reduction function this will
+          achieve much better performance.
+        .. versionadded:: 0.23.0


blank line above this one

jorisvandenbossche · 2018-04-10T18:54:35Z

pandas/core/window.py

+        * ``False`` : passes each row or column as a Series to the
+          function.
+        * ``True`` or ``None`` : the passed function will receive ndarray
+          objects instead.


can you indicate something about the deprecation / changing default?

jorisvandenbossche · 2018-04-10T19:44:07Z

pandas/core/window.py

+                "for .apply().\nIn the future, this will default to "
+                "False, meaning a Series will be passed to the "
+                "applied function. Not passing raw, defaults "
+                "raw=True, meaning a ndarray is passed to the "


I find this a bit confusing, as in the future this will not be the default. I would try to rephrase a bit like "Now does this, in future will change to this. Pass this to keep current behaviour, pass this to silence the warning"

jorisvandenbossche · 2018-04-10T19:45:57Z

pandas/core/window.py

@@ -314,7 +314,7 @@ def _center_window(self, result, window):
    def aggregate(self, arg, *args, **kwargs):
        result, how = self._aggregate(arg, *args, **kwargs)
        if result is None:
-            return self.apply(arg, args=args, kwargs=kwargs)
+            return self.apply(arg, raw=False, args=args, kwargs=kwargs)


Will this have any consequences for the user?
(but, on the other hand, I don't think it is worth to add a raw keyword just to be able to deprecate the behaviour and change it more slowly)

no, this is a very special case

It's not that a special case, it happens when the user passes a custom function to r.agg(..), which eg is done in the documentation about it: http://pandas.pydata.org/pandas-docs/stable/computation.html#aggregation

So just as with apply, if the user has a function that behaves differently for a Series than an array, this will break.

jorisvandenbossche · 2018-04-10T19:50:45Z

pandas/core/window.py

@@ -955,22 +955,48 @@ def count(self):
    ----------
    func : function
        Must produce a single value from an ndarray input


Can you update the "ndarray input" here?

jorisvandenbossche · 2018-04-10T20:36:07Z

pandas/tests/test_window.py

-        # #1850
+    @pytest.mark.parametrize(
+        'method', [lambda x: x.rolling(window=2), lambda x: x.expanding()])
+    def test_apply_future_warning(self, method):


parametrize for Series/DataFrame ? (although they maybe take fully the same code path?)

jorisvandenbossche · 2018-04-10T20:37:22Z

pandas/tests/test_window.py

+
+        result = df.rolling(2).apply(f, raw=False)
+        expected = df[1:].reindex_like(df)
+        tm.assert_frame_equal(result, expected)


Can you add the case with raw=True and assert that an AttributeError is raised ? (or pass a function that asserts it gets a numpy array)

And can you do this test (with such a function that requires a pandas object) also with a freq window?

jorisvandenbossche · 2018-04-11T07:32:16Z

While trying this out, I also noticed that rolling apply only seems to work on numeric (int/float) data. Is this a known limitation/bug? (didn't directly find an issue about it):

In [69]: s = pd.Series(list('abcd'))

In [71]: s.rolling(2).apply(lambda x: x+1, raw=True)   
Out[71]: 
0    a         #<-------- for string data, just returns the original series, whathever the function is
1    b
2    c
3    d
dtype: object

In [72]: s = pd.Series(pd.Categorical(list('abcd')))

In [73]: s.rolling(2).apply(lambda x: x+1, raw=True)
Out[73]: 
[a, b, c, d]     #<-------- for categorical data, just returns the original Categorical (not even series)
Categories (4, object): [a, b, c, d]

In [74]: s = pd.Series(pd.timedelta_range("1 days", periods=4))

In [75]: s.rolling(2).apply(lambda x: x+1, raw=True)   #<----- for timedelta/datetime, raises informative error
...
NotImplementedError: ops for Rolling for this dtype timedelta64[ns] are not implemented

(this is a separate issue as this PR, it's already like this in released version)

jreback · 2018-04-15T18:43:24Z

any comments.

closes pandas-dev#5071

jorisvandenbossche · 2018-04-16T09:25:19Z

@jreback can you please work with additional commits, that makes a second review round really a lot easier for me

jorisvandenbossche

I think we still need to decide about the rolling().agg case (https://github.com/pandas-dev/pandas/pull/20584/files#r180718011). It got (in certain cases of passing a custom function) a ndarray before, which is now changed to a Series.
I don't think it is worth adding a raw keyword here as well (people who need it can use apply), and I am fine with deciding that it is also not worth to deprecate, but then we should do that explicitly (thus also document this API change).

Apart from that, looks good to me, added only some doc comments

jorisvandenbossche · 2018-04-16T09:27:07Z

doc/source/whatsnew/v0.23.0.txt

@@ -843,6 +872,9 @@ Deprecations
 - ``Index.summary()`` is deprecated and will be removed in a future version (:issue:`18217`)
 - ``NDFrame.get_ftype_counts()`` is deprecated and will be removed in a future version (:issue:`18243`)
 - The ``convert_datetime64`` parameter in :func:`DataFrame.to_records` has been deprecated and will be removed in a future version. The NumPy bug motivating this parameter has been resolved. The default value for this parameter has also changed from ``True`` to ``None`` (:issue:`18160`).
+- :func:`Series.rolling().apply() <pandas.core.window.Rolling.apply>`, :func:`DataFrame.rolling().apply() <pandas.core.window.Rolling.apply>`,
+  :func:`Series.expanding().apply() <pandas.core.window.Expanding.apply>`, and :func:`DataFrame.expanding().apply() <pandas.core.window.Expanding.apply>` will show a ``FutureWarning`` if the new
+  ``raw`` parameter is not explicity passed (:issue:`20584`)


I would rather say what has been deprecated: passing an array by default (and then indeed raw can be used to keep this behaviour or not)

jorisvandenbossche · 2018-04-16T09:28:46Z

pandas/core/window.py

+          achieve much better performance.
+
+        The raw parameter is required and will show a FutureWarning if
+        not passed. In the futures raw will default to False.


futures -> future, raw -> `raw`

jorisvandenbossche · 2018-04-16T09:30:03Z

pandas/core/window.py

@@ -954,23 +954,53 @@ def count(self):
    Parameters
    ----------
    func : function
-        Must produce a single value from an ndarray input
-        \*args and \*\*kwargs are passed to the function""")
+        Must produce a single value from an ndarray input if raw=True


quote raw=True as a code snippet (``raw=True``), same on line below for raw=False

jorisvandenbossche · 2018-04-16T09:31:36Z

pandas/core/window.py

+        # change to False in the future
+        if raw is None:
+            warnings.warn(
+                "raw defaults to False "


Can you put single quotes around raw? (like "'raw' defaults to False ...") And two lines below as well. I think that will make it read better (clearer that it is a keyword)

Also, I think it is the other way around? The warning says "default is False, in future will change to True", but isn't it the other way around?

jreback · 2018-04-16T10:13:53Z

updated, I adjusted the .aggregate doc-string & put in an api-note. lmk if any further comments.

jreback · 2018-04-16T14:53:51Z

bombs away

jorisvandenbossche · 2018-04-16T15:03:39Z

Thanks!

H0bbitBaron · 2018-08-02T14:16:31Z

@jreback hello! I googled insanely, but now I'm even more confused. Is there a way to pass 2D with few columns instead of 1D series to apply(function)? There was a function rolling_apply, but it's depricated. Ended up writing my own function with cycle inside, which emulates rolling window with some cross-column calculations, but it's rather slow. Thank you.

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Apr 2, 2018

jreback added this to the 0.23.0 milestone Apr 2, 2018

jreback commented Apr 2, 2018

View reviewed changes

jreback force-pushed the rolling branch from a55f486 to 4594457 Compare April 2, 2018 22:32

WillAyd reviewed Apr 3, 2018

View reviewed changes

jreback force-pushed the rolling branch from 4594457 to 74df5a8 Compare April 10, 2018 12:02

jorisvandenbossche reviewed Apr 11, 2018

View reviewed changes

jsexauer mentioned this pull request Apr 11, 2018

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

jreback force-pushed the rolling branch 4 times, most recently from 5e47143 to 5c5abe5 Compare April 15, 2018 18:41

TomAugspurger approved these changes Apr 15, 2018

View reviewed changes

API: rolling.apply will pass Series to function

946233c

closes pandas-dev#5071

jorisvandenbossche reviewed Apr 16, 2018

View reviewed changes

update to comments

b83c8b1

jreback force-pushed the rolling branch from 5c5abe5 to b83c8b1 Compare April 16, 2018 10:13

jreback and others added 2 commits April 16, 2018 06:14

typo

96c7ae8

typo + reword warning message

ab46058

jreback merged commit 4a34497 into pandas-dev:master Apr 16, 2018

jorisvandenbossche mentioned this pull request Apr 19, 2018

DOC: use apply(raw=True) in docs to silence warning #20741

Merged

jreback mentioned this pull request May 30, 2018

Rolling Data Frames Don't Include Index Values #21236

Closed

jreback mentioned this pull request Nov 26, 2019

DEPR: deprecations log for removed issues #13777

Closed

AlexKirko mentioned this pull request Jul 10, 2020

BUG: indexing with -1 doesn't work anymore in expanding function #35203

Closed

3 tasks


		self._check_expanding(expanding_mean, np.mean)
		# TODO(jreback), needed to add preserve_nan=False

Uh oh!

API: rolling.apply will pass Series to function #20584

API: rolling.apply will pass Series to function #20584

Uh oh!

Conversation

jreback commented Apr 2, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomAugspurger commented Apr 2, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Apr 2, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Apr 2, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Apr 10, 2018

Uh oh!

WillAyd commented Apr 11, 2018

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche commented Apr 11, 2018

Uh oh!

jreback commented Apr 15, 2018

Uh oh!

jorisvandenbossche commented Apr 16, 2018

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

TomAugspurger commented Apr 2, 2018 •

edited

Loading

jreback commented Apr 2, 2018 •

edited

Loading

codecov bot commented Apr 2, 2018 •

edited

Loading