-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: rolling.apply will pass Series to function #20584
Conversation
|
||
self._check_expanding(expanding_mean, np.mean) | ||
# TODO(jreback), needed to add preserve_nan=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if anyone is interested to see why this doesn't work, would be hepful, @WillAyd (you have some xp in this area)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like rows [20:39] of the series in use here contains NA values. preserve_nan
is assuming that those will still be NA after the function application, but isn't it by design with expanding functions that the NA values do not get preserved? Ex:
>>> ser = pd.Series(range(5))
>>> ser[2] = np.nan
>>> ser.expanding().sum()
0 0.0
1 1.0
2 1.0
3 4.0
4 8.0
dtype: float64
So I think this keyword is necessary by design of the expanding function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah exactly - so wondering if the current impl (eg master) is actually wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at the implementation in detail but I'm assuming this ties back to the default argument of min_periods=1
that the Expanding class uses - I think that would make it almost impossible for NA data to appear while expanding.
I'd buy the argument that the default here should match the Rolling impl (i.e. 0) and Expanding should deal with "expected" NAs (i.e. forward looking values in the window that just haven't been reached yet) vs "unexpected" NA (data from a previous entry in the window) in a more robust fashion.
It would be a breaking change; happy to open the ticket if you agree
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this might be a bit of an unexplored corner
yeah if i can show a simple example would be great to open an issue
after all expanding is just rolling with the window len of the len of the obj
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it shouldn't be though? I.e. should the window size of expanding be i for i in range(N)
instead of N for i in range(N)
? The difference can be illustrated looking at the third element below (elements 4 and 5 are expected to differ from window size)
>>> ser = pd.Series([0, np.nan, 2, 3, 4])
In [127]: ser.rolling(window=2, min_periods=2).sum()
Out[127]:
0 NaN
1 NaN
2 NaN # Doesn't populate here; there aren't 2 non-NA entries at this point
3 5.0
4 7.0
dtype: float64
In [128]: ser.rolling(window=len(ser), min_periods=2).sum()
Out[128]:
0 NaN
1 NaN
2 2.0 # Populates here, even though this is the first non-NA record and min_periods is 2
3 5.0
4 9.0
dtype: float64
Thoughts on the default being ndarray with a warning that the default will change to |
yeah i could make the keyword default (of None) and then warn that it’s going to change that’s back compat, though would have to flip it at some point anyhow |
Codecov Report
@@ Coverage Diff @@
## master #20584 +/- ##
==========================================
- Coverage 91.84% 91.83% -0.01%
==========================================
Files 153 153
Lines 49279 49295 +16
==========================================
+ Hits 45259 45272 +13
- Misses 4020 4023 +3
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.23.0.txt
Outdated
|
||
The :func`Series.rolling`, :func:`DataFrame.rolling`, :func`Series.expanding`, :func:`DataFrame.expanding` methods when used with ``.apply()`` have gained a ``raw=False`` parameter. | ||
This is similar to :func:`DataFame.apply`. This parameter, ``False`` by default allows one to send a ``np.ndarray`` to the applied function, rather than the default of a ``Series``. | ||
This is a change from prior versions, when the applied function would *always* received an ndarray. This allow one to use pandas operations generically. (:issue:`5071`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor typo on this line - either "would always receive" or "always received"
this is all fixed up. now back-compat (with a warning). if anyone wants to look |
lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice to get it more consistent with DataFrame.apply!
doc/source/whatsnew/v0.23.0.txt
Outdated
Rolling/Expanding.apply() accepts a ``raw`` keyword to pass a ``Series`` to the function | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
The :func`Series.rolling`, :func:`DataFrame.rolling`, :func`Series.expanding`, :func:`DataFrame.expanding` methods when used with ``.apply()`` have gained a ``raw=None`` parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could actually link directly to the apply method here (which is more correct, as it is not rolling
that gained a keyword). I think something like :func:`Series.rolling().apply() <pandas.core.window.Rolling.apply>`
might do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a one-line notice to the Deprecations section?
pandas/core/window.py
Outdated
objects instead. | ||
If you are just applying a NumPy reduction function this will | ||
achieve much better performance. | ||
.. versionadded:: 0.23.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line above this one
* ``False`` : passes each row or column as a Series to the | ||
function. | ||
* ``True`` or ``None`` : the passed function will receive ndarray | ||
objects instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you indicate something about the deprecation / changing default?
pandas/core/window.py
Outdated
"for .apply().\nIn the future, this will default to " | ||
"False, meaning a Series will be passed to the " | ||
"applied function. Not passing raw, defaults " | ||
"raw=True, meaning a ndarray is passed to the " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this a bit confusing, as in the future this will not be the default. I would try to rephrase a bit like "Now does this, in future will change to this. Pass this to keep current behaviour, pass this to silence the warning"
@@ -314,7 +314,7 @@ def _center_window(self, result, window): | |||
def aggregate(self, arg, *args, **kwargs): | |||
result, how = self._aggregate(arg, *args, **kwargs) | |||
if result is None: | |||
return self.apply(arg, args=args, kwargs=kwargs) | |||
return self.apply(arg, raw=False, args=args, kwargs=kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this have any consequences for the user?
(but, on the other hand, I don't think it is worth to add a raw
keyword just to be able to deprecate the behaviour and change it more slowly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, this is a very special case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not that a special case, it happens when the user passes a custom function to r.agg(..)
, which eg is done in the documentation about it: http://pandas.pydata.org/pandas-docs/stable/computation.html#aggregation
So just as with apply
, if the user has a function that behaves differently for a Series than an array, this will break.
pandas/core/window.py
Outdated
@@ -955,22 +955,48 @@ def count(self): | |||
---------- | |||
func : function | |||
Must produce a single value from an ndarray input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the "ndarray input" here?
pandas/tests/test_window.py
Outdated
# #1850 | ||
@pytest.mark.parametrize( | ||
'method', [lambda x: x.rolling(window=2), lambda x: x.expanding()]) | ||
def test_apply_future_warning(self, method): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parametrize for Series/DataFrame ? (although they maybe take fully the same code path?)
pandas/tests/test_window.py
Outdated
|
||
result = df.rolling(2).apply(f, raw=False) | ||
expected = df[1:].reindex_like(df) | ||
tm.assert_frame_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the case with raw=True
and assert that an AttributeError is raised ? (or pass a function that asserts it gets a numpy array)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And can you do this test (with such a function that requires a pandas object) also with a freq window?
While trying this out, I also noticed that rolling apply only seems to work on numeric (int/float) data. Is this a known limitation/bug? (didn't directly find an issue about it): In [69]: s = pd.Series(list('abcd'))
In [71]: s.rolling(2).apply(lambda x: x+1, raw=True)
Out[71]:
0 a #<-------- for string data, just returns the original series, whathever the function is
1 b
2 c
3 d
dtype: object
In [72]: s = pd.Series(pd.Categorical(list('abcd')))
In [73]: s.rolling(2).apply(lambda x: x+1, raw=True)
Out[73]:
[a, b, c, d] #<-------- for categorical data, just returns the original Categorical (not even series)
Categories (4, object): [a, b, c, d]
In [74]: s = pd.Series(pd.timedelta_range("1 days", periods=4))
In [75]: s.rolling(2).apply(lambda x: x+1, raw=True) #<----- for timedelta/datetime, raises informative error
...
NotImplementedError: ops for Rolling for this dtype timedelta64[ns] are not implemented (this is a separate issue as this PR, it's already like this in released version) |
5e47143
to
5c5abe5
Compare
any comments. |
@jreback can you please work with additional commits, that makes a second review round really a lot easier for me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still need to decide about the rolling().agg
case (https://github.com/pandas-dev/pandas/pull/20584/files#r180718011). It got (in certain cases of passing a custom function) a ndarray before, which is now changed to a Series.
I don't think it is worth adding a raw
keyword here as well (people who need it can use apply
), and I am fine with deciding that it is also not worth to deprecate, but then we should do that explicitly (thus also document this API change).
Apart from that, looks good to me, added only some doc comments
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -843,6 +872,9 @@ Deprecations | |||
- ``Index.summary()`` is deprecated and will be removed in a future version (:issue:`18217`) | |||
- ``NDFrame.get_ftype_counts()`` is deprecated and will be removed in a future version (:issue:`18243`) | |||
- The ``convert_datetime64`` parameter in :func:`DataFrame.to_records` has been deprecated and will be removed in a future version. The NumPy bug motivating this parameter has been resolved. The default value for this parameter has also changed from ``True`` to ``None`` (:issue:`18160`). | |||
- :func:`Series.rolling().apply() <pandas.core.window.Rolling.apply>`, :func:`DataFrame.rolling().apply() <pandas.core.window.Rolling.apply>`, | |||
:func:`Series.expanding().apply() <pandas.core.window.Expanding.apply>`, and :func:`DataFrame.expanding().apply() <pandas.core.window.Expanding.apply>` will show a ``FutureWarning`` if the new | |||
``raw`` parameter is not explicity passed (:issue:`20584`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather say what has been deprecated: passing an array by default (and then indeed raw
can be used to keep this behaviour or not)
pandas/core/window.py
Outdated
achieve much better performance. | ||
|
||
The raw parameter is required and will show a FutureWarning if | ||
not passed. In the futures raw will default to False. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
futures -> future, raw -> `raw`
pandas/core/window.py
Outdated
@@ -954,23 +954,53 @@ def count(self): | |||
Parameters | |||
---------- | |||
func : function | |||
Must produce a single value from an ndarray input | |||
\*args and \*\*kwargs are passed to the function""") | |||
Must produce a single value from an ndarray input if raw=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quote raw=True as a code snippet (``raw=True``
), same on line below for raw=False
pandas/core/window.py
Outdated
# change to False in the future | ||
if raw is None: | ||
warnings.warn( | ||
"raw defaults to False " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put single quotes around raw? (like "'raw' defaults to False ...") And two lines below as well. I think that will make it read better (clearer that it is a keyword)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think it is the other way around? The warning says "default is False, in future will change to True", but isn't it the other way around?
updated, I adjusted the |
bombs away |
Thanks! |
@jreback hello! I googled insanely, but now I'm even more confused. Is there a way to pass 2D with few columns instead of 1D series to apply(function)? There was a function rolling_apply, but it's depricated. Ended up writing my own function with cycle inside, which emulates rolling window with some cross-column calculations, but it's rather slow. Thank you. |
closes #5071