API? how to do a "rolling groupby" or groupby with overlapping groups? #26959
Comments
I might be missing the point, as I was having a hard time running the code samples as-is, but wouldn't this do it for you? `df.groupby(pd.Grouper(freq='1D')).count().rolling('30D').sum()` If you get the number of daily transactions first before doing the rolling op, it should be much more efficient.
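For a statistic that decomposes per day, like a plain count, a minimal runnable sketch of that suggestion could look like this (the synthetic data and the repo_id column name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Assumed setup: one row per event, hourly timestamps on a DatetimeIndex.
rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", "2019-03-31", freq="H")
df = pd.DataFrame({"repo_id": rng.integers(0, 100, size=len(idx))}, index=idx)

# Reduce to daily counts first, then roll a 30-day sum over the much smaller
# daily series.
daily = df.groupby(pd.Grouper(freq="1D")).count()
rolling_30d = daily.rolling("30D").sum()
print(rolling_30d.tail())
```

As the next comment points out, this works for counts and sums, but not for a statistic like the number of unique values, which cannot be aggregated per day and then summed.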
I think something like that was suggested on twitter as well, but that does not take the unique values over a 30D period, only for a single day.
So would #26958 solve this then? I think this will always be two steps (i.e. perform the rolling calc, then resample / reshape as desired) from an API perspective.
No, you can already do a "nunique" operation at the moment, but you need to type the verbose `apply(lambda x: len(pd.unique(x)), raw=True)`.

With the current API, yes, you need two steps. But that's also why I opened this issue, because doing it in two steps is very non-performant. The possible solution I mentioned above, like:

df.rolling("30D")['repo_id'].apply(lambda x: len(pd.unique(x)), raw=True).resample('D').last()
# df.rolling("30D").nunique().resample('D').last()  # after #26958

is kind of working, but is much slower than the function written by @betatim in https://gist.github.com/betatim/c59039682d92fab89859358e8c585313 (which just uses a python for loop over the groups that get selected).
If so, wouldn't the bottleneck there just be the apply op? If that is implemented, I don't think performance would be a concern. From an API perspective, I can't imagine right now that rolling should do any kind of resampling or reduction on its own.
Another way to put it: it's a rolling operation, but instead of doing it row by row (calculating the statistic while moving the window one row at a time), your window shifts with bigger jumps.
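A rough sketch of that idea, not an existing pandas API: evaluate the 30D window only at daily anchor points instead of at every row (the data and the statistic here are placeholders):

```python
import numpy as np
import pandas as pd

# Assumed setup: hourly observations on a DatetimeIndex.
rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", "2019-03-31", freq="H")
df = pd.DataFrame({"value": rng.normal(size=len(idx))}, index=idx)

days = pd.date_range(df.index[0].normalize(), df.index[-1].normalize(), freq="D")
ts = df.index.values  # sorted timestamps, used for fast window lookup

stat_per_day = {}
for day in days:
    # 30-day window ending at the end of this day; the window jumps one day
    # at a time instead of one row at a time.
    lo = np.searchsorted(ts, (day - pd.Timedelta("29D")).to_datetime64())
    hi = np.searchsorted(ts, (day + pd.Timedelta("1D")).to_datetime64())
    stat_per_day[day] = df["value"].iloc[lo:hi].mean()

print(pd.Series(stat_per_day).tail())
```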
I still don't totally see how this fits into the API design of rolling; it seems to me like it should still be two separate ops, to roll and to group, and in that case I think your other PR regarding nunique covers the performance need here. If I'm mistaken, though, can you maybe provide a smaller reproducible example that highlights the expected values?
Is there a precedent for this in other applications? Taking Postgres as an example, by their definition "A window function performs a calculation across a set of table rows that are somehow related to the current row" (https://www.postgresql.org/docs/current/tutorial-window.html). So I'm curious to see what everyone else is doing.

Going back to the OP, I'm still not entirely clear as to what is driving the larger discussion here. The main problem there seemed to be performance, which is totally unrelated to the API design of rolling. Why do we want to cram both the rolling and the reshaping into one operation? By definition, rolling now returns something that matches the size of the calling object, so with your proposals I think we'd either be breaking that assumption or essentially broadcasting the result of those rolling operations across a frequency window. In either case I don't see why we wouldn't want to leave the latter operation to a resample or a transformation.
Previous enhancement requests asking for a …
Yup, I also encounter the same missing functionality. My proposed solution in #28302 also addresses this issue. Currently, pd.Grouper ignores loffset and uses base for the split boundary offset. Ideally, there would be 3 parameters:

- loffset (label_offset, default=0): the starting boundary offset of every group label
- gstart (group_start, default=loffset): the starting offset of every group's data
- gspan (group_span, default=1): the span of every group's data

For example, suppose we want to split time-series data from 9:00am every day to 9:00am the next day, while keeping the group label at 0am (so that the datetime object reduces to a date, i.e. 2020-05-26 00:00:00 => 2020-05-26), but within each group we want the past 2 days of data (i.e. from 9am two days before that day until 9am on that day) in addition to that day's data (i.e. from 9am on that day until 9am the next day) [in other words, there will be duplicate entries among adjacent groups and the total number of data rows will be tripled]. Then we would call pd.Grouper(freq='D', loffset=0, gstart=9.0/24-2, gspan=3). This will give rise to: …
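Since loffset/gstart/gspan don't exist in pandas today, one way to emulate the 9am-to-9am, 3-day example above is to duplicate every row into the three daily groups whose window contains it and then use a plain groupby. A sketch under those assumptions (data and statistic are made up for illustration):

```python
import numpy as np
import pandas as pd

# Assumed setup: hourly observations on a DatetimeIndex.
rng = np.random.default_rng(0)
idx = pd.date_range("2020-05-20", "2020-05-31", freq="H")
df = pd.DataFrame({"value": rng.normal(size=len(idx))}, index=idx)

# Each row at time t belongs to the 3 daily labels whose 9am-to-9am, 3-day
# window contains t; build those overlapping groups by shifting the label.
base_label = (df.index - pd.Timedelta(hours=9)).normalize()
pieces = []
for k in range(3):  # gspan = 3 days -> each row lands in 3 overlapping groups
    piece = df.copy()
    piece["group"] = base_label + pd.Timedelta(days=k)
    pieces.append(piece)

overlapping = pd.concat(pieces)  # roughly triples the number of rows
result = overlapping.groupby("group")["value"].mean()
print(result)
```

Each label D then covers data from 9am two days before D up to (but not including) 9am the day after D, matching the description above, at the cost of tripling the rows.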
Thanks for sharing your insights. Any idea how to get the workaround working for the following scenario: https://stackoverflow.com/questions/62351499/how-to-group-pandas-dataframe-by-date-with-overlapping-sliding-window
Sparked by @betatim's question on twitter: https://twitter.com/betatim/status/1141321049918906368
Suppose you have the following data:
Data at somewhat random points in time (here hourly, but the precision doesn't matter much).
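The original data snippet didn't survive here; a reconstruction of roughly that shape (the repo_id column name is taken from the workarounds discussed in the comments) might be:

```python
import numpy as np
import pandas as pd

# One row per event, hourly timestamps, and an id column whose number of
# unique values per window is the statistic of interest.
rng = np.random.default_rng(42)
idx = pd.date_range("2019-01-01", "2019-06-30", freq="H")
df = pd.DataFrame({"repo_id": rng.integers(0, 500, size=len(idx))}, index=idx)
print(df.head())
```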
Assume now you want to calculate a rolling 30D (monthly) statistic for each day.
For example, the rolling number of unique values over a period of time (see the elegant but somewhat verbose python code to do this: https://gist.github.com/betatim/c59039682d92fab89859358e8c585313).
A rolling operation does not exactly give what you want, because `df.rolling()` will calculate this statistic for each row, while there might be many rows for a single day, and you are only interested in 1 value for that day.

A groupby/resample operation can also not achieve this goal, as that could give the monthly number of unique values, but it cannot shift this month period one day at a time, since the groups cannot be overlapping.

With rolling you can do something like `df.rolling("30D").nunique().resample('D').last()`. But this is 1) very slow, as the `rolling` in the first step is doing way too many calculations that you afterwards throw away with `resample().last()`, and 2) I am not sure the binning for timestamps within the same day happens fully as desired.

I found this an interesting use case, to think about whether there are ways to better handle this in pandas, or what such an API could look like.
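For concreteness, a plain-Python sketch of the desired end result, in the spirit of the gist linked above: one value per day, counting unique repo_id over the trailing 30 calendar days (the synthetic data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Assumed setup: hourly events with a repo_id column on a DatetimeIndex.
rng = np.random.default_rng(42)
idx = pd.date_range("2019-01-01", "2019-06-30", freq="H")
df = pd.DataFrame({"repo_id": rng.integers(0, 500, size=len(idx))}, index=idx)

# One value per day: number of unique repo_id over the trailing 30 calendar
# days, using inclusive date-string slicing on the DatetimeIndex.
days = pd.date_range(df.index[0].normalize(), df.index[-1].normalize(), freq="D")
nunique_30d = pd.Series(
    {
        day: df.loc[f"{day - pd.Timedelta('29D'):%Y-%m-%d}":f"{day:%Y-%m-%d}", "repo_id"].nunique()
        for day in days
    }
)
print(nunique_30d.tail())
```

This recomputes each day's window from scratch, so it is not meant to be fast, but it makes the expected output explicit: a daily-indexed series of trailing-30D unique counts.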