API? how to do a "rolling groupby" or groupby with overlapping groups? #26959

Open
jorisvandenbossche opened this issue Jun 20, 2019 · 11 comments

@jorisvandenbossche (Member)

Sparked by @betatim's question on twitter: https://twitter.com/betatim/status/1141321049918906368

Suppose you have the following data:

import numpy as np
import pandas as pd

repo_id = np.random.choice(np.arange(1000), 10000)
index = pd.Timestamp("2019-01-01") + pd.to_timedelta(np.random.randint(0, 24*30*6, size=10000), unit='H')
df = pd.DataFrame({'repo_id': repo_id}, index=index).sort_index()
In [111]: df.head() 
Out[111]: 
                     repo_id
2019-01-01 01:00:00      162
2019-01-01 01:00:00      850
2019-01-01 01:00:00      414
2019-01-01 02:00:00      753
2019-01-01 02:00:00      125

Data at somewhat random points in time (here hourly, but the precision doesn't matter much).

Assume you now want to calculate a rolling 30D (monthly) statistic for each day.
For example, the rolling number of unique values over that period (the elegant but somewhat verbose Python code to do this: https://gist.github.com/betatim/c59039682d92fab89859358e8c585313).

A rolling operation does not give exactly what you want, because df.rolling() will calculate this 30D statistic for each row, while there might be many rows for a single day, and you are only interested in one value per day.
A groupby/resample operation cannot achieve this either: it could give the monthly number of unique values, but it cannot shift this month-long period one day at a time, since the groups cannot overlap.

With rolling you can do something like df.rolling("30D").nunique().resample('D').last(). But this is 1) very slow, as the rolling in the first step is doing way too many calculations that you afterwards throw away with resample().last(), and 2) I am not sure the binning for timestamps within the same day happens fully as desired.

I found this an interesting use case to think about: are there ways to handle this better in pandas, and what could such an API look like?
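(For reference, a minimal sketch of that loop-based approach (not the gist's actual code): iterate over daily anchor points and compute the statistic on the trailing 30-day slice selected with .loc. The helper name rolling_daily_nunique is made up for illustration.)

import pandas as pd

def rolling_daily_nunique(df, col='repo_id', window='30D'):
    # One anchor per calendar day present in the index
    days = df.index.normalize().unique()
    out = {}
    for day in days:
        # Trailing window ending at the end of this day (.loc slicing is inclusive)
        start = day - pd.Timedelta(window)
        end = day + pd.Timedelta('1D') - pd.Timedelta('1ns')
        out[day] = df.loc[start:end, col].nunique()
    return pd.Series(out)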

@WillAyd (Member) commented Jun 20, 2019

I might be missing the point, as I was having a hard time running the code samples as-is, but wouldn't this do it for you?

df.groupby(pd.Grouper(freq='1D')).count().rolling('30D').sum()

If you get the number of daily transactions first before doing the rolling op, it should be much more efficient.

@WillAyd added the Usage Question and Window (rolling, ewma, expanding) labels on Jun 20, 2019
@jorisvandenbossche (Member Author)

I think something like that was suggested on Twitter as well, but: that does not take the unique values over a 30D period, only over a single day.
You cannot sum daily unique counts to get a monthly unique count (it's not just the number of "transactions", but the number of unique transactions; if it were only the count, then yes, that would have been additive).
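(A tiny illustration with made-up values:)

import pandas as pd

day1 = pd.Series([1, 2, 3])   # repos seen on day 1
day2 = pd.Series([2, 3, 4])   # repos seen on day 2

day1.nunique() + day2.nunique()        # 6
pd.concat([day1, day2]).nunique()      # 4: repos 2 and 3 are shared between the days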

@WillAyd (Member) commented Jun 20, 2019

So would #26958 solve this then? I think this will always be two steps (i.e. perform the rolling calc then resample / reshape as desired) from an API perspective

@jorisvandenbossche (Member Author) commented Jun 20, 2019

So would #26958 solve this then?

No, you can already do a "nunique" operation at the moment, but you need to type the verbose .apply(pd.Series.nunique, raw=False) for that. So I just opened that issue to make the shortcut rolling().nunique() also work.

I think this will always be two steps (i.e. perform the rolling calc then resample / reshape as desired) from an API perspective

With the current API, yes, you need two steps. But that's also why I opened this issue: doing it in two steps is very non-performant. The possible solution I mentioned above like:

df.rolling("30D")['repo_id'].apply(lambda x: len(pd.unique(x)), raw=True).resample('D').last()
# df.rolling("30D").nunique().resample('D').last()  # after #26958 

is kind of working, but is much slower than the function written by @betatim in https://gist.github.com/betatim/c59039682d92fab89859358e8c585313 (which just uses a Python for loop over the groups that get selected with .loc and combines the results in a dataframe).
This is because the rolling step is doing way too many calculations that are afterwards thrown away with resample().last().

@WillAyd (Member) commented Jun 20, 2019

The possible solution I mentioned above like:

df.rolling("30D")['repo_id'].apply(lambda x: len(pd.unique(x)), raw=True).resample('D').last()

is kind of working, but is much slower than the function written by @betatim in

If so, wouldn't the bottleneck there just be the apply op? If that is implemented, I don't think performance would be a concern.

From an API perspective I can't imagine right now that rolling should do any kind of resampling or reduction on its own.

@jorisvandenbossche (Member Author)

Another way to put it: it's a rolling operation, but instead of doing it row by row (calculating the statistic while moving the window one row at a time), your window shifts in bigger jumps.
Only in this case, the jumps are not a regular number of rows, but are tied to a datetime frequency.

@WillAyd (Member) commented Jun 21, 2019

Another way to put it: it's a rolling operation, but instead of doing it row by row (calculating the statistic while moving the window one row at a time), your window shifts in bigger jumps.
Only in this case, the jumps are not a regular number of rows, but are tied to a datetime frequency.

I still don't totally see how this should fit into the API design of rolling; it seems to me it should still be two separate ops to roll and group, and in that case I think your other PR regarding nunique covers the performance need here.

If I'm mistaken though can you maybe provide a smaller reproducible example that highlights the expected values?

@WillAyd (Member) commented Jun 25, 2019

I've thought about this a bit more and I think the current API isn't just limited, it's wrong.
By my reasoning, when you ask for rolling(3) you're asking for a window of 3 rows and the engine rolls one row at a time; by analogy, when you ask for rolling("3D") you're asking for a 3-day window, so the engine should roll 1 day at a time.

Is there a precedent for this in other applications? Taking Postgres as an example, by their definition:

A window function performs a calculation across a set of table rows that are somehow related to the current row

https://www.postgresql.org/docs/current/tutorial-window.html

So curious to see what everyone else is doing.

Going back to the OP, I'm still not entirely clear on what is driving the larger discussion here. The main problem there seemed to be performance, which is totally unrelated to the API design of rolling.

Why do we want to cram both the rolling and the reshaping into one operation? As it stands, rolling by definition returns something that matches the size of the calling object, so with your proposals I think we'd either be breaking that assumption or essentially broadcasting the result of those rolling operations across a frequency window. In either case, I don't see why we wouldn't want to leave the latter operation to a resample or a transformation.

@ghost commented Jul 14, 2019

Previous enhancement requests asking for a stride argument to rolling: #15354, #22976, #27654 (comment), dask/dask#4659, numpy/numpy#7753
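(For context, those requests ask for a step/stride so the window only advances every N rows; with an integer window the effect can be emulated today by slicing the full rolling result, which is exactly the wasteful part for large data. A minimal sketch:)

import pandas as pd

s = pd.Series(range(10))
full = s.rolling(3).sum()       # the 3-row window advances one row at a time
strided = full.iloc[2::3]       # keep every 3rd value, i.e. the window effectively jumps 3 rows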

@xuancong84 commented Jun 15, 2020

Yup, I also encountered the same missing functionality. My proposed solution in #28302 also addresses this issue.

Currently, pd.Grouper ignores loffset; it uses base for the split-boundary offset.

Ideally, there need to be 3 parameters: loffset (label_offset, default=0, the starting boundary offset of every group label), gstart (group_start, default=loffset, the starting offset of every group's data), and gspan (group_span, default=1, the span of every group's data).

For example, suppose we want to split a time series into daily groups running from 9:00am each day until 9:00am the next day, while keeping the group label at 0:00am (so that the datetime label reduces to a date, i.e., 2020-05-26 00:00:00 => 2020-05-26). Within each group, however, we also want the previous 2 days of data (i.e., from 9am two days before that day until 9am on that day) in addition to that day's data (i.e., from 9am on that day until 9am the next day); in other words, there will be duplicate entries among adjacent groups and the total number of data rows will be tripled. Then we would call pd.Grouper(freq='D', loffset=0, gstart=9.0/24-2, gspan=3). This would give rise to:

[('2020-01-01T00:00:00', <data from 2019-12-30T09:00:00 to 2020-01-02T09:00:00>),
 ('2020-01-02T00:00:00', <data from 2019-12-31T09:00:00 to 2020-01-03T09:00:00>),
 ('2020-01-03T00:00:00', <data from 2020-01-01T09:00:00 to 2020-01-04T09:00:00>),
 ('2020-01-04T00:00:00', <data from 2020-01-02T09:00:00 to 2020-01-05T09:00:00>),
 ...]
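(The gstart/gspan parameters above do not exist in pandas today; as a rough sketch, the described overlapping grouping can be emulated with the current API by a hypothetical helper that yields (label, slice) pairs, assuming a sorted DatetimeIndex:)

import pandas as pd

def overlapping_groups(df, gstart=pd.Timedelta(hours=9) - pd.Timedelta(days=2),
                       gspan=pd.Timedelta(days=3)):
    # One label per calendar day; each group's data covers [label + gstart, label + gstart + gspan)
    labels = pd.date_range(df.index.min().normalize(), df.index.max().normalize(), freq='D')
    for label in labels:
        start = label + gstart
        end = start + gspan - pd.Timedelta('1ns')   # .loc slicing is inclusive on both ends
        yield label, df.loc[start:end]

Each yielded slice overlaps its neighbours by two days, matching the tripled row count described above.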

@mbrb commented Jun 17, 2020

(quoting @xuancong84's comment above in full)

Thanks for sharing your insights. Any idea how to get this workaround working for the following scenario: https://stackoverflow.com/questions/62351499/how-to-group-pandas-dataframe-by-date-with-overlapping-sliding-window
