-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
Description
Sparked by @betatim's question on twitter: https://twitter.com/betatim/status/1141321049918906368
Suppose you have the following data:
repo_id = np.random.choice(np.arange(1000), 10000)
index = pd.Timestamp("2019-01-01") + pd.to_timedelta(np.random.randint(0, 24*30*6, size=10000), unit='H')
df = pd.DataFrame({'repo_id': repo_id}, index=index).sort_index()
In [111]: df.head()
Out[111]:
repo_id
2019-01-01 01:00:00 162
2019-01-01 01:00:00 850
2019-01-01 01:00:00 414
2019-01-01 02:00:00 753
2019-01-01 02:00:00 125
Data at somewhat randomly points in time (here hourly, but precision doesn't matter much).
Assume now you want to calculate a rolling 30D (monthly) statistic for each day.
For example the rolling number of unique values over a period of time (the elegant but somewhat verbose python code to do this: https://gist.github.com/betatim/c59039682d92fab89859358e8c585313)
A rolling operation does not exactly give what you want, because df.rolling() will calculate this 24H mean for each row, while there might be many rows for a single day, and you are only interested in 1 value for that day.
A groupby/resample operation can also not achieve this goal, as that could give monthly number of unique values, but it cannot shift this month period one day at a time, since the groups cannot be overlapping.
With rolling you can do something like df.rolling("30D").nunique().resample('D').last(). But this is 1) very slow, as the rolling in the first step is doing way too many calculations that you afterwards throw away with resample().last(), and 2) I am not sure the binning for timestamps within the same day happens fully as desired.
I found this an interesting use case, to think about if there are ways to better handle this in pandas, or how such an API could look like.