Rolling window with step size #15354
If you're using 'standard' functions, these are vectorized and so very fast. IIUC the saving here would come from only applying the function a fraction of the time (e.g. every nth value). But is there a case where that makes a practical difference?
This could be done, but I would like to see a use case where this matters. It would break the 'return same size as input' API as well. That said, I don't think this is actually hard to implement (though it would involve a number of changes in the implementation). We use marginal windows (IOW, compute the window and, as you advance, drop off the points that are leaving and add the points that you are gaining), so we would still have to compute everything; you just wouldn't output it.
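(Not from the original thread: a toy illustration of that marginal-window idea for a rolling sum, where each step adds the entering point and drops the leaving one. pandas' actual implementation lives in Cython and differs in detail.)

```python
import numpy as np

def rolling_sum_marginal(values, window):
    # Maintain a running sum: add the point entering the window, drop the point leaving it
    out = np.full(len(values), np.nan)
    acc = 0.0
    for i, x in enumerate(values):
        acc += x
        if i >= window:
            acc -= values[i - window]
        if i >= window - 1:
            out[i] = acc
    return out

print(rolling_sum_marginal(np.arange(10.0), 3))  # [nan, nan, 3., 6., 9., ...]
```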
Thanks for your replies!
My use case is running aggregation functions (not just max) over some large timeseries dataframes - 400 columns, hours of data at 5-25Hz. I've also done a similar thing (feature engineering on sensor data) in the past with data up to 20kHz. Running 30 second windows with a 5 second step saves a big chunk of processing - e.g. at 25Hz with a 5s step it's 1/125th of the work, which makes the difference between it running in 1 minute or 2 hours. I can obviously fall back to numpy, but it'd be nice if there was a higher level API for doing this. I just thought it was worth the suggestion in case others would find it useful too - I don't expect you to build a feature just for me!
you can try resampling to a coarser interval first and then rolling, something like df = df.resample('30s')
Hey @jreback, thanks for the suggestion. This would work if I was just running df.resample('1s').max().rolling(30).max(). However, I'd like to run my reduction function on 30 seconds of data, then move forward 1 second and run it on the next 30 seconds of data, etc. The method above applies a function to 1 second of data, and then another function to 30 results of the first function. Here's a quick example - running a peak-to-peak calculation twice (obviously) doesn't work:

import numpy as np
import pandas

# 10 minutes of data at 5Hz
n = 5 * 60 * 10
rng = pandas.date_range('1/1/2017', periods=n, freq='200ms')
np.random.seed(0)
d = np.cumsum(np.random.randn(n), axis=0)
s = pandas.Series(d, index=rng)
# Peak to peak
def p2p(d):
    return d.max() - d.min()

def p2p_arr(d):
    return d.max(axis=1) - d.min(axis=1)

def rolling_with_step(s, window, step, func):
    # See https://ga7g08.github.io/2015/01/30/Applying-python-functions-in-moving-windows/
    # Build a 2D array of indices: each row is one window, and rows advance by `step`
    vert_idx_list = np.arange(0, s.size - window, step)
    hori_idx_list = np.arange(window)
    A, B = np.meshgrid(hori_idx_list, vert_idx_list)
    idx_array = A + B
    x_array = s.values[idx_array]
    # Label each output value with the timestamp at the centre of its window
    idx = s.index[vert_idx_list + int(window / 2.)]
    d = func(x_array)
    return pandas.Series(d, index=idx)
# Plot data
ax = s.plot(figsize=(12, 8), legend=True, label='Data')
# Plot resample then rolling (obviously does not work)
s.resample('1s').apply(p2p).rolling(window=30, center=True).apply(p2p).plot(ax=ax, label='1s p2p, roll 30 p2p', legend=True)
# Plot rolling window with step
rolling_with_step(s, window=30 * 5, step=5, func=p2p_arr).plot(ax=ax, label='Roll 30, step 1s', legend=True)
@alexlouden from your original description I think something like
IOW, take whatever is in a 5s bin, then reduce it to a single point, then roll over those bins. The general idea is that you have lots of data that can be summarized at a short timescale, but you actually want the rolling of this at a higher level.
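(Added for illustration - the snippet that originally accompanied this comment is not preserved here. A hedged sketch of the bin-then-roll idea, reusing s and p2p from @alexlouden's example above; a 30s window corresponds to 6 of the 5s bins.)

```python
# Reduce each 5s bin to a single summary point, then roll a 30s (6-bin) window over those points
binned = s.resample('5s').apply(p2p)
rolled = binned.rolling(6, center=True).max()
```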
Hey @jreback, I actually want to run a function over 30 seconds of data, every 5 seconds. See the rolling_with_step function in my previous example. The additional step of max/mean doesn't work for my use case.
@jreback, there is a real need for the step functionality that hasn't been brought out in this discussion yet. I second everything that @alexlouden has described, but I would like to add more use cases. Suppose that we are doing time-series analysis with input data sampled approximately every 3 to 10 milliseconds. We are interested in frequency-domain features. The first step in constructing them would be to find out the Nyquist frequency. Suppose by domain knowledge we know that it is 10 Hz (once every 100 ms). That means we need the data to have a frequency of at least 20 Hz (once every 50 ms) if the features are to capture the input signal well. We cannot resample to a lower frequency than that. Ultimately, here are the computations we do:
Here we chose a window size in multiples of 8, and choosing 32 makes the window span 1.6 seconds. The aggregate function returns the single-sided frequency-domain coefficients, without the first (mean) component (the FFT is symmetric, with the mean value at the 0th element). Following is the sample aggregate function:
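(The sample aggregate function did not survive in this copy of the thread; what follows is a rough sketch of such a function, not the commenter's original code.)

```python
import numpy as np

def fft_agg(window_values):
    # One-sided FFT coefficients, dropping the 0th (mean/DC) component
    coeffs = np.fft.rfft(window_values)
    return np.abs(coeffs)[1:]

# e.g. a 32-sample window yields 16 coefficients
print(len(fft_agg(np.random.randn(32))))
```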
Now, we would like to repeat this in a sliding window of, say, every 0.4 seconds or every 0.8 seconds. There is no point in wasting computation by calculating the FFT every 50 ms instead and then slicing later. Further, resampling down to 400 ms is not an option, because 400 ms is just 2.5 Hz, which is much lower than the Nyquist frequency, and doing so would result in all information being lost from the features. That was frequency-domain features, which have applications in many time-series-related scientific experiments. However, even simpler time-domain aggregate functions such as standard deviation cannot be supported effectively by resampling.
Having the 'step' parameter and being able to reduce actual computation by using it has to be the future goal of pandas. If the step parameter only returns fewer points, then it's not worth doing, because we can slice the output anyhow. Perhaps, given the work involved in doing this, we might just recommend that all projects with these needs use NumPy.
@Murmuria you are welcome to submit a pull request to do this. It's actually not that difficult.
While I second the request for a step parameter in rolling, a similar result can be obtained with resample() evaluated at shifted offsets and concatenated:

pandas.concat([
    s.resample('30s', label='left', loffset=pandas.Timedelta(15, unit='s'), base=i).agg(p2p)
    for i in range(30)
]).sort_index().plot(ax=ax, label='Solution with resample()', legend=True, style='k:')

We get the same result (note that the line extends by 30 sec. on both sides). This is still somewhat wasteful, depending on the type of aggregation. For the particular case of a peak-to-peak calculation as in @alexlouden's example,
The step parameter in rolling would also allow using this feature without a datetime index. Is there anyone already working on it?
@alexlouden above said this:
Can @alexlouden or anyone else who knows please share some insight as to how to do this with numpy? From my research so far, it seems it is not trivial to do this in numpy either. In fact, there's an open issue about it here: numpy/numpy#7753. Thanks
Hi @tsando - did the function rolling_with_step above not work for you?
@alexlouden thanks, just checked that function and it seems to still depend on pandas (it takes a series as input and also uses the series index). I was wondering if there's a purely numpy approach to this. In the thread I mentioned, numpy/numpy#7753, they propose a function which uses numpy strides, but strides are hard to understand and translate into window and step inputs.
@tsando Here's a PDF of the blog post I linked to above - looks like the author has changed his Github username and hasn't put his site up again. (I just ran it locally to convert it to PDF). My function above was me just converting his last example to work with Pandas - if you wanted to use numpy directly you could do something like this: https://gist.github.com/alexlouden/e42f1d96982f7f005e62ebb737dcd987 Hope this helps!
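(For reference, a minimal numpy-only sketch of the same idea; it assumes numpy >= 1.20 for sliding_window_view, with window and step given in samples.)

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_with_step_np(a, window, step, func):
    # Build a (n_windows, window) strided view without copying, keep every `step`-th row,
    # then apply the reduction across each window
    windows = sliding_window_view(a, window)[::step]
    return func(windows, axis=1)

x = np.arange(20.0)
print(rolling_with_step_np(x, window=5, step=3, func=np.ptp))
```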
@alexlouden thanks! I just tried it on an array of shape

"this could be done, but i would like to see a usecase where this matters." Whatever project I've worked on using pandas, I have almost always missed this feature; it is useful every time you need to compute the apply only once in a while but still need good resolution inside each window.

I agree and support this feature too.

I need it almost every time when dealing with time series; the feature would give much better control for generating time-series features for both visualization and analysis. Strongly support this idea!

Agree and support this feature too.

This would be very helpful to reduce computing time while still keeping a good window resolution.
I'm providing solution code, which could be further adjusted according to your particular target.
I agree and support this feature. I see it is at a standstill right now.

Calculating and then downsampling is not an option when you have TBs of data.

It would be very helpful in what I do as well. I have TBs of data where I need various statistics of non-overlapping windows to understand local conditions. My current "fix" is to just create a generator that slices the data frames and yields statistics. Would be very helpful to have this feature.
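(Not part of the original comment: a minimal sketch of that kind of generator workaround, assuming an integer window and step measured in rows.)

```python
import numpy as np
import pandas as pd

def window_stats(df, window, step, func=np.mean):
    # Yield (window start label, per-column statistic) for each strided window
    for start in range(0, len(df) - window + 1, step):
        chunk = df.iloc[start:start + window]
        yield chunk.index[0], func(chunk.values, axis=0)

df = pd.DataFrame({'a': np.arange(10.0), 'b': np.arange(10.0) ** 2})
for label, stats in window_stats(df, window=4, step=2):
    print(label, stats)
```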
To contribute to 'further discussion': it would be possible to work around this by using groupby and so on, but that seems to be neither intuitive nor as fast as the rolling implementation (2 seconds for 2.5 million hour-long windows, with sorting). It's impressively fast and useful, but we really need a stride argument to fully utilize its power.
…as-dev#15354) and a hint to how to handle iterating windows (pandas-dev#11704)
I took a look at the problem. This is relatively trivial; however, given the way the code is implemented, from a cursory look I think it'll require someone to slog through manually editing all the rolling routines. None of them respect the window boundaries given by the indexer classes. If they did, both this request and #11704 would be very easily solvable. In any case, I think it is manageable for anyone who wants to spend some time sprucing things up. I initiated a half-baked PR (expected to be rejected, just for an MVP) to demonstrate how I would tackle the problem. Running:

import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(100),
    index=pd.date_range('2020/05/12 12:00:00', '2020/05/12 12:00:10', periods=100))

print('1s rolling window every 2s')
print(data.rolling('1s', step='2s').apply(np.mean))

data.sort_index(ascending=False, inplace=True)

print('1s rolling window every 500ms (and reversed)')
print(data.rolling('1s', step='500ms').apply(np.mean))

yields
For implementation details take a look at the PR (or here: https://github.com/anthonytw/pandas/tree/rolling-window-step). While I would have liked to spend more time to finish it up, I unfortunately have none left to tackle the grunt work of reworking all the rolling functions. My recommendation for anyone who wants to tackle this would be to enforce the window boundaries generated by the indexer classes and unify the rolling_*_fixed/variable functions. With start and end boundaries I don't see any reason they should be different, unless you have a function which does something special with non-uniformly sampled data (in which case that specific function would be better able to handle the nuance, so maybe set a flag or something).
Will this also work for a custom window using the
Hi there, I also second the suggestion. This would be a really useful feature.
I have just such an example here: https://stackoverflow.com/questions/63729190/pandas-resample-daily-data-to-annual-data-with-overlap-and-offset Every Nth would be every 365th. The window size is variable over the lifetime of the program and the step is not guaranteed to be an integer fraction of the window size. I basically need a set window size that steps by "# of days in the year it's looking at", which is impossible with every solution I've found for this issue so far.
I also have a similar need with the following context (adapted from a real and professional need):
As far as I understand, the dataframe.rolling() API allows me to specify the 365-day duration, but not the need to skip 30 days of values (which is a non-constant number of rows) before computing the next mean over another selection of 365 days of values. Obviously, the resulting dataframe I expect will have a (much) smaller number of rows than the initial 'dog events' dataframe.
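(Added for illustration, not from the original comment: a rough sketch of that computation, under the assumption that the data is a DataFrame with a DatetimeIndex and a numeric column named value.)

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the 'dog events' data: timestamps with a numeric 'value' column
idx = pd.date_range('2015-01-01', '2020-01-01', freq='7D')
df = pd.DataFrame({'value': np.random.randn(len(idx))}, index=idx)

# One 365-day mean every 30 days, regardless of how many rows fall in each span
starts = pd.date_range(df.index.min(), df.index.max(), freq='30D')
means = pd.Series(
    [df.loc[start:start + pd.Timedelta(days=365), 'value'].mean() for start in starts],
    index=starts,
)
print(means.head())
```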
Just to gain more clarity about this request, here is a simple example. If we have this Series:
and we have a window size of
Likewise if we have this time based Series
and we have a window size of
@mroeschke wrt the first example ([3]), the results are not what I would expect. I assume this is a trailing window (e.g., at index=0 it would be the max of elements at -1 and 0, so just max([0])); then it should step forward "1" index, to index=0+step=1, and the next computation would be max([0,1]), then max([1,2]), etc. What it looks like you meant to have was a step size of two, so you would move from index=0 to index=0+2=2 (skipping index 1), and continuing like that. In this case it's almost correct, but there should be no NaNs.

While it may be "only" double the size in this case, in other cases it is substantial. For example, I have about an hour's worth of 500Hz ECG data for a patient; that's 1.8 million samples. If I wanted a 5-minute moving average every two minutes, that would be an array of 1.8 million elements with 30 valid computations and slightly less than 1.8 million NaNs. :-)

For indexing, step size = 1 is the current behavior, i.e., compute the feature of interest using data in the window, shift the window by one, then repeat. In this example, I want to compute the feature of interest using the data in the window, then shift by 60,000 indices, then repeat.

Similar remarks for the time-based case. Here there might be some disagreement as to the correct way to implement this type of window, but in my opinion the "best"(TM) way is to start from time t0, find all elements in the range (t0-window, t0], compute the feature, then move by the step size. Throw away any windows that have fewer than the minimum number of elements (this can be configurable, defaulting to 1). That example is for a trailing window, but you can modify it to fit any window configuration. This has the disadvantage of wasting time in large gaps, but gaps can be handled intelligently, and even if you compute the naive way (because you're lazy like me) I've yet to see this matter in practice, since the gaps are usually not large enough to matter in real data. YMMV.

Maybe that's clearer? Take a look at my example + code above, that might explain it better.
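(A hedged sketch of the naive trailing-window scheme described above, not production code: window and step are Timedeltas, and windows with fewer than min_points samples are dropped.)

```python
import numpy as np
import pandas as pd

def trailing_windows(s, window, step, func, min_points=1):
    # Evaluate func over (t - window, t], advance t by step, skip windows with too few points
    times, values = [], []
    t = s.index[0]
    while t <= s.index[-1]:
        chunk = s[(s.index > t - window) & (s.index <= t)]
        if len(chunk) >= min_points:
            times.append(t)
            values.append(func(chunk.values))
        t = t + step
    return pd.Series(values, index=times)

s = pd.Series(np.arange(600.0), index=pd.date_range('2020-01-01', periods=600, freq='s'))
print(trailing_windows(s, pd.Timedelta('5min'), pd.Timedelta('2min'), np.mean).head())
```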
Thanks for the clarification @anthonytw. Indeed, it looks like I needed to interpret the step differently.

As for the NaNs, I understand the sentiment to drop the NaNs in the output result automatically, but as mentioned in #15354 (comment) by @jreback, there is an API consistency consideration to have the output have the same length as the input. There may be users who would like to keep the NaNs as well (maybe?), and
@mroeschke I think exceptions should be made. So long as you put an explicit note in the documentation, and the behavior is not the default, no one will be adversely affected by not returning a vector full of junk.

Keeping NaNs defeats half the purpose. One objective is to limit the number of times we perform an expensive computation. The other objective is to keep the feature set down to something manageable. That example I gave you is a real one, and not nearly as much data as one really has to process in a patient monitoring application. Is it really necessary to allocate 60,000x the necessary space, then search through the array to delete NaNs? For each feature we want to compute? Note that one computation might produce an array of values. What do I want to do with an ECG waveform? Well, compute the power spectrum, of course! So I would need to allocate enough space for one full PSD vector (150,000 elements) 1.8 million times (2TB of data), then filter through it to get the pieces I care about (34MB). For all the series. For all the patients. I guess I need to buy more RAM!

It's also worth mentioning that NaN, for some features, might be a meaningful output, in which case I can no longer tell the difference between a meaningful NaN and the junk NaNs padding the data.

While I understand the desire to maintain the API, this is not a feature that will break any existing code (because it's a new feature that didn't exist before), and given the functionality there is no reason anyone would expect it to yield an output of the same size. And even if they did, a note in the documentation for the step size would be sufficient. The disadvantages far outweigh any benefit of having a "consistent" API (for a feature that didn't previously exist, mind you). Not proceeding this way will cripple the feature; it's almost not even worth implementing in that case (in my experience the space cost is almost always the bigger factor).
Here is how I tried it:

# for dataframe with datetime index
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

def rolling_apply_with_strides(
    df, window_size=15, strides=5, unit="s", functions=[np.mean]
):
    def resample_apply(i):
        resampled = df.resample(
            f"{window_size}{unit}",
            label="left",
            offset=f"{i}{unit}",
        ).agg(functions)
        resampled.index = resampled.index + to_offset(f"{window_size-1}{unit}")
        return resampled

    res = pd.concat(
        [
            resample_apply(i)
            for i in range(0, window_size, strides)
        ]
    ).sort_index()
    return res

Example:

res = rolling_apply_with_strides(df, window_size=4, strides=2, unit="s", functions=[np.std, np.mean])

# for dataframe without datetime index
def rolling_apply_with_strides(
    df, window_size=15, strides=5, functions=[np.mean]
):
    def group_apply(i):
        tmp_df = df.groupby(tmp_index.shift(i)).agg(functions)
        new_index = df.index[window_size + i - 1::window_size]
        tmp_df = tmp_df.iloc[:new_index.shape[0], :]
        tmp_df.index = new_index
        return tmp_df

    tmp_index = pd.Series(np.arange(df.shape[0]))
    tmp_index = tmp_index // window_size
    res = pd.concat(
        [
            group_apply(i) for i in range(0, window_size, strides)
        ]
    ).sort_index()
    return res

Example:

res = rolling_apply_with_strides(df, window_size=4, strides=2, functions=[np.std, np.mean])

Hi @AlexS12, here is code for a non-datetime-indexed dataframe.
I'm working on a PR for this, but I have some questions about the further steps :D (hehe, rolling on the floor with steps)..
Why is this issue closed? Yes, a step parameter was added, but it doesn't seem to work in the use cases that were used here to argue for the feature? E.g. if I do
Then pandas 2.2.1 throws an error. But wasn't the whole point to use this with frequency windows? I also note that step has to be an integer, which does not generally make sense for time series data. A pd.Timedelta step should be allowed.
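(For reference, and not part of the original comment: a minimal example of the integer-step form that the released API does accept on fixed-size windows, as of pandas 1.5+ to the best of my understanding.)

```python
import pandas as pd

s = pd.Series(range(10))
# Window of 3 rows, evaluated every 2 rows; the result has fewer rows than the input
print(s.rolling(window=3, step=2).sum())
```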
Just a suggestion - extend rolling to support a rolling window with a step size, such as R's rollapply(by=X).

Code Sample
Pandas - inefficient solution (apply function to every window, then slice to get every second result)
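(The original code sample is not preserved in this copy; a hedged reconstruction of the inefficient approach being described might look like this.)

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(1000))

# Apply the function to every window, then slice to keep every second result
result = s.rolling(window=10).max()[::2]
```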
Suggestion:
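(Again a hedged reconstruction rather than the original snippet, reusing s from the sketch above: the kind of API the issue proposes, mirroring R's rollapply(by=...).)

```python
# Apply the function only every `step`-th window
result = s.rolling(window=10, step=2).max()
```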
Inspired by R (see rollapply docs):