[BUG] Rolling window aggregations are very slow with large windows #15119
With large windows, the .rolling() function in cuDF can be pathologically slow:
```
In [6]: dt = cudf.date_range("2001-01-01", "2002-01-01", freq="1s")

In [7]: df = cudf.DataFrame({"x": np.random.rand(len(dt))}, index=dt)

In [8]: %time df.rolling("1D").sum()
CPU times: user 10.3 s, sys: 57.1 ms, total: 10.3 s
Wall time: 10.4 s
Out[8]:
                                x
2001-01-01 00:00:00      0.815418
2001-01-01 00:00:01      1.238151
2001-01-01 00:00:02      1.811390
2001-01-01 00:00:03      2.065794
2001-01-01 00:00:04      2.195230
...                           ...
2001-12-31 23:59:55  43308.909704
2001-12-31 23:59:56  43309.098228
2001-12-31 23:59:57  43308.658888
2001-12-31 23:59:58  43308.790256
2001-12-31 23:59:59  43308.915838

[31536000 rows x 1 columns]
```
Why is it slow?
Of the 10s of execution time above, about 8s is spent computing the window sizes, which is done in a hand-rolled numba CUDA kernel:

cudf/python/cudf/cudf/utils/cudautils.py, line 17 in 6f6e521

(Profiling also attributes some time to column.full) - but that's a red herring I think, because there's no synchronization after the numba kernel call, so the kernel's execution time gets charged to whatever synchronizes next.
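Roughly speaking, the kernel assigns one thread per output row and scans linearly for the window boundary. The following is a minimal illustrative sketch of that approach (the name and launch details are made up for illustration; this is not the actual cudautils.py code):

```python
import numpy as np
from numba import cuda

@cuda.jit
def window_sizes_linear(index, offset, out):
    # One thread per row: walk backwards until the previous timestamp
    # falls outside the (right-closed) window. With n rows and windows
    # spanning up to w rows, this is O(n * w) work in total, which is
    # why a one-day window over one-second data is so expensive.
    i = cuda.grid(1)
    if i < index.shape[0]:
        j = i
        while j > 0 and index[i] - index[j - 1] < offset:
            j -= 1
        out[i] = i - j + 1

# Example launch on int64 nanosecond timestamps with a one-day window:
ts = cuda.to_device(np.arange(100_000, dtype=np.int64) * 10**9)
out = cuda.device_array(100_000, dtype=np.int32)
window_sizes_linear.forall(100_000)(ts, np.int64(86_400 * 10**9), out)
```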
What can we do about it?
I see a couple of options here:
- I wonder if there's a better way to write that kernel (see the sketch after this list). Currently, it naively launches one thread per element and does a linear search for the next element that would exceed the window bounds.
- We could make it libcudf's responsibility to compute the window sizes. I believe it already does window-size computation in the context of grouped rolling window aggregations: see grouped_range_rolling_window().
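On the first option: because the index is sorted, the start of each window can be found with a vectorized binary search rather than a per-thread linear scan, turning the O(n * w) scans into roughly O(n log n) work. A minimal sketch of that idea, assuming int64 nanosecond timestamps and a right-closed window (CuPy is used purely for illustration, and the helper name is made up):

```python
import cupy as cp

def window_sizes_searchsorted(ts_ns, offset_ns):
    # Row i's window covers rows j with ts[i] - offset < ts[j] <= ts[i].
    # Because ts_ns is sorted, the first such j for every row is found
    # with a single vectorized binary search instead of per-thread scans.
    starts = cp.searchsorted(ts_ns, ts_ns - offset_ns, side="right")
    return cp.arange(ts_ns.size) - starts + 1

# Example: one-second timestamps, one-day window.
ts = cp.arange(100_000, dtype=cp.int64) * 10**9
sizes = window_sizes_searchsorted(ts, 86_400 * 10**9)
```

A libcudf-side implementation (the second option) could use the same formulation while keeping the work entirely off the Python side.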