Description
import timeit
import numpy as np
from pandas import DataFrame
from xray import Dataset, DataArray
df = DataFrame({"a": np.r_[np.arange(500.), np.arange(500.)],
                "b": np.arange(1000.)})
print(timeit.repeat('df.groupby("a").agg("mean")', globals={"df": df}, number=10))
print(timeit.repeat('df.groupby("a").agg(np.mean)', globals={"df": df, "np": np}, number=10))
ds = Dataset({"a": DataArray(np.r_[np.arange(500.), np.arange(500.)]),
              "b": DataArray(np.arange(1000.))})
print(timeit.repeat('ds.groupby("a").mean()', globals={"ds": ds}, number=10))
This outputs
[0.010462284000823274, 0.009770361997652799, 0.01081446700845845]
[0.02622630601399578, 0.024328112005605362, 0.018717073995503597]
[2.2804569930012804, 2.1666158599982737, 2.2688316510029836]
i.e. xray's groupby is ~100 times slower than pandas' groupby with np.mean (and ~200 times slower than
passing "mean" to pandas' groupby, which I assume involves some specialization).
(This is the actual order of magnitude of the data size and redundancy I want to handle, i.e. thousands of points with very limited duplication.)
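
For reference, the slowdown factors quoted above can be rederived from the pasted timings. This is only a sketch: the variable names are mine, and taking the minimum of each run is an assumption (the timeit documentation suggests the minimum as the least noisy figure), not something the original measurement prescribes.

```python
# Timings copied verbatim from the output above (seconds per 10 calls).
pandas_str_mean = [0.010462284000823274, 0.009770361997652799, 0.01081446700845845]
pandas_np_mean = [0.02622630601399578, 0.024328112005605362, 0.018717073995503597]
xray_mean = [2.2804569930012804, 2.1666158599982737, 2.2688316510029836]

# Assumption: compare the best (minimum) run of each benchmark.
best_pandas_str = min(pandas_str_mean)
best_pandas_np = min(pandas_np_mean)
best_xray = min(xray_mean)

# How many times slower xray's groupby is than each pandas variant.
ratio_vs_np_mean = best_xray / best_pandas_np
ratio_vs_str_mean = best_xray / best_pandas_str

print(f"xray vs pandas agg(np.mean): {ratio_vs_np_mean:.0f}x slower")
print(f'xray vs pandas agg("mean"):  {ratio_vs_str_mean:.0f}x slower')
```

Both ratios land in the same ballpark as the ~100x and ~200x figures above; the exact values shift a little depending on whether the minimum or the mean of each run is used.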