ENH: optimized Groupby.diff() #33658

Closed
@dequadras

Description

Is your feature request related to a problem?

Doing groupby().diff() on a big dataset with many groups is quite slow. The benchmark below shows that in certain cases optimizing it with numba can give roughly a 1000x speedup.

Describe the solution you'd like

My question is: can this be optimized in pandas itself?
I realise the case is somewhat special, but I regularly have to work with many small groups and I'm running into speed issues.
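
For comparison, the boundary-masking idea the numba kernel below relies on can also be written in vectorized pandas/numpy. This is only a sketch (group_diff_vectorized is a name I made up, not a pandas API), and it assumes the frame is already sorted by "groups":

import numpy as np
import pandas as pd

def group_diff_vectorized(df: pd.DataFrame, lag: int = 1) -> np.ndarray:
    # Equivalent of df.groupby("groups")["values"].diff(lag) when the frame
    # is already sorted by "groups": a plain row-wise diff is correct except
    # where the row `lag` positions back belongs to a different group.
    diffed = df["values"].diff(lag).to_numpy()
    same_group = df["groups"].eq(df["groups"].shift(lag)).to_numpy()
    return np.where(same_group, diffed, np.nan)

This avoids the per-group overhead of groupby().diff() entirely, at the cost of requiring the sort up front.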

API breaking implications

None expected: this is purely a performance optimization, so groupby().diff() should keep returning the same results, only faster.

Describe alternatives you've considered

Computing the diff outside pandas with a numba kernel, as in the code below; it works, but it would be better if pandas were fast out of the box.

Additional context

Here's the Python code in text format:

import numpy as np
import pandas as pd
from numba import njit

# create dataframe with many groups
GROUPS = 100000
SIZE = 1000000
df = pd.DataFrame()
df["groups"]=np.random.choice(np.arange(GROUPS), size=SIZE)
df["values"] = np.random.random(size=SIZE)
df.sort_values("groups", inplace=True)

diff_pandas = df.groupby("groups")["values"].diff().values

@njit
def group_diff(groups: np.ndarray, values: np.ndarray, lag: int) -> np.ndarray:
    # assumes rows are sorted by group, so rows of one group are contiguous
    result = np.empty_like(values, dtype=np.float64)
    for i in range(values.shape[0]):
        # the first `lag` rows have no predecessor; the explicit i < lag
        # check also stops negative indices from wrapping around the array
        if i < lag or groups[i] != groups[i - lag]:
            result[i] = np.nan
        else:
            result[i] = values[i] - values[i - lag]
    return result

groups = df.groupby("groups").ngroup().values  # integer label per group
values = df["values"].values
diff_numba = group_diff(groups, values, 1)

# check that both implementations agree (NaNs in the same positions)
assert np.isclose(diff_pandas, diff_numba, equal_nan=True).all()
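
For anyone reproducing the comparison, a minimal timing harness along these lines can be appended to the script (numbers will vary with GROUPS, SIZE and hardware; the ~1000x figure mentioned above is what I saw in my runs):

import timeit

# warm up the JIT so compilation time is not measured
group_diff(groups, values, 1)

t_pandas = timeit.timeit(lambda: df.groupby("groups")["values"].diff(), number=5) / 5
t_numba = timeit.timeit(lambda: group_diff(groups, values, 1), number=5) / 5
print(f"pandas: {t_pandas:.4f}s, numba: {t_numba:.4f}s, speedup: {t_pandas / t_numba:.0f}x")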
