Group by on two columns and transform very slow #2485

@davidjhp

Centos-release-7-5.1804.5.el7.centos.x86_64
Modin 0.8.1.1
Python 3.8.3 (default, Jul 2 2020, 16:21:59) [GCC 7.3.0] :: Anaconda, Inc. on linux

The following takes a few seconds in pandas but takes hours in Modin.

# import pandas as pd  # use this import instead of the next line to time plain pandas
import modin.pandas as pd
import time

data = [
    [1, 1, 1],
    [1, 1, 2],
    [1, 2, 3],
    [1, 2, 4],
    [2, 1, 10],
    [2, 1, 20],
    [2, 2, 30],
    [2, 2, 40],
]

df = pd.DataFrame(data, columns = ['card_id', 'day', 'amount'])
df = pd.concat([df for _ in range(1000000)])
start = time.time()
df['count'] = df.groupby(['card_id', 'day'])["amount"].transform('count')
df['sum'] = df.groupby(['card_id', 'day'])["amount"].transform('sum')
df['std'] = df.groupby(['card_id', 'day'])["amount"].transform('std')
df['min'] = df.groupby(['card_id', 'day'])["amount"].transform('min')
df['max'] = df.groupby(['card_id', 'day'])["amount"].transform('max')
end = time.time()
print("{0} seconds".format(end - start))
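
For comparison, the same five per-group statistics can be computed with a single groupby pass followed by a join back onto the rows, instead of five separate `transform` calls. This is only a minimal sketch in plain pandas on the 8-row sample; `add_group_stats` is a hypothetical helper name, and whether this form avoids the slowdown in Modin 0.8.1.1 has not been tested.

```python
import pandas as pd

data = [
    [1, 1, 1],
    [1, 1, 2],
    [1, 2, 3],
    [1, 2, 4],
    [2, 1, 10],
    [2, 1, 20],
    [2, 2, 30],
    [2, 2, 40],
]
df = pd.DataFrame(data, columns=["card_id", "day", "amount"])


def add_group_stats(frame, keys, value, stats):
    """Hypothetical helper: aggregate once per group, then broadcast to rows."""
    # One groupby/agg pass yields one row per group and one column per statistic.
    per_group = frame.groupby(keys)[value].agg(stats)
    # A left join on the key columns copies each group's statistics onto its rows,
    # keeping the original row order and index of `frame`.
    return frame.join(per_group, on=keys)


result = add_group_stats(
    df, ["card_id", "day"], "amount", ["count", "sum", "std", "min", "max"]
)
print(result)
```

The resulting columns hold the same values as the `transform` version above; only the way they are computed differs.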

Labels: bug
