Group by on two columns and transform very slow #2485

davidjhp · 2020-11-28T19:37:20Z

Centos-release-7-5.1804.5.el7.centos.x86_64
Modin 0.8.1.1
Python 3.8.3 (default, Jul 2 2020, 16:21:59) [GCC 7.3.0] :: Anaconda, Inc. on linux

The following takes a few seconds in Pandas but hours in Modin.

# import pandas as pd  # uncomment (and comment out the next line) to time plain pandas
import modin.pandas as pd
import time

data = [
    [1, 1, 1],
    [1, 1, 2],
    [1, 2, 3],
    [1, 2, 4],
    [2, 1, 10],
    [2, 1, 20],
    [2, 2, 30],
    [2, 2, 40],
]

df = pd.DataFrame(data, columns=['card_id', 'day', 'amount'])
# Repeat the 8-row frame 1,000,000 times (8,000,000 rows in total).
df = pd.concat([df for _ in range(1000000)])
start = time.time()
# Each statistic below is a separate groupby on the same two key columns.
df['count'] = df.groupby(['card_id', 'day'])["amount"].transform('count')
df['sum'] = df.groupby(['card_id', 'day'])["amount"].transform('sum')
df['std'] = df.groupby(['card_id', 'day'])["amount"].transform('std')
df['min'] = df.groupby(['card_id', 'day'])["amount"].transform('min')
df['max'] = df.groupby(['card_id', 'day'])["amount"].transform('max')
end = time.time()
print("{0} seconds".format(end - start))
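For comparison, the same five per-group statistics can also be computed in one groupby pass and joined back onto the rows. This is only a sketch in plain pandas, shown on the 8-row sample. The helper name `add_group_stats` is made up for illustration, and whether this form behaves any better in Modin 0.8.1.1 has not been checked here.

```python
import pandas as pd


def add_group_stats(df, keys, value):
    """Attach per-group count/sum/std/min/max of `value` as new columns.

    Hypothetical helper for illustration; not part of pandas or Modin.
    """
    # One groupby pass computes all five statistics per (keys) group.
    stats = (
        df.groupby(keys)[value]
        .agg(["count", "sum", "std", "min", "max"])
        .reset_index()
    )
    # A left merge on the key columns copies each group's statistics onto
    # every row of that group, which is what transform does.
    # Note: merge returns a frame with a fresh RangeIndex.
    return df.merge(stats, on=keys, how="left")


base = pd.DataFrame(
    [
        [1, 1, 1], [1, 1, 2], [1, 2, 3], [1, 2, 4],
        [2, 1, 10], [2, 1, 20], [2, 2, 30], [2, 2, 40],
    ],
    columns=["card_id", "day", "amount"],
)
result = add_group_stats(base, ["card_id", "day"], "amount")
print(result)
```

The merge assumes the frame has no existing columns named `count`, `sum`, `std`, `min` or `max`; otherwise pandas would add suffixes to the clashing names.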


YarShev · 2020-12-01T06:57:19Z

Hi @davidjhp, thanks for posting! Please look at this reply. I'm curious why you are creating df with this logic. Is that really necessary?

devin-petersohn · 2020-12-01T14:44:23Z

This seems like a benchmarking attempt, but @davidjhp might be missing some key understanding of how Modin works.

In general, this type of dataframe creation is extremely unusual. Does the same issue occur if you write the data out to a CSV file first and read it back?
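One way to try the CSV route might look like the sketch below. It builds the repeated data with plain pandas, writes it to a file, and reads it back through a reader passed as a parameter, so that `modin.pandas.read_csv` can be swapped in. The helper name `roundtrip_through_csv`, the file name, and the reduced repeat count of 1000 are assumptions made for illustration; the original report repeats the frame 1,000,000 times.

```python
import os
import tempfile

import pandas  # plain pandas, used here only to build and write the file


def roundtrip_through_csv(frame, path, read_csv=pandas.read_csv):
    """Write `frame` to `path` as CSV and load it again with `read_csv`.

    Hypothetical helper for illustration. Pass modin.pandas.read_csv as
    `read_csv` to have Modin load the file itself.
    """
    frame.to_csv(path, index=False)
    return read_csv(path)


base = pandas.DataFrame(
    [
        [1, 1, 1], [1, 1, 2], [1, 2, 3], [1, 2, 4],
        [2, 1, 10], [2, 1, 20], [2, 2, 30], [2, 2, 40],
    ],
    columns=["card_id", "day", "amount"],
)
# Repeat count reduced for a quick check; raise it to match the report.
repeated = pandas.concat([base] * 1000, ignore_index=True)

csv_path = os.path.join(tempfile.mkdtemp(), "transactions.csv")
df = roundtrip_through_csv(repeated, csv_path)
# With Modin installed, the reader could be swapped in like this:
#   import modin.pandas as mpd
#   df = roundtrip_through_csv(repeated, csv_path, read_csv=mpd.read_csv)

df["count"] = df.groupby(["card_id", "day"])["amount"].transform("count")
```

Timing the `transform` calls on a frame loaded this way, against the same calls on the frame built with `pd.concat`, would show whether the slowdown depends on how the dataframe was created.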

davidjhp added the bug 🦗 Something isn't working label Nov 28, 2020

dchigarev mentioned this issue Nov 30, 2020 in FEAT-#2375: implementation of multi-column groupby aggregation #2461 (merged)

davidjhp closed this as completed Dec 4, 2020
