Group by on two columns and transform very slow #2485

Closed
davidjhp opened this issue Nov 28, 2020 · 2 comments
Labels
bug 🦗 Something isn't working

davidjhp commented Nov 28, 2020

Centos-release-7-5.1804.5.el7.centos.x86_64
Modin 0.8.1.1
Python 3.8.3 (default, Jul 2 2020, 16:21:59) [GCC 7.3.0] :: Anaconda, Inc. on linux

The following takes a few seconds in pandas, but in Modin it takes hours.

#import pandas as pd
import modin.pandas as pd
import time

data = [
    [1, 1, 1],
    [1, 1, 2],
    [1, 2, 3],
    [1, 2, 4],
    [2, 1, 10],
    [2, 1, 20],
    [2, 2, 30],
    [2, 2, 40],
]

df = pd.DataFrame(data, columns = ['card_id', 'day', 'amount'])
df = pd.concat([df for _ in range(1000000)])
start = time.time()
df['count'] = df.groupby(['card_id', 'day'])["amount"].transform('count')
df['sum'] = df.groupby(['card_id', 'day'])["amount"].transform('sum')
df['std'] = df.groupby(['card_id', 'day'])["amount"].transform('std')
df['min'] = df.groupby(['card_id', 'day'])["amount"].transform('min')
df['max'] = df.groupby(['card_id', 'day'])["amount"].transform('max')
end = time.time()
print("{0} seconds".format((end - start)))

YarShev (Collaborator) commented Dec 1, 2020

Hi @davidjhp, thanks for posting! Please take a look at this reply. I'm curious why you are creating df this way. Is that really necessary?

@devin-petersohn (Collaborator)

This seems like a benchmarking attempt, but @davidjhp might be missing some key understanding of how Modin works.

In general, this type of dataframe creation is extremely unusual. Does the same issue occur if you write the data out to a CSV file first and read it back in?
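A minimal sketch of what that check could look like, assuming the same 8-row pattern from the report. The helper names `write_repro_csv` and `time_transforms` are hypothetical, not part of pandas or Modin. The file is written with plain pandas, and the module used to read it is passed in, so the same function can time either library.

```python
import time

import pandas  # plain pandas writes the file and serves as the baseline


def write_repro_csv(path, repeats):
    """Write the 8-row pattern from the report `repeats` times to a CSV file."""
    data = [
        [1, 1, 1], [1, 1, 2], [1, 2, 3], [1, 2, 4],
        [2, 1, 10], [2, 1, 20], [2, 2, 30], [2, 2, 40],
    ]
    base = pandas.DataFrame(data, columns=["card_id", "day", "amount"])
    # ignore_index=True gives a fresh 0..n-1 index instead of repeating 0..7
    big = pandas.concat([base] * repeats, ignore_index=True)
    big.to_csv(path, index=False)


def time_transforms(pd_module, path):
    """Read the CSV with `pd_module` and time the five transforms from the report.

    `pd_module` is expected to be either `pandas` or `modin.pandas`.
    """
    df = pd_module.read_csv(path)
    start = time.time()
    for name in ["count", "sum", "std", "min", "max"]:
        df[name] = df.groupby(["card_id", "day"])["amount"].transform(name)
    return df, time.time() - start
```

To compare, one could call `write_repro_csv("repro.csv", 1000000)`, then `time_transforms(pandas, "repro.csv")`, and finally `time_transforms(modin.pandas, "repro.csv")` after `import modin.pandas`. One difference from the original snippet: concatenating without `ignore_index=True` leaves the index repeating 0 through 7, whereas reading from CSV yields a default index. Whether that accounts for the slowdown is not established in this thread.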

davidjhp closed this as completed Dec 4, 2020