PERF: df.unstack() is 500 times slower since pandas>=2.1 #58391

Open
2 of 3 tasks
sbonz opened this issue Apr 23, 2024 · 10 comments · May be fixed by #58817
Labels: Performance (memory or execution speed performance), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@sbonz

sbonz commented Apr 23, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.random(size=(10000, 100)))
st = time.time()
df.unstack()  # this operation takes ~500x longer in pandas>=2.1
print(f"time {time.time() - st}")

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c1
python : 3.11.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 2.2.1
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

The same code as above is about 500x faster on pandas<=2.0.3.
The issue occurs on Windows and Linux, with Python 3.10 and 3.12, and with both the numpy and pyarrow backends.
The slowdown seems to be in the initial loop of the stack_v3 function.
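
For a steadier number than a single time.time() delta, the reproducer can be wrapped in a small helper that keeps the best of several runs. This is only a sketch: the helper name `best_unstack_time` and its smaller default shape are assumptions chosen here so it runs quickly, and they are not part of the original report.

```python
import time

import numpy as np
import pandas as pd


def best_unstack_time(n_rows=1000, n_cols=100, repeats=3):
    """Return the fastest wall-clock time (seconds) of DataFrame.unstack().

    Hypothetical helper for illustration only; the report itself uses a
    (10000, 100) frame and a single time.time() measurement.
    """
    df = pd.DataFrame(np.random.random(size=(n_rows, n_cols)))
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        df.unstack()
        timings.append(time.perf_counter() - start)
    return min(timings)


elapsed = best_unstack_time()
print(f"pandas {pd.__version__}: best of 3 = {elapsed:.4f} s")
```

Calling `best_unstack_time(n_rows=10000)` matches the shape used in the report; running it under different pandas versions should make the gap described above visible.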

@sbonz added the Needs Triage and Performance labels Apr 23, 2024
@jbrockmendel
Member

Cc @rhshadrach

@asishm
Contributor

asishm commented Apr 24, 2024

On main it's about 5x faster than on 2.2.2, but still extremely slow compared to 2.0.3:

2.0.3 -> 17 ms
2.2.2 -> 5.4 s
main -> 1.08 s

@sam-baumann
Contributor

take

@sam-baumann
Contributor

Looked into this. In the sample code from the original issue, the df being tested is just random values rather than the result of a stack(). It seems the performance issue only comes up when the df is not in the form expected by unstack(). @sbonz did you see this on real data?

The following code actually runs 2-3x faster on main than on 2.0.3 on my machine:

import pandas as pd
import numpy as np
import time
data = np.random.randint(0, 100, size=(100000, 1000))
df = pd.DataFrame(data=data).stack()

st = time.time()
df.unstack()
print(f"time {time.time() - st}")
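
Note that the two snippets in this thread call unstack() on different kinds of objects, which may be relevant when comparing their timings. The sketch below illustrates that difference on a tiny frame; the variable names and the (4, 3) shape are made up for this example.

```python
import numpy as np
import pandas as pd

# Small shape so the sketch runs instantly; the thread uses much larger frames.
frame = pd.DataFrame(np.random.random(size=(4, 3)))
stacked = frame.stack()  # Series indexed by (row, column)

# Original report: unstack() called directly on a DataFrame with a flat index.
from_frame = frame.unstack()

# Snippet above: unstack() called on the Series produced by stack().
from_series = stacked.unstack()

print(type(from_frame).__name__, from_frame.shape)
print(type(from_series).__name__, from_series.shape)
```

The first call flattens the frame into a Series keyed by (column, row), while the second pivots a Series back into a frame.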

@sbonz
Author

sbonz commented Apr 28, 2024

@sam-baumann yes, I noticed the slowdown because some tests (with real data) in our pipeline started timing out.

@sbonz
Author

sbonz commented May 15, 2024

@sam-baumann I was wondering if you have had any chance to look at this?

@sam-baumann
Contributor

Hi @sbonz - I did look a bit further into this - I think I may have to remove myself from this issue because I don't think I'm familiar enough with this part of the codebase to be of much more help here. Sorry!

@sam-baumann removed their assignment May 16, 2024
@mroeschke
Member

@rhshadrach would it make sense to carve out a fastpath in stack_v3 for a homogeneously typed DataFrame with unique columns to just do frame._values.ravel()?

@rhshadrach
Member

Yea - I think adding a fastpath makes sense. I'm going to make an attempt shortly.
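
To make the fastpath idea concrete, here is a rough sketch of what flattening a single-dtype frame directly could look like for the unstack() call in the report. It is not the code in pandas or in the linked pull request: the function name `unstack_homogeneous` is hypothetical, it uses the public to_numpy() instead of the private _values, and it assumes a flat (non-MultiIndex) index with unique column labels.

```python
import numpy as np
import pandas as pd


def unstack_homogeneous(frame):
    """Hypothetical fast path: flatten a single-dtype frame column by column.

    Assumes a non-MultiIndex index and unique column labels. Illustration
    only; this is not the pandas implementation.
    """
    values = frame.to_numpy()  # one 2-D array when all columns share a dtype
    flat = values.T.reshape(-1)  # column-by-column order, as unstack() returns
    index = pd.MultiIndex.from_product([frame.columns, frame.index])
    return pd.Series(flat, index=index)


df = pd.DataFrame(np.random.random(size=(5, 3)))
fast = unstack_homogeneous(df)
expected = df.unstack()

# Both checks compare the sketch against the regular unstack() result.
same_values = np.array_equal(fast.to_numpy(), expected.to_numpy())
same_index = fast.index.tolist() == expected.index.tolist()
print(same_values, same_index)
```

A real fastpath would also need to handle extension dtypes, duplicate labels, and index names, which this sketch deliberately ignores.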

@rhshadrach self-assigned this May 17, 2024
@rhshadrach added the Reshaping label and removed the Needs Triage label May 19, 2024
@jbrockmendel
Member

> just do frame._values.ravel()?

Just a note on this: consider arr.reshape(-1) since ravel can make a copy in some cases.
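
A small NumPy-only illustration of the difference mentioned here, using a strided 1-D view as the example input (the array names are invented for this sketch). On a non-contiguous array like this one, ravel() is expected to return a copy while reshape(-1) can still return a view; reshape(-1) may also copy in other layouts, so this shows one case rather than a general guarantee.

```python
import numpy as np

base = np.arange(10)
strided = base[::2]  # non-contiguous 1-D view of every other element

raveled = strided.ravel()       # ravel() returns a contiguous result
reshaped = strided.reshape(-1)  # reshape(-1) keeps the existing strides here

ravel_is_view = np.shares_memory(strided, raveled)
reshape_is_view = np.shares_memory(strided, reshaped)
print("ravel shares memory:  ", ravel_is_view)
print("reshape shares memory:", reshape_is_view)
```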

@rhshadrach linked a pull request Jun 2, 2024 that will close this issue
5 tasks
6 participants