Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
The performance issue is the same as #52016. I assumed it might be caused by a missing dependency, but I installed accelerating dependencies such as numba and pyarrow, and the row-wise operation still takes more than 10 seconds. @rhshadrach
import pandas as pd
import numpy as np
shape = 250_000, 100
# note: randint's upper bound is exclusive, so randint(0, 1) yields all zeros (an all-False mask)
mask = pd.DataFrame(np.random.randint(0, 1, size=shape))
np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())
assert all(isinstance(dtype, pd.BooleanDtype) for dtype in pd_mask.dtypes)
assert all(isinstance(dtype, np.dtype) for dtype in np_mask.dtypes)
# column operations are not that much slower
%timeit pd_mask.any(axis=0)
%timeit np_mask.any(axis=0)
# using pandas.BooleanDtype back end for ROW operations is MUCH SLOWER
%timeit pd_mask.any(axis=1)
%timeit np_mask.any(axis=1)
13.1 ms ± 522 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.7 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.2 s ± 1.41 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.74 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
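For anyone hitting the same slowdown, a possible workaround is sketched below. It is only a sketch: it assumes missing values can be treated as False (which matches `any` with the default `skipna=True`), and the helper name `rowwise_any` is hypothetical, not part of pandas. The idea is to convert the `BooleanDtype` frame to a plain NumPy bool array once and do the row-wise reduction there.

```python
import numpy as np
import pandas as pd


def rowwise_any(df: pd.DataFrame) -> pd.Series:
    """Row-wise any() for a frame of pandas BooleanDtype columns.

    Hypothetical helper: converts to a dense NumPy bool array first,
    treating pd.NA as False (assumption; equivalent to skipna=True for any).
    """
    dense = df.to_numpy(dtype=bool, na_value=False)
    # Reduce across columns in NumPy, then restore the original row index.
    return pd.Series(dense.any(axis=1), index=df.index)


small = pd.DataFrame(
    {
        "a": pd.array([True, False, None], dtype="boolean"),
        "b": pd.array([False, False, None], dtype="boolean"),
    }
)
result = rowwise_any(small)
print(result.tolist())
```

The returned Series has NumPy `bool` dtype rather than `boolean`; call `.astype("boolean")` on it if the nullable dtype is needed downstream. For `all`, the analogous assumption would be `na_value=True`.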
Installed Versions
INSTALLED VERSIONS
commit : a671b5a
python : 3.9.18.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-88-generic
Version : #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.4
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : 3.0.8
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.12.2
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None
Prior Performance
No response