Description
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
# Series of bools
s = pd.Series(np.random.randint(0, 2, 100000)).astype(bool)
# ~1.45 ms
%timeit s.any(skipna=True)
# ~1.35 ms
%timeit s.any(skipna=False)
# ~6.5 us - %timeit warns that an intermediate result may be cached, but
# even after scaling by the worst-case multiplier from that warning, this is
# still an order of magnitude faster than s.any()
%timeit s.values.any()
# Series of ints
s2 = pd.Series(np.random.randint(0, 2, 100000))
# ~330 us
%timeit s2.any(skipna=True)
# ~280 us
%timeit s2.any(skipna=False)
# ~90 us - No possible caching warning on this one
%timeit s2.values.any()
Problem description
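Calling Series.any is much slower than calling Series.values.any on a series of bools (about 1.4 ms versus about 6.5 us in the timings above).
Interestingly, calling Series.any on a series of ints is roughly four times faster than on a series of bools (about 330 us versus about 1.45 ms), though even for a series of ints, Series.values.any is still roughly three to four times faster (about 90 us).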
I ran with both skipna=True and skipna=False in case the difference came from how NaNs are handled; the timings barely change between the two.
I see the same time differences with Series.all.
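For anyone who wants to reproduce this outside IPython, here is a rough sketch of the same comparison using the standard-library `timeit` module. The helper name `time_call`, the fixed seed, and the repeat counts are my own choices for illustration and are not part of the original measurements; absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def time_call(func, number=100):
    """Return the best average time per call, in seconds, over a few repeats."""
    return min(timeit.repeat(func, number=number, repeat=3)) / number


rng = np.random.RandomState(0)  # fixed seed (assumption) so runs are comparable
bool_series = pd.Series(rng.randint(0, 2, 100000)).astype(bool)
int_series = pd.Series(rng.randint(0, 2, 100000))

timings = {}
for label, series in [("bool", bool_series), ("int", int_series)]:
    for method in ("any", "all"):
        # Series-level reduction versus the same reduction on the NumPy array.
        timings[(label, method, "Series")] = time_call(getattr(series, method))
        timings[(label, method, "ndarray")] = time_call(getattr(series.values, method))

for (label, method, kind), seconds in sorted(timings.items()):
    print("{:4s} {:3s} {:7s} {:10.1f} us".format(label, method, kind, seconds * 1e6))
```

Taking the minimum over several repeats is meant to reduce noise from other processes; it does not remove any caching effect that `%timeit` warned about.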
Expected Output
I would expect the performance to be comparable. Maybe not exactly the same, but not order(s) of magnitude slower.
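Until then, a possible workaround is to go to the NumPy array directly when the dtype cannot hold missing values. This is only a minimal sketch: `fast_any` is a hypothetical helper name, not a pandas API, and it assumes that NumPy bool and integer dtypes never contain NaN, so `skipna` cannot change the result for them.

```python
import numpy as np
import pandas as pd


def fast_any(series):
    """Hypothetical helper: use NumPy directly when no values can be missing."""
    dtype = series.dtype
    # NumPy bool and integer arrays cannot hold NaN, so skipna is irrelevant
    # and the ndarray reduction gives the same answer as Series.any().
    if isinstance(dtype, np.dtype) and dtype.kind in "biu":
        return bool(series.values.any())
    # Anything else (floats, objects, extension dtypes) keeps pandas semantics.
    return bool(series.any())
```

Floats and object columns are deliberately left on the pandas path, since NumPy treats NaN as truthy and would give a different answer from `Series.any(skipna=True)`.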
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.3
pytest: 3.3.0
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.1
scipy: 0.18.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.1
feather: None
matplotlib: 1.5.1
openpyxl: 2.5.6
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.9999999
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: 0.0.8
fastparquet: None
pandas_gbq: None
pandas_datareader: None