PERF: Arrow dtypes are much slower than Numpy for DataFrame.apply #61747

Open
@ehsantn

Description

The same DataFrame.apply call with axis=1 is more than 4x slower when the columns use Arrow dtypes than when they use NumPy dtypes (16.66 seconds versus 3.21 seconds in the example below).

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
import pyarrow as pa
import time

NUM_ROWS = 500_000
df = pd.DataFrame({"A": np.arange(NUM_ROWS) % 30, "B": np.arange(NUM_ROWS)+1.0})
print(df.dtypes)
df2 = df.astype({"A": pd.ArrowDtype(pa.int64()), "B": pd.ArrowDtype(pa.float64())})
print(df2.dtypes)

t0 = time.time()
df.apply(lambda r: 0 if r.A == 0 else (r.B // r.A), axis=1)
print(f"Non-Arrow time: {time.time() - t0:.2f} seconds")

t0 = time.time()
df2.apply(lambda r: 0 if r.A == 0 else (r.B // r.A), axis=1)
print(f"Arrow time: {time.time() - t0:.2f} seconds")

Output with pandas 2.3.0 on a local M1 Mac (also tested on the main branch):

A      int64
B    float64
dtype: object
A     int64[pyarrow]
B    double[pyarrow]
dtype: object
Non-Arrow time: 3.21 seconds
Arrow time: 16.66 seconds

Installed Versions

INSTALLED VERSIONS

commit : 2cc3762
python : 3.13.5
python-bits : 64
OS : Darwin
OS-release : 24.3.0
Version : Darwin Kernel Version 24.3.0: Thu Jan 2 20:24:16 PST 2025; root:xnu-11215.81.4~3/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.3.0
numpy : 2.2.6
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 25.1.1
Cython : 3.1.2
sphinx : None
IPython : 9.3.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.4
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.5.1
html5lib : None
hypothesis : None
gcsfs : 2025.5.1
jinja2 : None
lxml.etree : None
matplotlib : 3.10.3
numba : 0.61.2
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 19.0.0
pyreadstat : None
pytest : 8.4.1
python-calamine : None
pyxlsb : None
s3fs : 2025.5.1
scipy : 1.15.2
sqlalchemy : 2.0.41
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlsxwriter : 3.2.5
zstandard : 0.23.0
tzdata : 2025.2
qtpy : None
pyqt5 : None

Prior Performance

No response

Metadata

Assignees

No one assigned

    Labels

      • Apply: Apply, Aggregate, Transform, Map
      • Arrow: pyarrow functionality
      • Needs Triage: Issue that has not been reviewed by a pandas team member
      • Performance: Memory or execution speed performance
