[Python] Massive performance deterioration with pandas 2.1.1 vs. 1.5.3 when calling pa.Table.from_pandas() #38260

Closed
@MMCMA

Describe the bug, including details regarding any error messages, version, and platform.

We experience a massive drop in performance with pandas 2.1.1 compared to pandas 1.5.3 when invoking pa.Table.from_pandas().
In this example, the conversion time increased from roughly 2.9 seconds to 16.2 seconds. In our data application the problem is even more dramatic since the dataframe is larger; the slowdown seems very sensitive to the number of columns. Doubling the number of columns (num_cols=20000 vs. num_cols=40000) yields roughly 4x the compute time. With pandas 1.5.3 the compute time scales closer to linearly with the number of columns. Not sure if this should also be raised with pandas.

import pyarrow as pa
import pandas as pd
import numpy as np
import timeit

num_cols = 20000
num_dates = 8800
dates = pd.date_range(start='19900101', freq='b', periods=num_dates)
data = np.random.randint(low=0, high=10, size=(num_dates, num_cols))
df = pd.DataFrame(data, index=dates)

tic = timeit.default_timer()
pa.Table.from_pandas(df, preserve_index=True)
total_time = timeit.default_timer() - tic
print(f'Conversion from pandas to pyarrow took {total_time} seconds')
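
To check the "2x columns gives roughly 4x time" observation more systematically, here is a minimal sketch that times the conversion at two column counts and estimates the exponent k in t ~ num_cols**k (k near 1 would be linear, k near 2 quadratic). The helper names time_call, scaling_exponent, and run_benchmark are hypothetical, and the default column counts are smaller than in the report so that it runs quickly; they are assumptions, not the sizes we measured.

```python
import math
import timeit


def time_call(func, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single calls."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


def scaling_exponent(n_small, t_small, n_large, t_large):
    """Estimate k in t ~ n**k from two (size, time) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


def run_benchmark(col_counts=(2000, 4000), num_dates=8800):
    """Time pa.Table.from_pandas() at two column counts (illustrative sizes)."""
    # Imported here so the helpers above work without these packages installed.
    import numpy as np
    import pandas as pd
    import pyarrow as pa

    dates = pd.date_range(start="19900101", freq="B", periods=num_dates)
    timings = {}
    for num_cols in col_counts:
        data = np.random.randint(low=0, high=10, size=(num_dates, num_cols))
        df = pd.DataFrame(data, index=dates)
        timings[num_cols] = time_call(
            lambda: pa.Table.from_pandas(df, preserve_index=True)
        )
        print(f"{num_cols} columns: {timings[num_cols]:.2f} s")

    small, large = col_counts
    k = scaling_exponent(small, timings[small], large, timings[large])
    print(f"Estimated scaling exponent: {k:.2f}")
    return k
```

Calling run_benchmark() under each pandas version and comparing the printed exponents should show whether the conversion scales differently between the two versions.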

Component(s)

Python
