BUG: pd.read_parquet drops indexes when mode.dtype_backend='pyarrow' #51717

Closed
@rachtsingh

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [1]: pdf = pd.DataFrame({"idx": [0, 1, 2], "A": [1, 2, 3]}).set_index('idx')
[PYFLYBY] import pandas as pd

In [2]: pdf
Out[2]:
     A
idx
0    1
1    2
2    3

In [3]: pdf.to_parquet("~/tmp/test.parquet")

In [4]: with pd.option_context("mode.dtype_backend", "pyarrow"):
   ...:     df = pd.read_parquet("~/tmp/test.parquet")
   ...:

In [5]: df
Out[5]:
   A  idx
0  1    0
1  2    1
2  3    2

In [6]: pd.__version__
Out[6]: '2.1.0.dev0+93.g6bb8f73e75'

Issue Description

Round-tripping a DataFrame with an index to parquet and back with the option mode.dtype_backend='pyarrow' set drops the index: it comes back as an ordinary column (idx in Out[5] above) and the frame gets a default integer index instead.

Expected Behavior

I think pd.read_parquet should read from the schema (in particular pandas_metadata['index_columns']) to set the index. I don't think this is a bug in a lower-level library like pyarrow, since I think (?) indexes are a pandas-specific abstraction. I assume there's some reason that setting the option skips pandas_compat._reconstruct_index, but I can't figure it out.

This is what I expect to see instead:

In [16]: df = df.set_index('idx')

In [17]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64[pyarrow]
dtypes: int64[pyarrow](1)
memory usage: 50.0 bytes

In [18]: df.index
Out[18]: Index([0, 1, 2], dtype='int64[pyarrow]', name='idx')

Thanks for the help!

Installed Versions

INSTALLED VERSIONS

commit : 6bb8f73
python : 3.10.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.90.1-microsoft-standard-WSL2
Version : #1 SMP Fri Jan 27 02:56:13 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+93.g6bb8f73e75
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.3
Cython : 0.29.32
pytest : 7.1.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.6
brotli : None
fastparquet : 0.8.3
fsspec : 2022.8.2
gcsfs : None
matplotlib : 3.5.3
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2022.8.2
scipy : 1.9.1
snappy : None
sqlalchemy : 1.4.41
tables : 3.8.0
tabulate : None
xarray : 2023.2.0
xlrd : 2.0.1
zstandard : 0.18.0
tzdata : None
qtpy : 2.3.0
pyqt5 : None

Metadata

    Labels

    Arrow (pyarrow functionality), Bug, Closing Candidate (may be closeable, needs more eyeballs), IO Parquet (parquet, feather)
