BUG: Column of Lists becomes Column of ndarray after writing to feather or parquet formats #49623

Mostly-BSD · 2022-11-10T17:58:50Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import tempfile

# str_list is a column of lists of strings
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "str_list": [
            ["a", "aa", "aaa"],
            ["b", "bb", "bbb", "bbbb"],
            ["c", "cc"],
            ["d"],
            ["e", "ee", "eee"],
        ],
    }
)

# will output list
type(df["str_list"][0])

# Write to a feather file and read it back
feather_path = tempfile.gettempdir() + "/df.feather"
df.to_feather(feather_path)
df2 = pd.read_feather(feather_path)

# will output numpy.ndarray
type(df2["str_list"][0])

Issue Description

When a DataFrame with a column of lists is written to the feather or parquet format and then read back, the column of lists becomes a column of numpy.ndarray.

A column of lists is especially useful for NLP tasks where the corpus needs to be a list of lists of strings. Passing a column of ndarray breaks this and requires an intermediate step that converts each ndarray in the column back to a list.
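The intermediate conversion step mentioned above could look roughly like the sketch below. It does not touch feather or parquet; instead it builds the ndarray column by hand to stand in for what `read_feather` is reported to return, so the frame contents and the use of `Series.map` with `ndarray.tolist` are illustrative assumptions rather than part of the original report.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame read back from feather/parquet:
# each cell of str_list is a numpy.ndarray (assumed here, built by hand).
df2 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "str_list": [
            np.array(["a", "aa", "aaa"], dtype=object),
            np.array(["b", "bb"], dtype=object),
            np.array(["c"], dtype=object),
        ],
    }
)

# Convert every ndarray cell back to a plain Python list.
df2["str_list"] = df2["str_list"].map(lambda arr: arr.tolist())

# The column can now be handed to code that expects a list of lists.
corpus = df2["str_list"].tolist()
```

This is a workaround only: it costs one extra pass over the column after every read.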

Expected Behavior

If a column of lists is written to a feather or parquet file, it should be read back as a column of lists, not numpy.ndarray.

FWIW, to_pickle followed by read_pickle does not have this issue, as the code below shows:

df.to_pickle(tempfile.gettempdir()+"/df.pickle")
df3 = pd.read_pickle(tempfile.gettempdir()+"/df.pickle")
type(df3['str_list'][0])

Installed Versions

list
numpy.ndarray

INSTALLED VERSIONS

commit : 91111fd
python : 3.9.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.16-200.fc36.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Sun Oct 16 22:50:04 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : 2022.10.0
gcsfs : None
matplotlib : 3.6.2
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 10.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : None
xlwt : None
zstandard : None
tzdata : None


eshanja1n · 2022-11-10T21:54:01Z

take

kostyafarber · 2022-12-09T21:56:39Z

I'm not able to reproduce this. I get the following error instead:

TypeError: Argument 'table' has incorrect type (expected pyarrow.lib.Table, got DataFrame)

Mostly-BSD added the Bug and Needs Triage labels on Nov 10, 2022

github-actions bot assigned eshanja1n Nov 10, 2022

phofl added the IO Parquet label and removed the Needs Triage label on Nov 14, 2022
