Code Sample
import pandas as pd  # v0.24.2
import scipy.sparse  # v1.1.0

df = pd.SparseDataFrame(scipy.sparse.random(1000, 1000),
                        columns=list(map(str, range(1000))),
                        default_fill_value=0.0)
df.to_parquet('rpd.pq', engine='pyarrow')
Running this raises the following error:
ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column 0 with type Sparse[float64, 0.0]')
Problem description
This error occurs when trying to save a pandas sparse DataFrame with the to_parquet method. It can be avoided by running df.to_dense().to_parquet(). However, densifying first can require a lot of memory for very large sparse matrices.
The issue was also raised in apache/arrow#1894 and #20692.
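One possible interim workaround that avoids densifying is to store only the non-zero entries as (row, col, value) triplets in an ordinary dense DataFrame, which to_parquet can handle. The sketch below is an assumption on my part, not something pandas or pyarrow provides: the helper names to_triplets, from_triplets, and save_triplets are hypothetical, and the approach starts from the scipy matrix rather than the SparseDataFrame. It also drops the column labels and the matrix shape, so those would need to be stored separately.

```python
import pandas as pd
import scipy.sparse


def to_triplets(matrix):
    """Return a long-format DataFrame with one row per stored entry."""
    coo = scipy.sparse.coo_matrix(matrix)
    return pd.DataFrame({"row": coo.row, "col": coo.col, "value": coo.data})


def from_triplets(triplets, shape):
    """Rebuild a scipy CSR matrix from the long-format frame."""
    rows = triplets["row"].to_numpy()
    cols = triplets["col"].to_numpy()
    values = triplets["value"].to_numpy()
    return scipy.sparse.coo_matrix((values, (rows, cols)), shape=shape).tocsr()


def save_triplets(matrix, path):
    """Write the triplets to parquet (hypothetical helper, needs pyarrow)."""
    to_triplets(matrix).to_parquet(path, engine="pyarrow")


# Same kind of matrix as in the code sample (scipy's default density is 0.01).
matrix = scipy.sparse.random(1000, 1000)
triplets = to_triplets(matrix)
restored = from_triplets(triplets, matrix.shape)
```

Calling save_triplets(matrix, 'rpd.pq') would then write a three-column parquet file whose size scales with the number of non-zero entries rather than with the full 1000 x 1000 shape. Reading it back with pd.read_parquet and passing the result to from_triplets, along with the original shape, should recover the matrix.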
Expected Output
The expected output is a parquet file on disk.
INSTALLED VERSIONS
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 3.9.1
pip: 19.0.3
setuptools: 40.2.0
Cython: None
numpy: 1.16.3
scipy: 1.1.0
pyarrow: 0.13.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.1.2
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None