Closed
GitHub Issues for Apache Arrow
I opened a Pandas issue but they closed it and referred me here. I was hoping to use Parquet files as a way to share pandas.SparseDataFrame objects, but it appears that the current to_parquet method fails with columns of different lengths:
import pandas as pd # v0.22.0
import scipy.sparse # v1.0.1
rpd = pd.SparseDataFrame(scipy.sparse.random(1000, 1000),
                         columns=list(map(str, range(1000))),
                         default_fill_value=0.0)
rpd.to_parquet('rpd.pq')
---------------------------------------------------------------------------
ArrowIOError Traceback (most recent call last)
<ipython-input-65-1aeaae9e36a0> in <module>()
4 columns=list(map(str, range(1000))),
5 default_fill_value=0.0)
----> 6 rpd.to_parquet('rpd.pq')
...
ArrowIOError: Column 8 had 4 while previous column had 8
Poking around, Pandas is just passing things straight into pyarrow, so I guess there's no support for sparse matrices at the moment? This seems like it'd be a nice use case because the columns can be heavily compressed, but the current implementation needs a dense version of the data, which means a round trip through a huge dense copy in memory.
Are there plans to improve this support, or is this not a good use for the format?
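To illustrate the kind of layout I have in mind, here is a minimal pure-Python sketch. It assumes the error above comes from each sparse column storing only its non-fill values, so the stored lengths differ from column to column; I have not confirmed that this is what pyarrow sees. The helper names (random_sparse_columns, to_triplets, from_triplets) are hypothetical and not part of pandas or pyarrow. Flattening to (row, col, value) triplets gives three equal-length columns, which a columnar format could presumably store without densifying:

```python
import random


def random_sparse_columns(n_rows, n_cols, density, seed=0):
    """Hypothetical stand-in for a sparse frame: one {row: value} dict per
    column, holding only the non-zero entries."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_cols):
        # Offset by 0.1 so that every stored value is strictly non-zero.
        col = {r: rng.random() + 0.1
               for r in range(n_rows) if rng.random() < density}
        columns.append(col)
    return columns


def to_triplets(columns):
    """Flatten the per-column dicts into three equal-length lists."""
    rows, cols, values = [], [], []
    for c, col in enumerate(columns):
        for r, v in sorted(col.items()):
            rows.append(r)
            cols.append(c)
            values.append(v)
    return {"row": rows, "col": cols, "value": values}


def from_triplets(triplets, n_cols):
    """Rebuild the per-column dicts from the triplet lists."""
    columns = [{} for _ in range(n_cols)]
    for r, c, v in zip(triplets["row"], triplets["col"], triplets["value"]):
        columns[c][r] = v
    return columns


columns = random_sparse_columns(1000, 1000, density=0.01)
# The set of distinct stored lengths; more than one means ragged columns.
stored_lengths = {len(col) for col in columns}
triplets = to_triplets(columns)
restored = from_triplets(triplets, n_cols=1000)
```

The triplet form costs two extra integers per stored value, but the total stays proportional to the number of non-zero entries rather than to rows times columns.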