Support for sparse dataframes #1894

Closed
@jamestwebber

Description

I opened a Pandas issue but they closed it and referred me here. I was hoping to use parquet files as a way to share pandas.SparseDataFrame objects, but it appears that the current to_parquet method fails with columns of different lengths:

import pandas as pd # v0.22.0
import scipy.sparse # v1.0.1

rpd = pd.SparseDataFrame(scipy.sparse.random(1000, 1000), 
                         columns=list(map(str, range(1000))),
                         default_fill_value=0.0)
rpd.to_parquet('rpd.pq')
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-65-1aeaae9e36a0> in <module>()
      4                          columns=list(map(str, range(1000))),
      5                          default_fill_value=0.0)
----> 6 rpd.to_parquet('rpd.pq')

...

ArrowIOError: Column 8 had 4 while previous column had 8

Poking around, it looks like Pandas is just passing things straight into pyarrow (presumably each sparse column only hands over its stored non-fill values, which is why the column lengths don't match), so I guess there's no support for sparse matrices at the moment? This seems like it would be a nice use case, because the columns can be compressed heavily, but the current implementation requires converting to a dense version first, which means a round trip through an enormous amount of memory.

Are there plans to improve this support, or is this not a good use for the format?
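For reference, the workaround I've been sketching in the meantime is to store the matrix in COO form as three ordinary dense columns, which Parquet compresses well and which avoids densifying the whole matrix. This is only a sketch, not anything built into pandas or pyarrow; names like coo_df and the file name are placeholders.

import pandas as pd
import scipy.sparse

# Build the same random sparse matrix, kept in COO form (scipy's default).
m = scipy.sparse.random(1000, 1000, format='coo')

# Store the nonzero entries as three dense columns; no densification needed.
coo_df = pd.DataFrame({'row': m.row, 'col': m.col, 'data': m.data})
coo_df.to_parquet('rpd_coo.pq')

# Read it back and rebuild the sparse matrix from the triplets.
back = pd.read_parquet('rpd_coo.pq')
m2 = scipy.sparse.coo_matrix(
    (back['data'], (back['row'], back['col'])), shape=(1000, 1000))

That works, but it loses the DataFrame column labels and the default fill value, so native sparse support in the format would still be much nicer.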
