Support pydata/sparse arrays in DataFrame #33182
Labels
Closing Candidate: May be closeable, needs more eyeballs
Enhancement
ExtensionArray: Extending pandas with custom dtypes or arrays.
Internals: Related to non-user accessible pandas implementation
Needs Discussion: Requires discussion from core team before further action
Sparse: Sparse Data Type
This is a discussion topic for adding the ability to store pydata/sparse ndarrays inside a DataFrame. It's not proposing that we actually do this at this point.
Background
sparse implements a sparse ndarray that implements (much of) the NumPy API. This differs from scipy.sparse matrices, which are strictly 2D and have their own API. It also differs from pandas' SparseArray, which is strictly 1D and implements the ExtensionArray interface.
Motivation
In some workflows (especially machine learning) it's common to convert a dense 1D array to a sparse 2D array. The sparse 2D array is often very wide, and so isn't well-suited to storage in a DataFrame with SparseArray values. Each column of the 2D table needs to be stored independently, which at a minimum requires one Python object per column, and makes it difficult (but not impossible) to have these 1D arrays be views on some 2D object.
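To make the width problem concrete, here is a minimal pure-Python sketch of one-hot encoding a 1D column into a 2D coordinate (COO) layout. The helper `one_hot_coo` and the sample labels are hypothetical and only illustrate the idea; this is not the pydata/sparse API, and no pandas internals are involved.

```python
def one_hot_coo(labels):
    """One-hot encode a 1D sequence into COO form: (rows, cols, shape)."""
    categories = sorted(set(labels))
    col_of = {cat: j for j, cat in enumerate(categories)}
    rows = list(range(len(labels)))        # one nonzero per input row
    cols = [col_of[x] for x in labels]     # column index of that nonzero
    shape = (len(labels), len(categories))
    return rows, cols, shape


labels = ["a", "c", "b", "a", "d", "c"]    # illustrative data only
rows, cols, shape = one_hot_coo(labels)

n_stored = len(rows)                       # values a 2D sparse array keeps
n_dense = shape[0] * shape[1]              # cells a dense 2D array would hold

# Storing the same data as independent 1D columns needs one container
# per column, however few nonzeros each column holds.
per_column = {j: [r for r, c in zip(rows, cols) if c == j]
              for j in range(shape[1])}

print(shape, n_stored, n_dense, len(per_column))
```

With a single 2D structure there is one object regardless of width; the per-column split grows with the number of distinct categories, which is the overhead described above.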
Since sparse implements the ndarray interface, we can in theory just store the sparse.COO array where we normally store a 2D ndarray, inside a Block. Indeed, with a minor change, this works, which lets us store the 2D array.
However, many things don't work. Notably, asarray(arr) will raise, because sparse doesn't allow implicit conversion from sparse to dense. This affects things like the DataFrame repr.
So this would primarily be useful for storing data, at least initially.
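The refusal to densify implicitly can be sketched with a toy class. `ToySparse` below is a hypothetical stand-in, not the pydata/sparse implementation; it only assumes the general mechanism that NumPy's asarray consults an object's `__array__` hook, and that an array can refuse conversion by raising there while still offering an explicit method (named `todense` here, following pydata/sparse). The hook is called directly so the sketch needs no dependencies.

```python
class ToySparse:
    """Hypothetical stand-in for a sparse 2D array (not the real API)."""

    def __init__(self, coords, shape):
        self.coords = dict(coords)   # {(row, col): value}
        self.shape = shape

    def __array__(self, dtype=None, copy=None):
        # Implicit conversions land here; raising blocks silent densification.
        raise RuntimeError("implicit densification refused; call todense()")

    def todense(self):
        # Explicit, opt-in conversion to a dense nested list.
        n_rows, n_cols = self.shape
        return [[self.coords.get((i, j), 0) for j in range(n_cols)]
                for i in range(n_rows)]


arr = ToySparse({(0, 1): 5, (2, 0): 7}, shape=(3, 2))

try:
    arr.__array__()      # what an implicit conversion would trigger
    implicit_ok = True
except RuntimeError:
    implicit_ok = False

dense = arr.todense()    # the explicit route still works
print(implicit_ok, dense)
```

Any code path that assumes it can freely obtain a dense ndarray, such as a repr, would hit the same error unless it opts in to the explicit conversion.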
Arguments Against
The biggest argument against allowing this is that pandas is potentially moving to a column store in the future. In this future, we don't have 2D blocks, so the value of a 2D sparse array diminishes. We may be able to do similar tricks as we'll do with numpy.ndarray, where 1D columns are views on a 2D object.
The second argument against is that we could potentially make the EA interface 2D, and implement an EA-compatible wrapper around a pydata/sparse array (similar to PandasArray).
Finally, we can't really hope to offer "full" support for sparse-backed columns. Operations such as joins not working on sparse columns will cause user confusion that may be hard to document and explain.
cc @adrinjalali and @hameerabbasi, just for awareness. Most of the discussion will likely be on the pandas side of things though.