
Support pydata/sparse arrays in DataFrame #33182

Open
TomAugspurger opened this issue Mar 31, 2020 · 6 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action Sparse Sparse Data Type

Comments


TomAugspurger commented Mar 31, 2020

This is a discussion topic for adding the ability to store pydata/sparse ndarrays inside a DataFrame. It's not proposing that we actually do this at this point.

Background

sparse provides a sparse ndarray that implements (much of) the NumPy API. This differs from scipy.sparse matrices, which are strictly 2D and have their own API. It also differs from pandas' SparseArray, which is strictly 1D and implements the ExtensionArray interface.

Motivation

In some workflows (especially machine learning) it's common to convert a dense 1D array to a sparse 2D array. The sparse 2D array is often very wide, and so isn't well-suited to storage in a DataFrame with SparseArray values. Each column of the 2D table needs to be stored independently, which at a minimum requires one Python object per column, and makes it difficult (but not impossible) to have these 1D arrays be views on some 2D object.
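For illustration, here's what the per-column storage looks like today (a sketch using scipy.sparse and pandas' from_spmatrix, neither of which is part of this proposal):

```python
import pandas as pd
import scipy.sparse

# A 2D scipy.sparse matrix has to be split into one 1D SparseArray
# per column when it goes into a DataFrame.
mat = scipy.sparse.random(10, 5, density=0.1, format="csc", random_state=0)
df = pd.DataFrame.sparse.from_spmatrix(mat)

assert df.shape == (10, 5)
# Each column is an independent SparseArray object rather than a
# view on the original 2D matrix.
assert all(isinstance(df[c].array, pd.arrays.SparseArray) for c in df.columns)
```

For a very wide matrix (say, the output of a one-hot encoder), that's one Python object per column, which is the overhead described above.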

Since sparse implements the ndarray interface, we can in theory just store the sparse.COO array where we normally store a 2D ndarray, inside our Block. Indeed, with a minor change, this works:

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 8deeb415c1..105c1c1a64 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -427,6 +427,7 @@ class DataFrame(NDFrame):
         dtype: Optional[Dtype] = None,
         copy: bool = False,
     ):
+        import sparse
         if data is None:
             data = {}
         if dtype is not None:
@@ -459,7 +460,7 @@ class DataFrame(NDFrame):
                     data = data.copy()
                 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
 
-        elif isinstance(data, (np.ndarray, Series, Index)):
+        elif isinstance(data, (np.ndarray, Series, Index, sparse.COO)):
             if data.dtype.names:
                 data_columns = list(data.dtype.names)
                 data = {k: data[k] for k in data_columns}

This lets us store the 2D array:

In [1]: import sparse

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(sparse.zeros((10, 4)))

In [4]: df._data.blocks[0]
Out[4]: FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [5]: df._data.blocks[0].values
Out[5]: <COO: shape=(4, 10), dtype=float64, nnz=0, fill_value=0.0>

However, many things don't work. Notably

  1. Anything calling np.asarray(arr) will raise, since sparse doesn't allow implicit conversion from sparse to dense. This includes things like the DataFrame repr.
  2. Anything touching Cython (groupby, join, factorize, etc.) will likely raise.

So this would primarily be useful for storing data, at least initially.

Arguments Against

The biggest argument against allowing this is that pandas is potentially moving to a column store in the future. In that future we won't have 2D blocks, so the value of a 2D sparse array diminishes. We may be able to use the same tricks we'll use with numpy.ndarray, where 1D columns are views on a 2D object.

The second argument against is that we could potentially make the EA interface 2D, and implement an EA compatible wrapper around a pydata/sparse array (similar to PandasArray).

Finally, we can't really hope to offer "full" support for sparse-backed columns. Things like joins not working on sparse columns will cause user confusion that may be hard to document and explain.

cc @adrinjalali and @hameerabbasi, just for awareness. Most of the discussion will likely be on the pandas side of things though.

@TomAugspurger TomAugspurger added Internals Related to non-user accessible pandas implementation Sparse Sparse Data Type Needs Discussion Requires discussion from core team before further action labels Mar 31, 2020
@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Mar 31, 2020

jreback commented Mar 31, 2020

sounds like a good argument for supporting 2-D EA. We may find that the current BlockManager actually is a reasonable compromise on efficiency for 2-D.

hameerabbasi commented

One other argument against is that PyData/Sparse doesn’t support object arrays very nicely.

jbrockmendel commented

However, many things don't work. Notably [...]

AFAICT getting these many things to work would be basically equivalent to re-implementing EA. We would be better off with just supporting 2D EAs.


TomAugspurger commented Apr 17, 2020

At least for scikit-learn's use case, storing the sparse array is sufficient. Their workflow will be

DataFrame -> array -> transform -> DataFrame -> array -> transform -> DataFrame

They're immediately extracting the array (singular) from the DataFrame.

jorisvandenbossche commented

Related to the ongoing mailing list discussion about a simplified BlockManager, where I proposed an opt-in "lazy" constructor that delays creating a BlockManager until it is actually accessed.
If we add functionality like that, it could also support sparse arrays in addition to numpy arrays.

https://mail.python.org/pipermail/pandas-dev/2020-May/001228.html

jbrockmendel commented

has this format caught on?

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Dec 15, 2023
6 participants