SparseArray is an ExtensionArray #22325
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -380,6 +380,37 @@ is the case with :attr:`Period.end_time`, for example | |
|
||
p.end_time | ||
|
||
.. _whatsnew_0240.api_breaking.sparse_values: | ||
|
||
Sparse Data Structure Refactor | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
``SparseArray``, the array backing ``SparseSeries`` and the columns in a ``SparseDataFrame``, | ||
is now an extension array (:issue:`21978`, :issue:`19056`, :issue:`22835`). | ||
To conform to this interface and for consistency with the rest of pandas, some API breaking | ||
changes were made: | ||
|
||
- ``SparseArray`` is no longer a subclass of :class:`numpy.ndarray`. To convert a SparseArray to a NumPy array, use :meth:`numpy.asarray`. | ||
- ``SparseArray.dtype`` and ``SparseSeries.dtype`` are now instances of :class:`SparseDtype`, rather than ``np.dtype``. Access the underlying dtype with ``SparseDtype.subtype``. | ||
- :meth:`numpy.asarray(sparse_array)` now returns a dense array with all the values, not just the non-fill-value values (:issue:`14167`) | ||
- ``SparseArray.take`` now matches the API of :meth:`pandas.api.extensions.ExtensionArray.take` (:issue:`19506`): | ||
|
||
* The default value of ``allow_fill`` has changed from ``False`` to ``True``. | ||
|
||
  * The ``out`` and ``mode`` parameters are no longer accepted (previously, this raised if they were specified). | ||
* Passing a scalar for ``indices`` is no longer allowed. | ||
|
||
|
||
- The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a ``SparseSeries``. | ||
|
||
- ``SparseDataFrame.combine`` and ``DataFrame.combine_first`` no longer support combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype SparseArray. | ||
- Setting :attr:`SparseArray.fill_value` to a fill value with a different dtype is now allowed. | ||
Review comment: I can't remember if I asked before, but do we actually want this?
I don't think the above makes much sense, so not sure this is good to allow. For me it seems logical to restrict the fill_value of the same dtype as the data.

Review comment: The somewhat strange thing is that on master we do allow that in the SparseArray constructor

    In [13]: s = pd.SparseArray([1, 2, 0], fill_value=np.nan)
    In [14]: s
    Out[14]:
    [1, 2, 0]
    Fill: nan
    IntIndex
    Indices: array([0, 1, 2], dtype=int32)

I don't have strong opinions here, other than that people shouldn't be setting

Review comment: i agree the fill type should match the dtype but since missing value support is allowed here it is prob ok.
||
|
||
|
||
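The ``take`` semantics described in the list above can be illustrated on a plain Python list. This is a minimal sketch, not pandas code: ``take_sketch`` is a hypothetical helper, the exception types are choices made for this example, and ``allow_fill`` is made a required keyword so that the sketch does not depend on the default discussed above. It follows the documented ``ExtensionArray.take`` convention that, with ``allow_fill=True``, ``-1`` marks a missing slot and any other negative index is an error, while with ``allow_fill=False`` negative indices count from the end.

```python
def take_sketch(values, indices, *, allow_fill, fill_value=None):
    """Toy model of ExtensionArray.take semantics (hypothetical helper)."""
    if isinstance(indices, int):
        # Mirrors "passing a scalar for indices is no longer allowed".
        raise ValueError("'indices' must be a sequence, not a scalar")
    n = len(values)
    result = []
    for i in indices:
        if allow_fill and i < 0:
            if i != -1:
                raise ValueError(
                    "with allow_fill=True, -1 is the only negative index allowed"
                )
            # -1 means "missing": emit the fill value instead of an element.
            result.append(fill_value)
            continue
        if not -n <= i < n:
            raise IndexError("index {} is out of bounds for length {}".format(i, n))
        # With allow_fill=False, negative indices count from the end.
        result.append(values[i])
    return result


data = [10, 20, 30]
print(take_sketch(data, [0, -1], allow_fill=True))   # [10, None]
print(take_sketch(data, [0, -1], allow_fill=False))  # [10, 30]
```

The same index list gives different results depending on ``allow_fill``, which is why a change in the default is listed as API breaking.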
Some new warnings are issued for operations that require or are likely to materialize a large dense array: | ||
|
||
- A :class:`errors.PerformanceWarning` is issued when using fillna with a ``method``, as a dense array is constructed to create the filled array. Filling with a ``value`` is the efficient way to fill a sparse array. | ||
|
||
- A :class:`errors.PerformanceWarning` is now issued when concatenating sparse Series with differing fill values. The fill value from the first sparse array continues to be used. | ||
|
||
In addition to these API breaking changes, many :ref:`performance improvements and bug fixes have been made <whatsnew_0240.bug_fixes.sparse>`. | ||
|
||
.. _whatsnew_0240.api_breaking.frame_to_dict_index_orient: | ||
|
||
Raise ValueError in ``DataFrame.to_dict(orient='index')`` | ||
|
@@ -573,6 +604,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your | |
- Added :meth:`pandas.api.types.register_extension_dtype` to register an extension type with pandas (:issue:`22664`) | ||
- Series backed by an ``ExtensionArray`` now work with :func:`util.hash_pandas_object` (:issue:`23066`) | ||
- Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`) | ||
- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`). | ||
Review comment: really? we allow this. I agree this would be ok, but is reasonably tested?

Review comment: Define reasonable :) I'm reasonably sure there are places in pandas where we assume we have an ndarray, but may get an ExtensionArray instead. The common case of

Review comment: In any case, if you create the masks in other ways than
||
- Support for reduction operations such as ``sum``, ``mean`` via opt-in base class method override (:issue:`22762`) | ||
|
||
.. _whatsnew_0240.api.incompatibilities: | ||
|
@@ -655,6 +687,7 @@ Other API Changes | |
- :class:`pandas.io.formats.style.Styler` supports a ``number-format`` property when using :meth:`~pandas.io.formats.style.Styler.to_excel` (:issue:`22015`) | ||
- :meth:`DataFrame.corr` and :meth:`Series.corr` now raise a ``ValueError`` along with a helpful error message instead of a ``KeyError`` when supplied with an invalid method (:issue:`22298`) | ||
- :meth:`shift` will now always return a copy, instead of the previous behaviour of returning self when shifting by 0 (:issue:`22397`) | ||
- Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (:issue:`22784`) | ||
|
||
.. _whatsnew_0240.deprecations: | ||
|
||
|
@@ -896,13 +929,6 @@ Groupby/Resample/Rolling | |
- :func:`RollingGroupby.agg` and :func:`ExpandingGroupby.agg` now support multiple aggregation functions as parameters (:issue:`15072`) | ||
- Bug in :meth:`DataFrame.resample` and :meth:`Series.resample` when resampling by a weekly offset (``'W'``) across a DST transition (:issue:`9119`, :issue:`21459`) | ||
|
||
Sparse | ||
^^^^^^ | ||
|
||
- | ||
- | ||
- | ||
|
||
Reshaping | ||
^^^^^^^^^ | ||
|
||
|
@@ -921,6 +947,19 @@ Reshaping | |
- Bug in :func:`merge_asof` when merging on float values within defined tolerance (:issue:`22981`) | ||
- Bug in :func:`pandas.concat` when concatenating a multicolumn DataFrame with tz-aware data against a DataFrame with a different number of columns (:issue:`22796`) | ||
|
||
.. _whatsnew_0240.bug_fixes.sparse: | ||
|
||
Sparse | ||
|
||
^^^^^^ | ||
|
||
- Updating a boolean, datetime, or timedelta column to be Sparse now works (:issue:`22367`) | ||
Review comment: really, we have support for this? again i agree this is a nice feature, but we are decreasing support generally for sparse, so not anxious to advertise this
||
- Bug in :meth:`Series.to_sparse` with Series already holding sparse data not constructing properly (:issue:`22389`) | ||
- Providing a ``sparse_index`` to the SparseArray constructor no longer defaults the na-value to ``np.nan`` for all dtypes. The correct na_value for ``data.dtype`` is now used. | ||
- Bug in ``SparseArray.nbytes`` under-reporting its memory usage by not including the size of its sparse index. | ||
- Improved performance of :meth:`Series.shift` for non-NA ``fill_value``, as values are no longer converted to a dense array. | ||
- Bug in ``DataFrame.groupby`` not including ``fill_value`` in the groups for non-NA ``fill_value`` when grouping by a sparse column (:issue:`5078`) | ||
- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`) | ||
|
||
Build Changes | ||
^^^^^^^^^^^^^ | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -287,10 +287,25 @@ def astype(self, dtype, copy=True): | |
return np.array(self, dtype=dtype, copy=copy) | ||
|
||
def isna(self): | ||
# type: () -> np.ndarray | ||
"""Boolean NumPy array indicating if each value is missing. | ||
# type: () -> Union[ExtensionArray, np.ndarray] | ||
""" | ||
A 1-D array indicating if each value is missing. | ||
|
||
Returns | ||
------- | ||
na_values : Union[np.ndarray, ExtensionArray] | ||
In most cases, this should return a NumPy ndarray. For | ||
exceptional cases like ``SparseArray``, where returning | ||
an ndarray would be expensive, an ExtensionArray may be | ||
returned. | ||
|
||
Notes | ||
----- | ||
If returning an ExtensionArray, then | ||
|
||
This should return a 1-D array the same length as 'self'. | ||
* ``na_values._is_boolean`` should be True | ||
* `na_values` should implement :func:`ExtensionArray._reduce` | ||
Review comment: we should probably have an Indexing EA mixin that implements these as NotImplemented (so one can subclass)

Review comment: Such an indexing mixin might be useful, but how is this related to the line above?
||
* ``na_values.any`` and ``na_values.all`` should be implemented | ||
""" | ||
raise AbstractMethodError(self) | ||
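
The docstring above asks that an ExtensionArray returned from ``isna`` be boolean and support ``any`` and ``all``. The following is a minimal sketch of why a sparse mask can answer those reductions without building a dense array. ``SparseBoolMask`` and its attributes are hypothetical names for this illustration and are not part of pandas; only the ``_is_boolean`` attribute name is taken from the docstring.

```python
class SparseBoolMask:
    """Hypothetical boolean mask that stores only the positions whose
    value differs from ``fill_value``."""

    _is_boolean = True  # attribute named in the isna docstring

    def __init__(self, length, positions=(), fill_value=False):
        positions = sorted(set(positions))
        if positions and not (0 <= positions[0] and positions[-1] < length):
            raise IndexError("position out of bounds")
        self.length = length
        self.fill_value = bool(fill_value)
        self.positions = positions  # entries equal to ``not fill_value``

    def __len__(self):
        return self.length

    def any(self):
        # Answered from counts alone, without materializing the mask.
        if self.fill_value:
            return len(self.positions) < self.length
        return len(self.positions) > 0

    def all(self):
        if self.fill_value:
            return len(self.positions) == 0
        return len(self.positions) == self.length

    def to_dense(self):
        # Only needed when a caller really wants every value.
        flipped = set(self.positions)
        return [
            (not self.fill_value) if i in flipped else self.fill_value
            for i in range(self.length)
        ]


mask = SparseBoolMask(5, positions=[1, 3])
print(mask.any(), mask.all())  # True False
print(mask.to_dense())         # [False, True, False, True, False]
```

Both reductions run in constant time once the positions are stored, which is the kind of saving the "exceptional cases" wording in the docstring refers to.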
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -93,11 +93,13 @@ def _get_series_result_type(result, objs=None): | |
def _get_frame_result_type(result, objs): | ||
""" | ||
return appropriate class of DataFrame-like concat | ||
if all blocks are SparseBlock, return SparseDataFrame | ||
if all blocks are sparse, return SparseDataFrame | ||
otherwise, return 1st obj | ||
""" | ||
|
||
if result.blocks and all(b.is_sparse for b in result.blocks): | ||
if (result.blocks and ( | ||
all(is_sparse(b) for b in result.blocks) or | ||
Review comment: related to my comment above. cannot is_sparse not simply check if its an EA and if it has a Sparse Dtype? then you simply need to pass the

Review comment: I'll give that a shot.

Review comment: can you add a comment here, its not obvious what you are doing

Review comment: how can obj be a SparseFrame here? is this tested?

Review comment: I think a comment of mine may have been lost. This is hit in several places (e.g. What part can I clarify here?
||
all(isinstance(obj, ABCSparseDataFrame) for obj in objs))): | ||
from pandas.core.sparse.api import SparseDataFrame | ||
return SparseDataFrame | ||
else: | ||
|
@@ -554,61 +556,23 @@ def _concat_sparse(to_concat, axis=0, typs=None): | |
a single array, preserving the combined dtypes | ||
""" | ||
|
||
from pandas.core.sparse.array import SparseArray, _make_index | ||
from pandas.core.sparse.array import SparseArray | ||
|
||
|
||
def convert_sparse(x, axis): | ||
# coerce to native type | ||
if isinstance(x, SparseArray): | ||
x = x.get_values() | ||
else: | ||
x = np.asarray(x) | ||
x = x.ravel() | ||
if axis > 0: | ||
x = np.atleast_2d(x) | ||
return x | ||
fill_values = [x.fill_value for x in to_concat | ||
if isinstance(x, SparseArray)] | ||
|
||
if typs is None: | ||
typs = get_dtype_kinds(to_concat) | ||
if len(set(fill_values)) > 1: | ||
raise ValueError("Cannot concatenate SparseArrays with different " | ||
"fill values") | ||
|
||
|
||
if len(typs) == 1: | ||
# concat input as it is if all inputs are sparse | ||
# and have the same fill_value | ||
fill_values = {c.fill_value for c in to_concat} | ||
if len(fill_values) == 1: | ||
sp_values = [c.sp_values for c in to_concat] | ||
indexes = [c.sp_index.to_int_index() for c in to_concat] | ||
|
||
indices = [] | ||
loc = 0 | ||
for idx in indexes: | ||
indices.append(idx.indices + loc) | ||
loc += idx.length | ||
sp_values = np.concatenate(sp_values) | ||
indices = np.concatenate(indices) | ||
sp_index = _make_index(loc, indices, kind=to_concat[0].sp_index) | ||
|
||
return SparseArray(sp_values, sparse_index=sp_index, | ||
fill_value=to_concat[0].fill_value) | ||
|
||
# input may be sparse / dense mixed and may have different fill_value | ||
# input must contain sparse at least 1 | ||
sparses = [c for c in to_concat if is_sparse(c)] | ||
fill_values = [c.fill_value for c in sparses] | ||
sp_indexes = [c.sp_index for c in sparses] | ||
|
||
# densify and regular concat | ||
to_concat = [convert_sparse(x, axis) for x in to_concat] | ||
result = np.concatenate(to_concat, axis=axis) | ||
|
||
if not len(typs - {'sparse', 'f', 'i'}): | ||
# sparsify if inputs are sparse and dense numerics | ||
# first sparse input's fill_value and SparseIndex is used | ||
result = SparseArray(result.ravel(), fill_value=fill_values[0], | ||
kind=sp_indexes[0]) | ||
else: | ||
# coerce to object if needed | ||
result = result.astype('object') | ||
return result | ||
fill_value = fill_values[0] | ||
|
||
# TODO: Fix join unit generation so we aren't passed this. | ||
to_concat = [x if isinstance(x, SparseArray) | ||
else SparseArray(x.squeeze(), fill_value=fill_value) | ||
for x in to_concat] | ||
|
||
return SparseArray._concat_same_type(to_concat) | ||
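
The control flow of the rewritten ``_concat_sparse`` shown in this hunk can be sketched with a toy container. This is an illustration under stated assumptions, not the pandas implementation: ``ToySparse`` and ``concat_sparse_sketch`` are hypothetical names, the storage is a plain dict, and integer fill values are used so that equality comparison is unambiguous. It follows the three steps visible in the diff: collect the fill values of the sparse inputs, raise if they differ, then wrap dense inputs with the first fill value before concatenating.

```python
class ToySparse:
    """Hypothetical sparse container: stores only values != fill_value."""

    def __init__(self, values, fill_value):
        self.fill_value = fill_value
        self.length = len(values)
        self.sp = {i: v for i, v in enumerate(values) if v != fill_value}

    def to_dense(self):
        return [self.sp.get(i, self.fill_value) for i in range(self.length)]


def concat_sparse_sketch(to_concat):
    # At least one input is assumed to be sparse, as in the original code.
    fill_values = [x.fill_value for x in to_concat if isinstance(x, ToySparse)]
    if len(set(fill_values)) > 1:
        raise ValueError(
            "Cannot concatenate sparse inputs with different fill values"
        )
    fill_value = fill_values[0]

    # Dense inputs are wrapped using the shared fill value.
    wrapped = [
        x if isinstance(x, ToySparse) else ToySparse(list(x), fill_value)
        for x in to_concat
    ]

    # Concatenate by shifting stored positions; nothing is densified here.
    result = ToySparse([], fill_value)
    offset = 0
    for x in wrapped:
        for i, v in x.sp.items():
            result.sp[i + offset] = v
        offset += x.length
    result.length = offset
    return result


out = concat_sparse_sketch([ToySparse([0, 1, 0], 0), [2, 0]])
print(out.to_dense())  # [0, 1, 0, 2, 0]
print(sorted(out.sp))  # [1, 3]
```

The position-shifting loop plays the role that the removed ``indices.append(idx.indices + loc)`` code played before the refactor.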
|
||
|
||
def _concat_rangeindex_same_dtype(indexes): | ||
|