Skip to content

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Apr 14, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

I think this is mostly interesting in that it allows normalize=True for value_counts on an IntegerArray backed Series, which currently doesn't work:

[ins] In [1]: s = pd.Series([1, 2, 3], dtype="Int64")

[ins] In [2]: s.value_counts(normalize=True)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-2bf1a78353e5> in <module>
----> 1 s.value_counts(normalize=True)

~/pandas/pandas/core/base.py in value_counts(self, normalize, sort, ascending, bins, dropna)
   1252             normalize=normalize,
   1253             bins=bins,
-> 1254             dropna=dropna,
   1255         )
   1256         return result

~/pandas/pandas/core/algorithms.py in value_counts(values, sort, ascending, normalize, bins, dropna)
    725
    726     if normalize:
--> 727         result = result / float(counts.sum())
    728
    729     return result

AttributeError: 'IntegerArray' object has no attribute 'sum'


return result

def sum(self, skipna: bool = True, min_count: int = 0):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably should make the signature match PandasArray.sum etc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, and we still want to keep axis (even though it wouldn't be used here)?


@pytest.mark.parametrize("skipna", [True, False])
@pytest.mark.parametrize("min_count", [0, 4])
def test_integer_array_sum(skipna, min_count):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is e.g. Series[Int64].sum() or DataFrame[Int64].sum() fixed by this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are actually already working, just not for IntegerArray specifically

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the correct way to fix this. series/frames already dispatch to EAs via _reduce, which DO implement sum/prod/min/max. The issue is these functions need to be exposed via that dispatch mechanism, rather than writing more code to actually do the reduction.

@jreback
Copy link
Contributor

jreback commented Apr 15, 2020

furthermore the numpy_ EA do this in an opposite way, meaning that the ops themselves are defined, and _reduce dispatches TO them.

So we should decide on a single way forward here. I think the way integer array does this is better, e.g. in _reduce.

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Numeric Operations Arithmetic, Comparison, and Logical operations labels Apr 15, 2020
@jorisvandenbossche
Copy link
Member

As mentioned in #33351 (comment), we should probably have a general discussion about how to deal with those reduction methods in our EAs, otherwise those comments like Jeff's concern above (#33538) are going to keep coming up in each PR that adds something like this.

dtype=None,
out=None,
keepdims=False,
initial=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove all those unnecessary keywords and put them in a **kwargs ?

@jreback jreback added this to the 1.1 milestone Apr 24, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. @jorisvandenbossche agree we agree on the issues you mentioned, but happy with this for now.

@jorisvandenbossche jorisvandenbossche merged commit 77a0f19 into pandas-dev:master Apr 25, 2020
@jorisvandenbossche
Copy link
Member

thanks @dsaxton !

@jorisvandenbossche
Copy link
Member

@dsaxton would you like to open a general issue about this? (#33538 (comment))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants