ENH: Add DataFrameGroupBy.value_counts #44267

Merged
merged 140 commits into from
Dec 19, 2021
963b7e1
Add DataFrameGroupBy.value_counts
johnzangwill Nov 1, 2021
1f710e0
Update test_frame_value_counts.py
johnzangwill Nov 1, 2021
3531383
Catch axis=1
johnzangwill Nov 1, 2021
d7f733b
Add to base and tab_completion
johnzangwill Nov 1, 2021
eb067ec
Line too long
johnzangwill Nov 1, 2021
a6a07d1
Update test_frame_value_counts.py
johnzangwill Nov 1, 2021
6a22a57
Add docstring
johnzangwill Nov 1, 2021
9492ee4
Update generic.py
johnzangwill Nov 1, 2021
b9885fd
Update groupby.rst
johnzangwill Nov 1, 2021
e896879
generic.py types
johnzangwill Nov 1, 2021
6de9653
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 1, 2021
5b49322
Add observed parameter
johnzangwill Nov 1, 2021
651b20b
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 1, 2021
26353ee
Change output name to "count" and deal with categorical data
johnzangwill Nov 3, 2021
9f44a6d
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 3, 2021
0e065b3
Update generic.py
johnzangwill Nov 3, 2021
b821fca
Add test_categorical
johnzangwill Nov 3, 2021
19d7257
Update test_frame_value_counts.py
johnzangwill Nov 3, 2021
71ee5f4
Add by=function test
johnzangwill Nov 3, 2021
1dd2db0
Update test_frame_value_counts.py
johnzangwill Nov 3, 2021
1c18d7d
Update test_frame_value_counts.py
johnzangwill Nov 3, 2021
f25e861
Update generic.py
johnzangwill Nov 3, 2021
faac0f0
Update test_frame_value_counts.py
johnzangwill Nov 3, 2021
3934042
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 4, 2021
4904c31
Merge branch 'master' into DataFrameGroupBy.value_counts
johnzangwill Nov 5, 2021
0f615da
Update v1.4.0.rst
johnzangwill Nov 5, 2021
c2db74f
Merge branch 'master' into DataFrameGroupBy.value_counts
johnzangwill Nov 6, 2021
ba793bb
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 6, 2021
221b76a
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 7, 2021
424d7a6
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 9, 2021
5216929
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 12, 2021
50d4c59
Reset index after sorting
johnzangwill Nov 14, 2021
9b2869f
Toughen up testing for groupers in keys
johnzangwill Nov 14, 2021
3de6132
De-numpy most of the tests
johnzangwill Nov 14, 2021
a9c2b83
Update test_frame_value_counts.py
johnzangwill Nov 15, 2021
0ad5ffb
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 15, 2021
6905bcd
Better detection of non-column grouping
johnzangwill Nov 15, 2021
0281539
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 15, 2021
eb9600f
Finish de-numpying the tests
johnzangwill Nov 15, 2021
f529714
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 15, 2021
0ae5218
Dropna changes
johnzangwill Nov 15, 2021
15e3167
Merge branch 'DataFrameGroupBy.value_counts' of https://github.com/jo…
johnzangwill Nov 15, 2021
dfa82cb
Update generic.py
johnzangwill Nov 15, 2021
6e2b06e
Add bad subset trap and test
johnzangwill Nov 15, 2021
2dc5972
Update generic.py
johnzangwill Nov 15, 2021
925d3ec
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 16, 2021
82730f1
Merge branch 'master' into DataFrameGroupBy.value_counts
johnzangwill Nov 17, 2021
d7b3149
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 17, 2021
8d8d9b0
Add more dropna tests and workaround Series bug
johnzangwill Nov 20, 2021
c12d831
Merge branch 'DataFrameGroupBy.value_counts' of https://github.com/jo…
johnzangwill Nov 20, 2021
57d3fb8
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 20, 2021
c431953
Reformat
johnzangwill Nov 20, 2021
4d10e47
Update generic.py
johnzangwill Nov 20, 2021
e4582ef
Typing fix
johnzangwill Nov 21, 2021
c948274
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 21, 2021
2a58c42
Update generic.py
johnzangwill Nov 21, 2021
e1596b1
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 22, 2021
df76279
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 22, 2021
04ebe65
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 22, 2021
f179fbb
Update test_frame_value_counts.py
johnzangwill Nov 22, 2021
98355d5
Replace self.as_index==False code with reset_index()
johnzangwill Nov 23, 2021
0be0150
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 23, 2021
25edd1e
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 23, 2021
97ab9c1
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 23, 2021
6ac9356
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 24, 2021
45b99af
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 24, 2021
0cbb3e2
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 25, 2021
fada9a9
Remove Series name and change column name
johnzangwill Nov 25, 2021
8e3f359
Change non_column_grouping
johnzangwill Nov 25, 2021
ca15937
Update test_frame_value_counts.py
johnzangwill Nov 25, 2021
f055323
Update generic.py
johnzangwill Nov 25, 2021
417958d
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 25, 2021
86a0df6
Correct docstring example
johnzangwill Nov 26, 2021
7d29bd4
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 26, 2021
32f4b6f
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 26, 2021
c13eef0
Improve bad subset message
johnzangwill Nov 26, 2021
2c2eb0a
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 26, 2021
7638086
Update generic.py
johnzangwill Nov 28, 2021
57b564b
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 28, 2021
aa3cb98
Update generic.py
johnzangwill Nov 28, 2021
5e5d7e7
Update generic.py
johnzangwill Nov 28, 2021
09cee2f
Add mixed grouping test
johnzangwill Nov 29, 2021
5838066
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 29, 2021
8f81bd2
Trigger CI
johnzangwill Nov 29, 2021
92cb494
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 29, 2021
95ccdb4
Trigger CI
johnzangwill Nov 29, 2021
085e8c9
Some refinements
rhshadrach Nov 30, 2021
9fcfbfe
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Nov 30, 2021
bb5f82a
Trigger CI
johnzangwill Nov 30, 2021
377cee0
Merge branch 'DataFrameGroupBy.value_counts' of https://github.com/jo…
johnzangwill Nov 30, 2021
c824f3e
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 2, 2021
92c718b
Merge pull request #5 from rhshadrach/DataFrameGroupBy.value_counts
johnzangwill Dec 2, 2021
14d8172
Add test_column_name_clashes
johnzangwill Dec 2, 2021
e26cba1
Update test_frame_value_counts.py
johnzangwill Dec 2, 2021
928a9d7
Update test_frame_value_counts.py
johnzangwill Dec 2, 2021
ad0f5b4
Trigger CI
johnzangwill Dec 2, 2021
e827cd3
Trigger CI
johnzangwill Dec 2, 2021
2c2b967
reset_index to cope with duplicate labels
johnzangwill Dec 3, 2021
51a3a3e
Update frame.py
johnzangwill Dec 3, 2021
2ee133e
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 3, 2021
ec2a2d4
Update test_put.py
johnzangwill Dec 3, 2021
9d330d1
Merge branch 'DataFrameGroupBy.value_counts' of https://github.com/jo…
johnzangwill Dec 3, 2021
8e4f3ed
Comment out tests that now pass
johnzangwill Dec 3, 2021
b2c61de
Update test_reset_index.py
johnzangwill Dec 3, 2021
392986d
Trigger CI
johnzangwill Dec 3, 2021
3b2ac58
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 4, 2021
34e6529
Update generic.py
johnzangwill Dec 4, 2021
91e1ff3
Update test_reset_index.py
johnzangwill Dec 4, 2021
06aaaeb
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 5, 2021
e062823
Improve test imports
johnzangwill Dec 5, 2021
a8b0fc5
Merge branch 'DataFrameGroupBy.value_counts' of https://github.com/jo…
johnzangwill Dec 5, 2021
493e3aa
Update test_frame_value_counts.py
johnzangwill Dec 5, 2021
548c45b
Update test_frame_value_counts.py
johnzangwill Dec 5, 2021
6c19ce2
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 5, 2021
6141f85
Revert changes to reset_index()
johnzangwill Dec 6, 2021
050f070
Update frame.py
johnzangwill Dec 6, 2021
d669af3
Add reset_index failure to test
johnzangwill Dec 6, 2021
6c0d7f8
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 6, 2021
de68836
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 8, 2021
c81adb6
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 10, 2021
08fd6ab
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 10, 2021
dc67009
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 10, 2021
d023579
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 11, 2021
71d9780
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 11, 2021
b93f47c
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 11, 2021
db31257
Add grouping test
johnzangwill Dec 12, 2021
124b1e9
Trigger CI
johnzangwill Dec 12, 2021
d613261
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 14, 2021
a776a3d
Trigger CI
johnzangwill Dec 14, 2021
4ef5ea0
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 14, 2021
0f0891f
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 15, 2021
5c1d021
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 16, 2021
fe58245
Update generic.py
johnzangwill Dec 17, 2021
11ad6ea
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 17, 2021
857e5be
Update generic.py
johnzangwill Dec 18, 2021
5b9d85a
Merge branch 'DataFrameGroupBy.value_counts' of https://github.com/jo…
johnzangwill Dec 18, 2021
e226547
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 18, 2021
c8f1731
Trigger CI
johnzangwill Dec 18, 2021
8f89580
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 18, 2021
ac38571
Merge branch 'pandas-dev:master' into DataFrameGroupBy.value_counts
johnzangwill Dec 19, 2021
Add observed parameter
johnzangwill committed Nov 1, 2021
commit 5b49322b300c7c620f692b5efc9e5db8bea3aa9e
3 changes: 2 additions & 1 deletion pandas/core/groupby/generic.py
@@ -1703,13 +1703,14 @@ def value_counts(
                 groupings,
                 as_index=self.as_index,
                 sort=self.sort,
+                observed=self.observed,
                 dropna=dropna,
             ).size()
             result.name = "size"
 
             if normalize:
                 indexed_group_size = df.groupby(
-                    grouper, sort=self.sort, dropna=dropna
+                    grouper, sort=self.sort, observed=self.observed, dropna=dropna
                 ).size()
                 if self.as_index:
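
The `observed` flag forwarded in this hunk only matters for categorical groupers. A minimal sketch of its effect on `size()`, assuming a made-up frame (the column names and data are illustrative and do not come from the PR):

```python
import pandas as pd

# Hypothetical data: "a" is categorical with one category ("z") that never occurs.
df = pd.DataFrame(
    {
        "a": pd.Categorical(["x", "y", "x"], categories=["x", "y", "z"]),
        "b": [1, 2, 3],
    }
)

# observed=False keeps an entry for every category, including the unused "z".
all_categories = df.groupby("a", observed=False).size()

# observed=True keeps only the categories that actually appear in the data.
seen_only = df.groupby("a", observed=True).size()

print(all_categories)
print(seen_only)
```

Passing the same `observed` value to both `groupby` calls in the hunk presumably keeps the group sizes used for normalization aligned with the counts they divide.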
Member

Can we pass as_index=True above always, use this block to normalize, and then call reset_index when as_index is False?

Contributor Author

I am not quite sure what you mean, but I don't think so. The as_index value results in a completely different data type and structure.
With the default as_index=True, size() and value_counts() return a Series.
With as_index=False, the result is a DataFrame with the same columns as the original plus a new one holding the results, which I have labeled "count".
So both normalize and sort have to be handled completely differently in the two cases.
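
A small sketch of the two return types described in this comment, assuming a pandas version that already ships `DataFrameGroupBy.value_counts` (the frame is made up for illustration):

```python
import pandas as pd

# Hypothetical data, chosen only to give a few repeated rows per group.
df = pd.DataFrame({"a": [1, 1, 2, 2, 2], "b": ["x", "y", "x", "x", "y"]})

# Default as_index=True: a Series indexed by the grouping key and the
# remaining columns.
as_series = df.groupby("a").value_counts()

# as_index=False: a DataFrame with the original columns plus a counts column.
as_frame = df.groupby("a", as_index=False).value_counts()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```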

Member

When is it the case that

result = df.groupby(keys, as_index=False).value_counts()
expected = df.groupby(keys).value_counts().reset_index()
pd.testing.assert_frame_equal(result, expected)

should raise? I'm seeing that this PR does raise because the order of the results differs, but I don't believe that is intentional.

Contributor Author

I wouldn't be surprised if they were different, since I use GroupBy.size() internally. I regarded value_counts() as a sort of extension of size(), and therefore following its behaviour in the two cases.

That being said, I have tried a variety of dfs and key combinations and can't reproduce the order change you describe. Please could you give me a complete example.

Member

> I wouldn't be surprised if they were different,...

Is that the behavior we want? My expectation is that as_index would simply change where the groups are placed (index vs columns), and not the order of the result. Am I missing something?

> I regarded value_counts() as a sort of extension of size(), and therefore following its behaviour in the two cases.

I cannot replicate this behavior (the value of as_index changing the result order) with .size, and if I could, I would think it is a bug.

> Please could you give me a complete example.

Sure! I've played around a bunch, and size, bins = 17, 3 in the code below seems to be minimal.

pd.DataFrame(
    {
        "a": list("10110210212102210"),
        "b": list("02112210220221000"),
        "c": list("01120201020021212"),
    }
).astype(int)

The frame above was produced by the following script:

import numpy as np
import pandas as pd
import pandas._testing as tm

# Search random frames for one where as_index=False disagrees with
# as_index=True followed by reset_index().
size, bins = 17, 3
for _ in range(1000):
    df = pd.DataFrame({k: np.random.randint(bins, size=size) for k in 'abc'})
    result = df.groupby(['a'], as_index=False).value_counts()
    expected = df.groupby(['a']).value_counts().reset_index()
    # Uncomment the next two lines to compare while ignoring row order:
    # result = result.sort_values(list('abc')).reset_index(drop=True)
    # expected = expected.sort_values(list('abc')).reset_index(drop=True)
    try:
        tm.assert_frame_equal(result, expected)
    except Exception:
        # Show both results side by side.
        print(pd.concat([result, expected], axis=1))
        # Encode each column as a digit string to build a compact reproducer.
        a = ''.join(df.a.astype(str).tolist())
        b = ''.join(df.b.astype(str).tolist())
        c = ''.join(df.c.astype(str).tolist())
        rep = (
            f'pd.DataFrame(\n'
            '    {\n'
            f'        "a": list("{a}"),\n'
            f'        "b": list("{b}"),\n'
            f'        "c": list("{c}"),\n'
            '    }\n'
            ').astype(int)\n'
        )
        # Check that the printed snippet really rebuilds the failing frame.
        tm.assert_frame_equal(eval(rep), df)
        print(rep)
        break
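
For reference, an order-insensitive version of the same comparison on the frame quoted above might look like the sketch below. It uses the public `pd.testing.assert_frame_equal`; the `canonical` helper and the column-label alignment are assumptions added for illustration, not part of the PR.

```python
import pandas as pd

# The frame quoted earlier in this comment.
df = pd.DataFrame(
    {
        "a": list("10110210212102210"),
        "b": list("02112210220221000"),
        "c": list("01120201020021212"),
    }
).astype(int)

result = df.groupby(["a"], as_index=False).value_counts()
expected = df.groupby(["a"]).value_counts().reset_index()

# Assumption: the label of the counts column produced by reset_index() can
# differ between pandas versions, so align the labels before comparing.
expected.columns = result.columns


def canonical(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: sort rows by the original columns, drop old labels."""
    return frame.sort_values(list("abc")).reset_index(drop=True)


# Passes when both results hold the same rows, regardless of their order.
pd.testing.assert_frame_equal(canonical(result), canonical(expected))
```

If this check passes while the unsorted comparison fails, the two results differ only in row order, which is the discrepancy under discussion.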

Member

@rhshadrach Nov 20, 2021

Also, if we do agree that the order of results should be the same for either value of as_index, then I think we can significantly simplify the implementation by following what _wrap_aggregated_output does: compute the result with as_index=True and then adjust the index as necessary afterwards.

if not self.as_index:
    # `not self.as_index` is only relevant for DataFrameGroupBy,
    # enforced in __init__
    self._insert_inaxis_grouper_inplace(result)
    result = result._consolidate()
    index = Index(range(self.grouper.ngroups))

Contributor Author

You convinced me! We are throwing away any initial index anyway.
I got rid of all the tricky frame code; it now works everything out with as_index=True and then calls reset_index() at the end.
At least it passes all of my tests...
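
The approach agreed on here (compute the indexed result first, then flatten it) can be sketched with public API only. `grouped_value_counts` is a hypothetical helper written for illustration; it is not the code in `generic.py`.

```python
import pandas as pd


def grouped_value_counts(df: pd.DataFrame, keys, as_index: bool = True):
    """Hypothetical helper, not the pandas implementation."""
    # Always compute the indexed (Series) form first.
    counts = df.groupby(keys).value_counts()
    if as_index:
        return counts
    # Move the index levels into columns as a final step, so the row order
    # is the same in both cases.
    return counts.reset_index(name="count")


# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 1, 2, 2, 2], "b": ["x", "x", "x", "y", "y"]})
indexed = grouped_value_counts(df, "a")
flat = grouped_value_counts(df, "a", as_index=False)
print(flat)
```

Because the flat result is derived from the indexed one, the two cannot disagree on ordering, which is the property the review asked for.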

if index_grouping: