ENH: Add DataFrameGroupBy.value_counts #44267
Merged: jreback merged 140 commits into pandas-dev:master from johnzangwill:DataFrameGroupBy.value_counts on Dec 19, 2021
Add observed parameter
commit 5b49322b300c7c620f692b5efc9e5db8bea3aa9e
Can we always pass `as_index=True` above, use this block to normalize, and then call `reset_index` when `as_index` is False?
I am not quite sure what you mean, but I don't think so. The `as_index` value results in a completely different data type and structure.
With the default `as_index=True`, `size()` and `value_counts()` return a Series.
With `as_index=False`, the result is a DataFrame with the same columns as the original plus a new one holding the results, which I have labeled "count".
So both normalize and sort have to be done completely differently in the two cases.
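To make the two result shapes concrete, here is a minimal sketch using `GroupBy.size()`, which the comment above names as the model for `value_counts()`. The sample frame is invented for illustration, and the sketch assumes a pandas version (1.1 or later) in which `size()` with `as_index=False` returns a DataFrame.

```python
import pandas as pd

# Invented sample data, not taken from the PR.
df = pd.DataFrame(
    {
        "gender": ["m", "f", "f", "m", "f"],
        "country": ["US", "FR", "US", "FR", "FR"],
    }
)

# Default as_index=True: the group keys become the index of a Series.
as_series = df.groupby("gender").size()

# as_index=False: the group keys stay as a column of a DataFrame,
# and the counts land in a new "size" column.
as_frame = df.groupby("gender", as_index=False).size()

print(type(as_series).__name__, list(as_series.index))
print(type(as_frame).__name__, list(as_frame.columns))
```

The counts are the same in both results; only the container and the location of the group keys differ.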
When is it the case that
should raise? I'm seeing this PR does raise due to the order of the results, but I don't believe that is intentional.
I wouldn't be surprised if they were different, since I use `GroupBy.size()` internally. I regarded `value_counts()` as a sort of extension of `size()`, and therefore as following its behaviour in the two cases.
That being said, I have tried a variety of DataFrames and key combinations and can't reproduce the order change you describe. Could you please give me a complete example?
Is that the behavior we want? My expectation is that `as_index` would simply change where the groups are placed (index vs columns), and not the order of the result. Am I missing something?
I cannot replicate this behavior (value of `as_index` changing result order) with `.size`, and if I could, would think it is a bug.
Sure! I've played around a bunch, and `size, bins = 17, 3` in the code below seems to be minimal.
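The expectation stated above (that `as_index` moves the groups between index and columns without reordering rows) can be checked for `.size` with a short sketch. This is not the reproduction referred to in the comment, which is not available here; the data below is invented, and the check only shows what "same order" means for one small frame.

```python
import pandas as pd

# Invented sample data; deliberately unsorted on the grouping keys.
df = pd.DataFrame(
    {
        "key": ["b", "a", "b", "c", "a", "b"],
        "val": [1, 2, 1, 1, 2, 3],
    }
)
keys = ["key", "val"]

indexed = df.groupby(keys).size()               # Series, groups in the index
flat = df.groupby(keys, as_index=False).size()  # DataFrame, groups in columns

# Same counts, in the same row order.
same_counts = flat["size"].tolist() == indexed.tolist()

# Same group labels, in the same row order.
same_groups = list(zip(flat["key"], flat["val"])) == list(indexed.index)

print(same_counts, same_groups)
```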
Also, if we do agree that the order of results should be the same for either value of `as_index`, then I think we can significantly simplify the implementation by following what `_wrap_aggregated_output` does: computing the result with `as_index=True` and then adjusting the index as necessary afterwards.
pandas/pandas/core/groupby/groupby.py
Lines 1135 to 1140 in 52c9181
You convinced me! We are throwing away any initial index anyway.
I got rid of all the tricky frame code; I now work it all out with `as_index=True` and then call `reset_index()` at the end.
At least it passes all of my tests...