BUG: SeriesGroupBy.value_counts - index name missing when applied on categorical column #44324

Closed
2 of 3 tasks
donok1 opened this issue Nov 5, 2021 · 3 comments · Fixed by #45625
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Categorical Categorical Data Type Groupby Index Related to the Index class or subclasses
donok1 commented Nov 5, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame({'gender': ['female', 'male', 'female', 'male', 'female', 'male'],
                  'education': ['low', 'medium', 'high', 'low', 'high', 'low'],
                  'country': ['US', 'FR', 'US', 'FR', 'FR', 'US']})
df.dtypes  
gender       object
education    object
country      object
dtype: object

df.groupby('country')['gender'].value_counts()

country  gender
FR       male      2
         female    1
US       female    2
         male      1
Name: gender, dtype: int64

df.groupby('country')['gender'].value_counts().index.names

FrozenList(['country', 'gender'])

Now, if the gender column is converted to a categorical dtype, the column name is no longer kept in the index:

df['gender'] = df['gender'].astype('category')
df.dtypes

gender       category
education      object
country        object
dtype: object

df.groupby('country')['gender'].value_counts()

country
FR       male      2
         female    1
US       female    2
         male      1
Name: gender, dtype: int64

df.groupby('country')['gender'].value_counts().index.names

FrozenList(['country', None])

Issue Description

The column name (here gender) is dropped when applying value_counts() on a column of type category.

It might be linked to other problems in handling categorical types: #44001

This worked in pandas 1.2.3. I can reproduce the bug starting with 1.3.0, and it is still present in 1.3.4.
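
Until a fixed release is available, one possible workaround (not part of the original report, just a sketch) is to set the index level names explicitly after the call. It assumes the level order is always the grouping key followed by the counted column, which matches the outputs shown above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "gender": ["female", "male", "female", "male", "female", "male"],
        "country": ["US", "FR", "US", "FR", "FR", "US"],
    }
)
df["gender"] = df["gender"].astype("category")

counts = df.groupby("country")["gender"].value_counts()

# Workaround sketch: name both index levels explicitly so that downstream
# code (reset_index, unstack, ...) does not depend on the pandas version.
counts = counts.rename_axis(["country", "gender"])

print(list(counts.index.names))
```

Because the names are assigned unconditionally, the same code should behave identically on affected and unaffected versions.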

Expected Behavior

The column name should be kept as the index level name, consistent with other dtypes, as in the first example above with an object-dtype column.
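
The expectation can be written as a small check that compares the index level names for both dtypes. This is only a sketch: the helper name `value_counts_index_names` is made up for illustration and is not part of pandas.

```python
import pandas as pd


def value_counts_index_names(frame, by, column):
    """Return the index level names of a grouped value_counts result."""
    result = frame.groupby(by)[column].value_counts()
    return list(result.index.names)


df = pd.DataFrame(
    {
        "gender": ["female", "male", "female"],
        "country": ["US", "FR", "US"],
    }
)

object_names = value_counts_index_names(df, "country", "gender")
categorical_names = value_counts_index_names(
    df.astype({"gender": "category"}), "country", "gender"
)

# With the expected behavior, both lists are identical; on an affected
# version the second entry of the categorical result is None instead.
print(object_names)
print(categorical_names)
print("consistent:", object_names == categorical_names)
```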

Installed Versions

pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 945c9ed
python : 3.9.6.final.0
python-bits : 64
OS : Darwin
OS-release : 21.1.0
Version : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.3.4
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 57.0.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : None
xlwt : None
numba : None

@donok1 donok1 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 5, 2021
@donok1 donok1 changed the title BUG: SeriesGroupBy.value_counts - name missing when applied on categorical column BUG: SeriesGroupBy.value_counts - index name missing when applied on categorical column Nov 5, 2021
@mroeschke mroeschke added Categorical Categorical Data Type Groupby Index Related to the Index class or subclasses and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 6, 2021
@NumberPiOso
Contributor

take

@NumberPiOso
Contributor

I found this minimal example:

In [1]: import pandas as pd
   ...: 
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "gender": ["female"],
   ...:         "country": ["US"],
   ...:     }
   ...: )
   ...: 
   ...: print(df.groupby("country")["gender"].value_counts())
   ...: df["gender"] = df["gender"].astype("category")
   ...: print("*" * 60)
   ...: print(df.groupby("country")["gender"].value_counts())
country  gender
US       female    1
Name: gender, dtype: int64
************************************************************
country        
US       female    1
Name: gender, dtype: int64

@NumberPiOso
Contributor

In particular, the problem is in the value_counts function in pandas/core/groupby/generic.py.

PR #38796 introduced the following code:

if bins is not None:
    if not np.iterable(bins):
        # scalar bins cannot be done at top level
        # in a backward compatible way
        return apply_series_value_counts()
elif is_categorical_dtype(val.dtype):
    # GH38672
    return apply_series_value_counts()

If I simply delete this code, the function produces the expected output:

In [1]: import pandas as pd
   ...: 
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "gender": ["female"],
   ...:         "country": ["US"],
   ...:     }
   ...: )
   ...: 
   ...: print(df.groupby("country")["gender"].value_counts())
   ...: df["gender"] = df["gender"].astype("category")
   ...: print("*" * 60)
   ...: print(df.groupby("country")["gender"].value_counts())
country  gender
US       female    1
Name: gender, dtype: int64
************************************************************
country  gender    <-------------------------------------------
US       female    1
Name: gender, dtype: int64

However, that would reintroduce the problem that PR #38796 solved.

I will investigate further solutions.
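
For readers following along, the fallback branch discussed above can be approximated from user code. The sketch below assumes that apply_series_value_counts amounts to running Series.value_counts on each group through apply; that is an assumption about the internals, not something confirmed in this thread. To my understanding, Series.value_counts on pandas 1.x returns a result with an unnamed index, which would explain the missing level name, whereas pandas 2.0 and later name that index after the original Series. The printed names therefore depend on the installed version, so no expected output is given.

```python
import pandas as pd

df = pd.DataFrame({"gender": ["female"], "country": ["US"]})
df["gender"] = df["gender"].astype("category")

grouped = df.groupby("country")["gender"]

# Assumed equivalent of the fallback path: call Series.value_counts
# separately on each group and let apply() stitch the pieces together.
per_group = grouped.apply(pd.Series.value_counts)

# Direct call, for comparison with the per-group emulation.
direct = grouped.value_counts()

print("per-group apply:", list(per_group.index.names))
print("direct call:    ", list(direct.index.names))
```

If the two lines differ on a given installation, the per-group path is the one losing the name.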

NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 25, 2022
@rhshadrach rhshadrach added this to the 1.5 milestone Feb 5, 2022
@rhshadrach rhshadrach added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Feb 5, 2022
mroeschke pushed a commit that referenced this issue Feb 5, 2022
…olumns (#45625)

* BUG: SeriesGroupBy.value_counts index name missing

Issue  #44324

* TST: Change test to correct categorical naming

Value counts tend to preserve index names #45625
Change test test_sorting_with_different_categoricals to comply
to this change

* REF: Refactor conditionals in value_counts()

* RFT: correct mistake introduced via RFT

In line with 44324

* RFT: Change variable names and comment #38672

* BUG: Update conditional to is None to consider series
phofl pushed a commit to phofl/pandas that referenced this issue Feb 14, 2022
…olumns (pandas-dev#45625)

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this issue Jul 13, 2022
…olumns (pandas-dev#45625)
