ENH: show category[dtype] #57281

Closed
1 of 3 tasks
VladimirFokow opened this issue Feb 6, 2024 · 1 comment
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@VladimirFokow
Contributor

VladimirFokow commented Feb 6, 2024

(in process of writing it) please comment if it's a good / bad idea

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Would it be useful to also show the dtype of the categories, for example like this?

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': pd.Categorical([1, 2, 3]),
    'b': pd.Categorical([1, 2, np.nan]),
    'c': pd.Categorical(np.array([1, 2, np.nan])),
    'd': pd.Categorical(['a', 'b', 'c'])
})
print(df.dtypes)

Currently:

a    category
b    category
c    category
d    category
dtype: object

Proposed:

a      category<int64>
b      category<int64>
c    category<float64>
d     category<object>
dtype: object

or whichever alternative looks best, e.g.:

category<int64>
category[int64]
category(int64)
category{int64}

<int64>category
[int64]category
(int64)category
{int64}category

Advantages of the proposed way:

- Differentiate CategoricalDtypes whose categories have different dtypes when viewed in df.dtypes

Internally, each of these categoricals has a different dtype, so category can be thought of as a "meta-dtype": while two int64 columns always share the same dtype, two category columns can have different category dtypes.

In reality, the dtype "category" means: one of many possible CategoricalDtype objects.

While this enhancement wouldn't completely differentiate them, it would do it at least in some cases (when the dtypes of their categories are different).
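
A minimal sketch of the "meta-dtype" point, assuming a recent pandas 2.x (the variable names are made up for illustration). Both columns report themselves as "category", yet their dtypes are distinct CategoricalDtype objects:

```python
import numpy as np
import pandas as pd

# Same values, once built from a list and once from a NumPy float array
from_list = pd.Series(pd.Categorical([1, 2, np.nan]))
from_array = pd.Series(pd.Categorical(np.array([1, 2, np.nan])))

# Both dtypes compare equal to the string "category" ...
print(from_list.dtype == "category", from_array.dtype == "category")

# ... but the two CategoricalDtype objects are compared with each other here
print(from_list.dtype == from_array.dtype)

# The difference only shows up on the categories themselves
print(from_list.cat.categories.dtype, from_array.cat.categories.dtype)
```

With the proposed display, this difference would already be visible in df.dtypes.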


- When merging two categorical columns: more convenience, fewer opportunities for mistakes

For example, when merging on a categorical column, this can be a useful hint.

Helpful reading: medium article, section "Merging two categorical columns". In summary:

Joining on columns -> produces

cat1 cat2 -> obj
cat1 cat1 -> cat1

While showing the dtypes alone wouldn't completely solve this, it would improve the user experience: the internal dtypes would be visible directly in df.dtypes, instead of requiring df['column_name'].dtype (where the dtype appears in parentheses), df['column_name'].cat.categories.dtype, or df['column_name'].dtype.categories.dtype.
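
A small sketch of the merge behavior summarized above, assuming a recent pandas version (the frames and column names are hypothetical). It only checks whether the key column stays categorical, because the exact fallback dtype depends on the dtype of the categories and on the pandas version:

```python
import pandas as pd

left = pd.DataFrame({"key": pd.Categorical(["x", "y"]), "l": [1, 2]})
right_same = pd.DataFrame({"key": pd.Categorical(["x", "y"]), "r": [3, 4]})
right_other = pd.DataFrame({"key": pd.Categorical(["x", "z"]), "r": [5, 6]})

# Both key columns look identical in .dtypes: "category"
print(left.dtypes["key"], right_other.dtypes["key"])

same = left.merge(right_same, on="key")    # identical categories on both sides
other = left.merge(right_other, on="key")  # categories differ

print(same["key"].dtype)   # still a categorical dtype
print(other["key"].dtype)  # falls back to a non-categorical dtype
```

Nothing in df.dtypes warns about the fallback in the second merge, which is the kind of surprise the proposal tries to reduce (at least when the dtypes of the categories differ).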


TODO: read that article - any other advantages / quality-of-life improvements of this proposal?


- Conversion from categorical column to normal - unclear reason for error

# Preparation of the situation:
df['b'] = df['b'].cat.add_categories('new_cat')

# The situation:
print(df)   # see that data in both `a` and `b` are integers
print(df.dtypes)  # see that the dtypes of both `a` and `b` are "category" -> want to convert
df['a'].astype(int)  # works
df['b'].astype(int)  # doesn't work, saying: `ValueError: Cannot cast object dtype to int64`

Confusing error: I wanted to cast a categorical - not an object.
(although it's quite easy to discover the reason: the output of `print(df['b'])` includes "object")

The reason: the categories inside the categorical have dtype object (although 'new_cat' is an unused category, it still prevents conversion to int). We can see this with: df['b'] or df['b'].dtype
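
A self-contained sketch of this situation, assuming a recent pandas version and using a column without missing values (the names `s`, `s_extra` and `restored` are made up). It also shows one possible workaround, dropping the unused category before casting:

```python
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3]))
s_extra = s.cat.add_categories("new_cat")  # adds an unused string category

print(s.cat.categories.dtype)        # integer categories
print(s_extra.cat.categories.dtype)  # categories are now object dtype

try:
    s_extra.astype(int)
    cast_failed = False
except ValueError as exc:
    cast_failed = True
    print(f"astype(int) failed: {exc}")

# Dropping the unused category lets the cast go through again
restored = s_extra.cat.remove_unused_categories().astype(int)
print(restored.tolist())
```

Both `s` and `s_extra` show up as plain "category" in a dtypes listing, so the listing gives no hint that one of them cannot be cast.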


- better clarity

The dtype of the categories inside a Categorical is not always what numpy would infer for the same data. Seeing these dtypes clearly means fewer surprises for the user:

pd.Categorical([3, 1, np.nan])   # categories dtype: int64
# while:
pd.Categorical(np.array([1, 2, np.nan]))  # categories dtype: float64

Drawbacks:

  • longer, not so clean dtype to read

  • can still be confusing: even in cases where category<dtype> is the same, the CategoricalDtype objects can still be different


What do you think?

What are some of the advantages or drawbacks of showing the dtypes of categories?

Is this interesting at all?

Feature Description

When a df contains a categorical column, and the user prints df.dtypes, instead of

category

show:

category[dtype]

where dtype is the categories_dtype shown when printing df['column_name'].dtype
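
Until something like this exists, the proposed display can be approximated in user code. The helper below is a hypothetical sketch (`dtypes_with_categories` is not a pandas function), and it returns plain strings rather than dtype objects:

```python
import numpy as np
import pandas as pd


def dtypes_with_categories(df: pd.DataFrame) -> pd.Series:
    """Like df.dtypes, but as strings, with the categories dtype appended."""

    def label(dtype) -> str:
        if isinstance(dtype, pd.CategoricalDtype):
            return f"category[{dtype.categories.dtype}]"
        return str(dtype)

    return pd.Series({col: label(dtype) for col, dtype in df.dtypes.items()})


df = pd.DataFrame({
    "a": pd.Categorical([1, 2, 3]),
    "c": pd.Categorical(np.array([1, 2, np.nan])),
    "x": [0.5, 1.5, 2.5],
})
result = dtypes_with_categories(df)
print(result)
```

The bracket style is just one of the alternatives listed earlier; swapping the f-string changes the format.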


Alternative Solutions (that currently exist)

df['a']  # last line of output contains the dtype: Categories (3, int64): [1, 2, 3]

Or any of these also show the internal dtype:

df['a'].dtype  # CategoricalDtype(categories=[1, 2, 3], ordered=False, categories_dtype=int64)

df['a'].cat.categories # Index([1, 2, 3], dtype='int64')
df['a'].cat.categories.dtype  # dtype('int64')

df['a'].dtype.categories # Index([1, 2, 3], dtype='int64')
df['a'].dtype.categories.dtype  # dtype('int64')

Additional Context

#48515 wanted the ability to set the cat dtype, my request is just for displaying it

#57259 wants to fix the docs for categorical dtype equality - dtype is important for their equality, so seeing it is useful (although insufficient, still useful)

@VladimirFokow VladimirFokow added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 6, 2024
@VladimirFokow
Contributor Author

VladimirFokow commented Feb 6, 2024

currently in status of writing it, can reopen when ready;

please comment: is it interesting? Should I continue working on it? Is it a good / bad idea?
