ENH: find categorical code against categorical label/value #48766

Open
1 of 3 tasks
stevenlis opened this issue Sep 25, 2022 · 12 comments
Labels: Categorical (Categorical Data Type), Enhancement, Indexing (Related to indexing on series/frames, not to indexes themselves)


@stevenlis

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could look up the underlying code for a given value of a categorical column directly, without filtering the dataframe and then calling cat.codes.

Assume I have the following dataframe:

import pandas as pd
from pandas.api.types import CategoricalDtype

data = {
    'quarter': ['2019Q4', '2020Q1', '2020Q2', '2020Q3'],
    'num': [12, 23, 34, 67]
}
df = pd.DataFrame(data=data)

cat = CategoricalDtype(categories=data['quarter'], ordered=True)
df.quarter = df.quarter.astype(cat)

I need to select all the rows after 2020Q2. To do so, I first have to find the underlying code of the value/label 2020Q2, but the only way I can do that is to filter the dataframe on that value, call cat.codes on the result, and then take the first element of the returned array. This is a little tedious.

c = df[df.quarter == '2020Q2'].quarter.cat.codes.values[0]
df[df.quarter.cat.codes > c]
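
For reference, a category's code is its position in the categories Index, so the same lookup can be sketched with the existing Index.get_loc method instead of filtering the dataframe first. This is only a minimal sketch on the example data above, not a proposed API; the variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

quarters = ['2019Q4', '2020Q1', '2020Q2', '2020Q3']
df = pd.DataFrame({'quarter': quarters, 'num': [12, 23, 34, 67]})
df['quarter'] = df['quarter'].astype(
    CategoricalDtype(categories=quarters, ordered=True)
)

# The code of a category equals its position in .cat.categories,
# so Index.get_loc gives the code without touching the rows.
code = df['quarter'].cat.categories.get_loc('2020Q2')

# Same selection as above: rows whose code is greater than that of 2020Q2.
after = df[df['quarter'].cat.codes > code]
```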

Feature Description

Right now, df.quarter.dtype.categories only returns the categories, as an Index:

Index(['2019Q4', '2020Q1', '2020Q2', '2020Q3'], dtype='object')

It would be great if there were an attribute that returned a mapping of categories to codes as a dictionary, so that users could look up a code by using the category as the dict key.
For example,

df.quarter.dtype.cats_codes

returns

{'2019Q4': 0, '2020Q1': 1, '2020Q2': 2, '2020Q3': 3}

Alternative Solutions

Alternatively, there could be a get_cat_code() function in the pandas API that takes a category and returns its underlying code, such as get_cat_code(cat='2020Q2').
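
Both ideas can be approximated today in user code. The sketch below is an assumption about what the request means: cats_codes and get_cat_code are the hypothetical names from this issue, not existing pandas API, and the helper here takes the dtype as an explicit first argument. Note that pandas numbers codes from 0.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = CategoricalDtype(
    categories=['2019Q4', '2020Q1', '2020Q2', '2020Q3'], ordered=True
)

# Hypothetical stand-in for the requested attribute: category -> code.
cats_codes = {category: code for code, category in enumerate(cat.categories)}


def get_cat_code(dtype, category):
    """Hypothetical helper: return the code of `category` in `dtype`.

    Raises KeyError if `category` is not one of the dtype's categories.
    """
    return int(dtype.categories.get_loc(category))
```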

Additional Context

No response

@stevenlis stevenlis added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 25, 2022
@jreback
Contributor

jreback commented Sep 25, 2022

why? what are you actually trying to do

the codes are an implementation detail

@stevenlis
Author

@jreback As I explained, to select rows above or below a certain code when you have an ordered categorical column.

@jreback
Contributor

jreback commented Sep 25, 2022

these labels should already respond to the full suite of comparators, e.g.

df[df.ordered_cat > 'value1'] should select the values that are greater in code space
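
Applied to the quarter example from the issue description, that comparison can be sketched as follows. It relies on the column being an ordered categorical and on the scalar being one of its categories.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

quarters = ['2019Q4', '2020Q1', '2020Q2', '2020Q3']
df = pd.DataFrame({'quarter': quarters, 'num': [12, 23, 34, 67]})
df['quarter'] = df['quarter'].astype(
    CategoricalDtype(categories=quarters, ordered=True)
)

# Comparison follows the declared category order, not string order,
# so no explicit code lookup is needed.
after = df[df['quarter'] > '2020Q2']
from_q2 = df[df['quarter'] >= '2020Q2']
```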

@stevenlis
Author

Indeed, you can do a semantic selection with a categorical, but a code lookup might still be helpful when you need, let's say, the value 3 quarters after a given one.
You could simply add 3 to a code. Right now, as far as I know, to do that you have to index the list of categories with .dtype.categories.tolist().index(value1) + 3 and then find the value/item in that list.
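
A minimal sketch of that "3 quarters after" lookup, using the example data from the issue description and existing Index methods only; the variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

quarters = ['2019Q4', '2020Q1', '2020Q2', '2020Q3']
df = pd.DataFrame({'quarter': quarters, 'num': [12, 23, 34, 67]})
df['quarter'] = df['quarter'].astype(
    CategoricalDtype(categories=quarters, ordered=True)
)

cats = df['quarter'].cat.categories

# Position (== code) of the starting category, shifted by 3.
pos = cats.get_loc('2019Q4')
three_later = cats[pos + 3]  # raises IndexError past the last category

# The shifted label can then be used in an ordinary semantic selection.
selected = df[df['quarter'] >= three_later]
```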

@jreback
Contributor

jreback commented Sep 25, 2022

again, these are an implementation detail - you can use them, but -1 on adding API beyond what already exists

the semantic selections are pretty useful here; it's not clear why you cannot simply use these

@stevenlis
Author

It does not exist... The codes have more use cases than just being an implementation detail. For example, if you need to fit a regression model, you can simply use cat.codes to make your input numerical instead of string. It would be helpful to be able to see what the code is for each value in that variable. Right now, there is no easy way to know how each value is coded other than cat.codes, which is a Series accessor, so you have to filter your entire dataframe to use it.
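
To illustrate the numeric-input use case, here is a small sketch with made-up data (the Series below is not from the issue). It shows cat.codes as the numeric column and a decoding dict built from the categories; one caveat worth knowing is that missing values get the code -1.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

quarters = ['2019Q4', '2020Q1', '2020Q2', '2020Q3']
dtype = CategoricalDtype(categories=quarters, ordered=True)

# Illustrative data, including one missing value.
s = pd.Series(['2020Q1', None, '2019Q4', '2020Q1'], dtype=dtype)

# Integer codes usable as a numeric feature; missing values become -1.
codes = s.cat.codes

# code -> category, to interpret the numeric column afterwards.
decode = dict(enumerate(s.cat.categories))
```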

@WillAyd
Member

WillAyd commented Oct 3, 2022

I think this could also be useful when you want to maintain a CategoricalDtype for roundtripping some IO formats. With SQL as an example, the CategoricalDtype does the "right thing" when you just build a dataframe and write it, but if you want to issue a WHERE clause on read that filters to only a subset of your dtype's categories, it becomes difficult to get access to those codes.

I could see it being useful for CategoricalDtype to behave more like an Enum in this instance

@WillAyd
Member

WillAyd commented Oct 6, 2022

@StevenLi-DS to clarify this is what I have in mind:

enum.Enum("AnEnum", cat)

Currently this yields TypeError: 'CategoricalDtype' object is not iterable, but it might be a natural Pythonic way to get what you are ultimately after, without requiring pandas to grow a larger API footprint. Would you be interested in exploring that more?
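
Until something like that is supported, a similar Enum can be sketched by passing the categories explicitly through the standard-library functional API. This is a workaround under assumptions, not the behavior proposed above: QuarterCode is a made-up name, and the values are set to the zero-based category positions so that they match the pandas codes.

```python
import enum

from pandas.api.types import CategoricalDtype

cat = CategoricalDtype(
    categories=['2019Q4', '2020Q1', '2020Q2', '2020Q3'], ordered=True
)

# Build name -> code explicitly; CategoricalDtype itself is not iterable.
QuarterCode = enum.IntEnum(
    'QuarterCode',
    {name: code for code, name in enumerate(cat.categories)},
)

# Names such as '2020Q2' are not valid identifiers, so members are
# looked up by item access rather than attribute access.
code = QuarterCode['2020Q2'].value
label = QuarterCode(3).name
```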

@stevenlis
Author

Thank you @WillAyd. I'm not familiar with enum, but I think returning a dict would give us more usability and flexibility.

@JgLemos

JgLemos commented Oct 27, 2022

Hello, is this issue available to take?

@WillAyd
Member

WillAyd commented Oct 28, 2022

@JgLemos sure still open

@JgLemos

JgLemos commented Oct 31, 2022

take

@simonjayhawkins simonjayhawkins added the Categorical Categorical Data Type label Feb 6, 2024
@mroeschke mroeschke added Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 16, 2024