Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: CategoricalDtype equality semantics aren't completely described #57259

Closed
1 task done
VladimirFokow opened this issue Feb 5, 2024 · 12 comments · Fixed by #57273
Closed
1 task done

DOC: CategoricalDtype equality semantics aren't completely described #57259

VladimirFokow opened this issue Feb 5, 2024 · 12 comments · Fixed by #57273
Assignees
Labels

Comments

@VladimirFokow
Copy link
Contributor

VladimirFokow commented Feb 5, 2024

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/user_guide/categorical.html#equality-semantics

https://github.com/pandas-dev/pandas/blob/main/pandas/core/dtypes/dtypes.py#L407

Documentation problem

Problematic statement 1:

Two instances of CategoricalDtype compare equal whenever they have the same categories and order.

Problematic statement 2:

  1. A CDT with ordered={False, None} is only equal to another CDT with
    ordered={False, None} and identical categories.

Counter-example:

>>> a = pd.Categorical(np.full(2, np.nan, dtype=object))
>>> b = pd.Categorical(np.full(2, np.nan))

>>> a, b
([NaN, NaN]
 Categories (0, object): [],
 [NaN, NaN]
 Categories (0, float64): [])

>>> a.dtype, b.dtype
(CategoricalDtype(categories=[], ordered=False, categories_dtype=object),
 CategoricalDtype(categories=[], ordered=False, categories_dtype=float64))

>>> a.dtype == b.dtype
False

As we can see, they both have ordered=False, and their categories are same.
Following the documentation, they should be equal.

Suggested fix for documentation

to have accurate and exhaustive descriptions

@VladimirFokow VladimirFokow added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 5, 2024
@rhshadrach
Copy link
Member

rhshadrach commented Feb 5, 2024

Thanks for the report.

As we can see, they both have ordered=False, and their categories are same.

Though both empty, two arrays of different dtypes are not the same.

I suppose "including their dtypes" would be okay to add.

@rhshadrach rhshadrach added Categorical Categorical Data Type good first issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 5, 2024
@luke396
Copy link
Contributor

luke396 commented Feb 6, 2024

take

@VladimirFokow
Copy link
Contributor Author

VladimirFokow commented Feb 6, 2024

thanks @rhshadrach !
I don't think this is a good first issue, because someone with an accurate and full conceptual understanding of Categorical and CategoricalDtype should help edit these descriptions.
I expand on this in PR #57273
(but on the other hand, maybe it is good - for people to figure this out and then write their findings)

@dvl-mehnaz
Copy link

Capture2
#57259

@VladimirFokow
Copy link
Contributor Author

VladimirFokow commented Feb 6, 2024

@Mehnaz02
dtype in b in my example is float64
(the whole point is a and b having different dtypes in their CategoricalDtypes):

# when
a.categories.dtype
# is different from:
b.categories.dtype

(btw, in the latest pandas version a.dtype also prints the categories_dtype at the end)

@VladimirFokow
Copy link
Contributor Author

VladimirFokow commented Feb 6, 2024

I can try starting to fix it, but definitely much later..
and would need a technical check for accuracy.

@rhshadrach
Copy link
Member

rhshadrach commented Feb 7, 2024

I don't think this is a good first issue, because someone with an accurate and full conceptual understanding of Categorical and CategoricalDtype should help edit these descriptions.

I disagree with this assessment. I believe one just needs to modify the documentation to state that the categories must be the equal as indexes, and that in particular includes their dtype. Am I missing something?

In addition, all PRs are reviewed, and so help can be given as part of that review.

@luke396
Copy link
Contributor

luke396 commented Feb 7, 2024

all PRs are reviewed, and so help can be given as part of that review.

As a regular contributor with little experience in the pandas community, I've observed that the label 'good first issue' extends beyond mere simplicity in solving a problem. Typically, it's assigned to documentation-related tasks, serving as a helpful starting point for newcomers eager to engage with the community.

However, this doesn't diminish the significance of documentation enhancements in the community's eyes. Every pull request, regardless of its nature, undergoes review and suggestions. Nevertheless, complex documentation improvements might demand more effort from reviewers, particularly as they often originate from contributors unfamiliar with the pandas community.

I believe one just needs to modify the documentation to state that the categories must be the equal as indexes, and that in particular includes their dtype.

In issues #57273 and #57281, I believe @VladimirFokow intended to convey that not only this specific aspect of the documentation, but the documentation as a whole, seems somewhat perplexing to him.

@VladimirFokow
Copy link
Contributor Author

VladimirFokow commented Feb 7, 2024

thank you so much @luke396 !!
Okay, help on the review sounds good👍

@eaedk
Copy link

eaedk commented Feb 7, 2024

Hey, I hope you're doing well team.
Looking for an interesting issue to contribute, any reco please?

@VladimirFokow
Copy link
Contributor Author

Hello @eaedk !
How about this one?

@eaedk
Copy link

eaedk commented Feb 11, 2024

Hello @VladimirFokow, oh I thought it was solved.
I would like to, of course I will ask questions for guidance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants