PERF: CategoricalDtype.update_dtype #59647

Merged

@mroeschke mroeschke commented Aug 28, 2024

When a CategoricalDtype whose categories are not None is passed to CategoricalDtype.update_dtype, the method unnecessarily re-validates those categories.

CategoricalDtype.update_dtype is called in constructors like Categorical.__init__ and Categorical._simple_new, which use it to set ordered=False on the passed dtype when its ordered attribute is None. A fully validated CategoricalDtype passed to update_dtype should simply be returned unchanged.
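
The short-circuit described above can be sketched with a toy class. This is only an illustration under stated assumptions, not the pandas implementation: `SketchDtype`, its `_validate` helper, and the `validations` counter are hypothetical names invented here, and the uniqueness check merely stands in for whatever validation the real constructor performs.

```python
from __future__ import annotations


class SketchDtype:
    """Hypothetical stand-in for a categorical dtype; not the pandas class."""

    validations = 0  # counts how often the costly validation runs

    def __init__(self, categories=None, ordered=None):
        if categories is not None:
            categories = self._validate(categories)
        self.categories = categories
        self.ordered = ordered

    @classmethod
    def _validate(cls, categories):
        # Stand-in for the expensive step: an O(n) uniqueness check.
        cls.validations += 1
        categories = tuple(categories)
        if len(set(categories)) != len(categories):
            raise ValueError("categories must be unique")
        return categories

    def update_dtype(self, dtype: SketchDtype) -> SketchDtype:
        # Fast path: the passed dtype is complete, so return it as-is
        # instead of constructing (and re-validating) a new object.
        if dtype.categories is not None and dtype.ordered is not None:
            return dtype
        # Slow path: fill the missing pieces from self and build a new dtype.
        categories = dtype.categories if dtype.categories is not None else self.categories
        ordered = dtype.ordered if dtype.ordered is not None else self.ordered
        return SketchDtype(categories, ordered)


cdtype = SketchDtype(categories=range(1000), ordered=True)
base_dtype = SketchDtype(ordered=False)

before = SketchDtype.validations
result = base_dtype.update_dtype(cdtype)

# The same object comes back and no extra validation was triggered.
fast_path_hit = result is cdtype and SketchDtype.validations == before
```

In this sketch the cost of the slow path grows with the number of categories, while the fast path is a pair of attribute checks, which is the kind of saving the timings in this description measure.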

In [1]: import pandas as pd

In [2]: cdtype = pd.CategoricalDtype(categories=list(range(100_000)), ordered=True)

In [3]: base_dtype = pd.CategoricalDtype(ordered=False)

In [4]: %timeit base_dtype.update_dtype(cdtype)  # main
2.5 μs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit base_dtype.update_dtype(cdtype)  # PR
865 ns ± 2.26 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@mroeschke mroeschke added the Performance (Memory or execution speed performance) and Categorical (Categorical Data Type) labels Aug 28, 2024
@mroeschke mroeschke added this to the 3.0 milestone Aug 28, 2024
@galipremsagar

Thanks for the fix, @mroeschke!

@mroeschke
Member Author

Looks like tests are passing here, so merging. Happy to follow up if needed.

@mroeschke mroeschke merged commit 85be99e into pandas-dev:main Sep 5, 2024
47 checks passed
@mroeschke mroeschke deleted the perf/categoricaldtype/update_dtype branch September 5, 2024 17:24