Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: fastpaths in is_foo_dtype checks #33400

Merged
merged 5 commits into from
Apr 17, 2020

Conversation

jbrockmendel
Copy link
Member

xref #33364, partially addresses #33368

For exposition I used the new type checking functions in parsers.pyx.

For the implementation I chose .type and .kind checks in order to avoid needing the imports from dtypes.dtypes, which has dependencies on a bunch of other parts of the code.

In [1]: import pandas as pd                                                                                                                                                                                                                                                                                                    
In [2]: from pandas.core.dtypes.common import *                                                                                                                                                                                                                                                                                
In [3]: cat = pd.Categorical([])                                       
In [4]: arr = np.arange(5)
                                                                                                                                                                         
In [5]: %timeit is_categorical_dtype(cat.dtype)                                                                                                                                                                                                                                                                                
1.57 µs ± 34.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: %timeit is_cat_dtype(cat.dtype)                                                                                                                                                                                                                                                                                        
316 ns ± 3.78 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit is_extension_array_dtype(cat.dtype)                                                                                                                                                                                                                                                                            
364 ns ± 5.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit is_ea_dtype(cat.dtype)                                                                                                                                                                                                                                                                                         
270 ns ± 7.31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [9]: %timeit is_extension_array_dtype(arr.dtype)                                                                                                                                                                                                                                                                           
759 ns ± 5.16 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [10]: %timeit is_ea_dtype(arr.dtype)                                                                                                                                                                                                                                                                                        
199 ns ± 8.77 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Apr 9, 2020
@jorisvandenbossche
Copy link
Member

Related to the discussion we had about this on the chat, to give an example of the strategy that I tried to explain:

The current is_categorical_dtype is defined as:

def is_categorical_dtype(arr_or_dtype) -> bool:
    if arr_or_dtype is None:
        return False
    return CategoricalDtype.is_dtype(arr_or_dtype)

where CategoricalDtype.is_dtype() is the catch-all / check-all slow check.

So for this case, we can put the fast check as in this PR first:

def is_categorical_dtype(arr_or_dtype) -> bool:
    if isinstance(dtype, ExtensionDtype):
        # fast check for extension dtype
        return dtype.name == "category"
    elif isinstance(dtype, np.dtype):
        # fast check for numpy dtype (always False)
        return False
    elif arr_or_dtype is None:
        return False
    # keep the slow check if the fast ones didn't return
    return CategoricalDtype.is_dtype(arr_or_dtype)

And in that way, we don't have to change every occurrence of is_categorical_dtype internally to the fast function.

It might be that categorical is a simple example, but I would think that a similar pattern can be followed for the others as well.

@jbrockmendel
Copy link
Member Author

Related to the discussion we had about this on the chat, to give an example of the strategy that I tried to explain:

Also as discussed on the chat, the advantage of the dtype-only versions is that we dont risk silently going down the slow path.

@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Apr 10, 2020
@jreback
Copy link
Contributor

jreback commented Apr 10, 2020

Related to the discussion we had about this on the chat, to give an example of the strategy that I tried to explain:

Also as discussed on the chat, the advantage of the dtype-only versions is that we dont risk silently going down the slow path.

yep i would start by using the approach outlined above.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be changed to use the approach @jorisvandenbossche outlined, e.g. not changing the names of the current is_* functions, just enhancing the impl with a fast path.

@jbrockmendel
Copy link
Member Author

I'm happy to update the existing functions to be faster, but we should also have the strict versions internally to make sure we actually use the fastpath

@jbrockmendel
Copy link
Member Author

Actually, adding the fastpath to the existing checks will actually slow them down slightly in cases where we're not passing a dtype obj

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 16, 2020

I think I'm OK with slowing down the non-dtype case.

I'm also OK with deprecating non-dtype entirely, but that's a larger task that can be saved for later. Short-term, I think we can rely on reviewers to catch places where we're passing non-dtype's to is_foo_dtype.

Edit: all of the above is contingent with actually being able to quickly do an is_foo_dtype on a dtype as a fastpath. I only checked for categorical. Hopefully it's doable for the rest.

@jreback
Copy link
Contributor

jreback commented Apr 16, 2020

I think I'm OK with slowing down the non-dtype case.

I'm also OK with deprecating non-dtype entirely, but that's a larger task that can be saved for later. Short-term, I think we can rely on reviewers to catch places where we're passing non-dtype's to is_foo_dtype.

Edit: all of the above is contingent with actually being able to quickly do an is_foo_dtype on a dtype as a fastpath. I only checked for categorical. Hopefully it's doable for the rest.

agreed here

@jbrockmendel
Copy link
Member Author

One more doomed pitch for the strict versions: they're really nice dependency-structure-wise. ATM dtypes.common depends on dtypes.dtypes, which has runtime imports and depends on a bunch of the code. With just the strict versions, we can get dtypes.common (which we import from basically everywhere) to depend only on things "above" it in the dependency structure.

I know others dont care about the dependency structure as much as I do, will update.

@jbrockmendel jbrockmendel changed the title PERF: dtype-only is_foo_dtype checks (up to 5x faster) PERF: fastpaths in is_foo_dtype checks Apr 16, 2020
@jbrockmendel
Copy link
Member Author

updated+green

@jreback jreback added this to the 1.1 milestone Apr 17, 2020
@jreback jreback merged commit 1fa0635 into pandas-dev:master Apr 17, 2020
@jbrockmendel jbrockmendel deleted the perf-dtype-checks branch April 17, 2020 03:01
CloseChoice pushed a commit to CloseChoice/pandas that referenced this pull request Apr 20, 2020
* PERF: implement dtype-only dtype checks

* remove strict versions
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
* PERF: implement dtype-only dtype checks

* remove strict versions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants