-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: Add __contains__ to CategoricalIndex #21369
Conversation
19ba6ec
to
30dc1a3
Compare
7878db0
to
35a68d7
Compare
Codecov Report
@@ Coverage Diff @@
## master #21369 +/- ##
==========================================
+ Coverage 91.9% 91.9% +<.01%
==========================================
Files 153 153
Lines 49606 49610 +4
==========================================
+ Hits 45589 45593 +4
Misses 4017 4017
Continue to review full report at Codecov.
|
pandas/core/indexes/category.py
Outdated
@@ -324,20 +324,19 @@ def _reverse_indexer(self): | |||
@Appender(_index_shared_docs['__contains__'] % _index_doc_kwargs) | |||
def __contains__(self, key): | |||
hash(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a really silly question, but what does this line do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just ensures that mutables are not passes into the function....
I think it should probably be removed, but that line is in various other places as well, so maybe a seperate PR, that goes through all similar cases? Or I can just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you removed it in another part of the diff, hence why I'm asking. That being said, I like your suggestion. Let's investigate for another time then, in which case I would put back the other one that you deleted.
pandas/core/indexes/category.py
Outdated
if self.categories._defer_to_indexing: | ||
return key in self.categories | ||
|
||
return key in self.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remind me: why do we NOT need to check membership in self.values
anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For indices, their indexing engine (i.e. ._engine
) has a __contains__
method which does the same thing but is faster (does caching etc. probably, haven't looked into the details of the code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for clarifying!
0fea48b
to
5c18b8a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
cc @jreback
pandas/core/indexes/category.py
Outdated
return False | ||
if is_scalar(loc): | ||
return loc in self._engine | ||
else: # if self.categories is IntervalIndex, loc is an array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put a blank line between things, e.g.
if isna(...):
....
try:
....
except:
....
if is_scalar(...):
...
# no else needed here
return ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also can you put comments before each case (not everythin needs a comment), but i find this hard to grok in its current form.
6ed9dd6
to
cfb8f6d
Compare
e616770
to
2a87671
Compare
2a87671
to
ccfba1b
Compare
I've updated the PR. I've set it to be part of 0.23.2, if that's alright. |
thanks @topper-123 |
This is changing the implementation quite substantially, so let's move this to 0.24.0.txt? |
Any comments on my comment above about keeping this for 0.24.0 ? |
Quility-wise this is ok to go into 23.2 IMO, the PRs are really not that complex, IMO, it's much faster and it doesn't change any APIs. Also, my main motivation for writing this was speeding up slicing dataframes with a CategoricalIndex (see #20395), which previously was very slow (still is slow, but better than before, and now faster than fancy indexing, at least). I think a lot of people will appreciate this speedup. |
we tagged for 0.23.2 (and note is there) |
The main reason that I am raising this is that
Yep, it would be both in 0.23.2 or both in 0.24.0 (the other I actually tagged first as 0.24.0, but that was changed before merging) |
Ok, your call, I won't object. |
OK, left it for 0.24 (moved the whatsnew already to v0.24.0.txt) |
git diff upstream/master -u -- "*.py" | flake8 --diff
Currently, membership checks in
CategoricalIndex
is very slow as explained in #21022. This PR fixes the issue forCategoricalIndex
, while #21022 contains the fix forCategorical
. The difference between the two cases is the use of_engine
forCategoricalIndex
, which makes this even faster than theCatagorical
solution in #21022.Tests exist already and can be found in
tests/indexes/test_category.py::TestCategoricalIndex::test_contains
.ASV: