-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: add method Categorical.__contains__ #21508
PERF: add method Categorical.__contains__ #21508
Conversation
Hello @topper-123! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on June 20, 2018 at 10:29 Hours UTC |
1ecce70
to
05e3680
Compare
pandas/core/arrays/categorical.py
Outdated
@@ -1847,6 +1847,31 @@ def __iter__(self): | |||
"""Returns an Iterator over the values of this Categorical.""" | |||
return iter(self.get_values().tolist()) | |||
|
|||
def __contains__(self, key): | |||
"""Returns True if `key` is in this Categorical.""" | |||
hash(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn’t this very similar to what u did in CI ?
can we unify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a unified version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it looks very similiar. why cannot you just use key in self._data
for the CI
b7f49c2
to
3eb49bb
Compare
Codecov Report
@@ Coverage Diff @@
## master #21508 +/- ##
==========================================
+ Coverage 91.91% 91.92% +<.01%
==========================================
Files 153 153
Lines 49546 49572 +26
==========================================
+ Hits 45542 45568 +26
Misses 4004 4004
Continue to review full report at Codecov.
|
@@ -71,6 +71,23 @@ def test_isin_empty(empty): | |||
tm.assert_numpy_array_equal(expected, result) | |||
|
|||
|
|||
def test_contains(): | |||
|
|||
c = pd.Categorical(list('aabbca'), categories=list('cab')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is not a bug, but can we reference the original issue anyway?
assert 1 not in c | ||
|
||
c = pd.Categorical(list('aabbca') + [np.nan], categories=list('cab')) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: unnecessary newline IMO
pandas/core/arrays/categorical.py
Outdated
|
||
This is a helper method used in :method:`Categorical.__contains__` | ||
and in :class:`CategoricalIndex.__contains__`. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is not a one-liner doc-string:
- The docstring should start underneath the triple parentheses.
- The first line is a summary should only be one line
- List of parameters and their dtypes
- Return description / type
@topper-123 : A bunch of nit-picking things, but overall, this looks pretty good! |
Comments addressed. |
pandas/core/arrays/categorical.py
Outdated
Notes | ||
----- | ||
This method does not check for Nan values. Do that separately | ||
before calling this method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Couple of things:
- Notes section goes at the bottom
- The first line ("Helper for membership check...") should be underneath the triple quotes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, both done.
But I've never noticed "the first line should be underneath the triple quotes"-rule and it's not widely applied in pandas. Are you sure that's a style rule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally try to follow the numpy
style-docstrings, which have a newline as I have requested. You can have a look at their docstrings to a get a sense.
7533c6f
to
a03c92f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
cc @jreback
doc/source/whatsnew/v0.23.2.txt
Outdated
@@ -27,6 +27,9 @@ Performance Improvements | |||
- Improved performance of membership checks in :class:`CategoricalIndex` | |||
(i.e. ``x in ci``-style checks are much faster). :meth:`CategoricalIndex.contains` | |||
is likewise much faster (:issue:`21369`) | |||
- Improved performance of membership checks in :class:`Categorical` | |||
(i.e. ``x in categorical``-style checks are much faster) (:issue:`21369`) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come there is a newline here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I LGTM'ed and then realized I wanted to make one more comment 😂
a03c92f
to
b685279
Compare
Ok, won't hold you up to that approval, fixed :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appreciate it 😄 . LGTM!
cc @jreback
b685279
to
07dd41c
Compare
doc/source/whatsnew/v0.23.2.txt
Outdated
@@ -27,6 +27,8 @@ Performance Improvements | |||
- Improved performance of membership checks in :class:`CategoricalIndex` | |||
(i.e. ``x in ci``-style checks are much faster). :meth:`CategoricalIndex.contains` | |||
is likewise much faster (:issue:`21369`) | |||
- Improved performance of membership checks in :class:`Categorical` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this refernce this PR number (or just coming the what's new into 1 line either way)
pandas/core/indexes/category.py
Outdated
@@ -328,23 +327,8 @@ def __contains__(self, key): | |||
if isna(key): # if key is a NaN, check if any NaN is in self. | |||
return self.hasnans | |||
|
|||
# is key in self.categories? Then get its location. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't you just do
```return key in self._data``?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would give the same result, but _engine
gives cached results, so that makes it much faster. Indexes should return cached results when possible, while categorical
is mutable, so can't return cached values, requiring a different implementation.
pandas/core/arrays/categorical.py
Outdated
@@ -1847,6 +1847,67 @@ def __iter__(self): | |||
"""Returns an Iterator over the values of this Categorical.""" | |||
return iter(self.get_values().tolist()) | |||
|
|||
@staticmethod | |||
def _contains(key, categories, container): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems pretty convoluted, see my comment below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I find it difficult to maintain commonality between the two __contains__
methods without something similar to this... Maybe cut down on comments, now that there is a doc string? Do you have a suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a new proposal. This makes contains
a common function. Perhaps clearer as a function rather than a method that this is shared functionality?
It's not convoluted in the sense, that the index version can look in _engine
for greater speed, while the Categorical can't do that and so must look in _codes
, otherwise it's the same.
c3ff8d7
to
77386f1
Compare
pandas/core/arrays/categorical.py
Outdated
@@ -1847,6 +1847,31 @@ def __iter__(self): | |||
"""Returns an Iterator over the values of this Categorical.""" | |||
return iter(self.get_values().tolist()) | |||
|
|||
def __contains__(self, key): | |||
"""Returns True if `key` is in this Categorical.""" | |||
hash(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it looks very similiar. why cannot you just use key in self._data
for the CI
@jreback , using >>> n = 100_000
>>> ci = pd.CategoricalIndex(list('a'*n +'a' + 'b'*n + 'c'*n))
>>> %timeit ci.categories.get_loc('b') in ci._engine
1.6 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit 'b' in ci._data
183 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) So there needs to be different implementations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this ok. some comments. ping when green.
pandas/core/arrays/categorical.py
Outdated
|
||
Notes | ||
----- | ||
This method does not check for Nan values. Do that separately |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nan -> NaN.
pandas/core/arrays/categorical.py
Outdated
@@ -1847,6 +1896,15 @@ def __iter__(self): | |||
"""Returns an Iterator over the values of this Categorical.""" | |||
return iter(self.get_values().tolist()) | |||
|
|||
def __contains__(self, key): | |||
"""Returns True if `key` is in this Categorical.""" | |||
hash(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your hash check needs to go into contains itself (null check is ok where it is).
can you add a test on something that isn't hashable as well
pandas/core/arrays/categorical.py
Outdated
"""Returns True if `key` is in this Categorical.""" | ||
hash(key) | ||
|
||
if isna(key): # if key is a NaN, check if any NaN is in self. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put comments on the line above
@@ -71,6 +71,22 @@ def test_isin_empty(empty): | |||
tm.assert_numpy_array_equal(expected, result) | |||
|
|||
|
|||
def test_contains(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to test_indexing, should be other checks already there, de-duplicate with those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at test_indexing, didn't find anything very similar. I moved it to test_operators, which is more similar, IMO. Is that ok?
89e92da
to
b0f12ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason you changed the private ._contains
to public contains
? (it's not needed for the implementation?)
It's just that I am not sure it is worth adding a new public method.
@jorisvandenbossche, to my understanding these modules are themselves private and it’s not necessary for functions in private modules to be marked as private - that would be superfluous. When it was a method on Categorical, it was needed to mark it private, as Categorical is a public class. |
The Travis error is an HTTP error:
so the Travis failure has nothing to do with this PR. |
@topper-123 whoops, sorry, only looked very briefly at the incremental diff and missed that it was a function and not a method. |
thanks @topper-123 |
git diff upstream/master -u -- "*.py" | flake8 --diff
Currently, membership checks in
Categorical
is very slow as explained by @fjetter in #21022. This PR fixes the issue. See also #21369 which fixed the similar issue forCategoricalIndex
.Tests didn't exist beforehand and have been added.
ASV: