PERF: Add __contains__ to CategoricalIndex #21369

topper-123 · 2018-06-07T21:38:54Z

progress towards PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395
xref PERF: __contains__ method for Categorical #21022
tests added / passed
benchmark added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Currently, membership checks in CategoricalIndex are very slow, as explained in #21022. This PR fixes the issue for CategoricalIndex, while #21022 contains the fix for Categorical. The difference between the two cases is the use of _engine for CategoricalIndex, which makes this even faster than the Categorical solution in #21022.

Tests exist already and can be found in tests/indexes/test_category.py::TestCategoricalIndex::test_contains.

ASV:

      before           after         ratio
     [0c65c57a]       [986779ab]
-      2.49±0.2ms       3.26±0.2μs     0.00  categoricals.Contains.time_contains

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
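
For readers who want a feel for where a gap of this size can come from, here is a rough, pandas-free sketch. It is not the ASV benchmark above and does not touch CategoricalIndex; it only assumes that the slow path behaves like a scan over the values and the fast path like a hash lookup. The names `values`, `lookup`, `scan` and `hashed` are made up for this illustration.

```python
import timeit

# Toy data: membership of the last element is the worst case for a scan.
values = list(range(10_000))
lookup = set(values)          # hash-based container built once up front
key = values[-1]


def scan():
    # Linear scan: compares key against each element in turn.
    return key in values


def hashed():
    # Hash lookup: cost does not grow with the number of elements.
    return key in lookup


# Both strategies must agree on the answer; only the cost differs.
assert scan() == hashed()

scan_time = timeit.timeit(scan, number=100)
hash_time = timeit.timeit(hashed, number=100)
print(f"scan: {scan_time:.6f}s  hashed: {hash_time:.6f}s")
```

The absolute numbers depend on the machine, so none are quoted here.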

codecov · 2018-06-07T22:40:32Z

Codecov Report

Merging #21369 into master will increase coverage by <.01%.
The diff coverage is 90%.

@@            Coverage Diff             @@
##           master   #21369      +/-   ##
==========================================
+ Coverage    91.9%    91.9%   +<.01%     
==========================================
  Files         153      153              
  Lines       49606    49610       +4     
==========================================
+ Hits        45589    45593       +4     
  Misses       4017     4017

Flag	Coverage Δ
#multiple	`90.3% <90%> (ø)`	⬆️
#single	`41.89% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/category.py	`97.09% <90%> (+0.03%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 576d5c6...ccfba1b.

gfyoung · 2018-06-08T04:01:07Z

pandas/core/indexes/category.py

@@ -324,20 +324,19 @@ def _reverse_indexer(self):
    @Appender(_index_shared_docs['__contains__'] % _index_doc_kwargs)
    def __contains__(self, key):
        hash(key)


This might be a really silly question, but what does this line do?

It just ensures that mutables are not passed into the function....

I think it should probably be removed, but that line is in various other places as well, so maybe a separate PR that goes through all similar cases? Or I can just remove it.

Well, you removed it in another part of the diff, hence why I'm asking. That being said, I like your suggestion. Let's investigate for another time then, in which case I would put back the other one that you deleted.
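
As a small illustration of what a bare `hash(key)` call does in plain Python: it raises `TypeError` for unhashable objects such as lists and dicts, so they are rejected before any lookup happens. The helper `guarded_contains` below is hypothetical and is not pandas code; it only mimics the shape of the guard discussed in this thread.

```python
def guarded_contains(values, key):
    # Hypothetical helper, not part of pandas.
    # hash() raises TypeError for unhashable keys (list, dict, set, ...).
    hash(key)
    return key in values


values = ("a", "b", "c")

print(guarded_contains(values, "a"))   # True
print(["a"] in values)                 # False: without the guard, no error

try:
    guarded_contains(values, ["a"])
except TypeError as exc:
    # With the guard, the unhashable key is rejected up front.
    print("rejected:", exc)
```

So the line changes behaviour only for unhashable keys: instead of a quiet False, the caller gets an exception.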

gfyoung · 2018-06-08T04:03:11Z

pandas/core/indexes/category.py

-        if self.categories._defer_to_indexing:
-            return key in self.categories
-
-        return key in self.values


Remind me: why do we NOT need to check membership in self.values anymore?

For indices, their indexing engine (i.e. ._engine) has a __contains__ method which does the same thing but is faster (does caching etc. probably, haven't looked into the details of the code).

Awesome, thanks for clarifying!
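
A toy sketch of the general idea mentioned above (a lookup structure that is built once and then reused). This is an assumption about the concept only; `ToyEngine` is a made-up class and says nothing about how the real `_engine` is implemented.

```python
class ToyEngine:
    """Hypothetical stand-in for an index engine; not pandas' implementation."""

    def __init__(self, values):
        self._values = list(values)
        self._seen = None   # hash-based lookup, built lazily
        self.builds = 0     # counts how often the lookup was built

    def _ensure_lookup(self):
        # Build the set on first use, then reuse the cached copy.
        if self._seen is None:
            self._seen = set(self._values)
            self.builds += 1
        return self._seen

    def __contains__(self, key):
        return key in self._ensure_lookup()


engine = ToyEngine([0, 1, 1, 2])
print(1 in engine)      # True
print(3 in engine)      # False
print(engine.builds)    # 1
```

The first membership test pays for building the set; later tests only pay for a hash lookup.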

gfyoung

Nice!

cc @jreback

jreback · 2018-06-13T11:14:11Z

pandas/core/indexes/category.py

+            return False
+        if is_scalar(loc):
+            return loc in self._engine
+        else:  # if self.categories is IntervalIndex, loc is an array


can you put a blank line between things, e.g.

    if isna(...):
        ....

    try:
        ....
    except:
        ....

    if is_scalar(...):
        ...

    # no else needed here
    return ...

also can you put comments before each case (not everything needs a comment), but I find this hard to grok in its current form.
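
To make the requested layout concrete, here is a minimal, pandas-free sketch of a `__contains__` written in that style: one commented case per block, separated by blank lines, with no trailing `else`. `ToyCategoricalIndex` and its attributes are hypothetical and are not the code in this PR; the toy assumes scalar categories only (no IntervalIndex case) and uses -1 as the code for a missing value.

```python
import math


class ToyCategoricalIndex:
    """Hypothetical toy model; not pandas' CategoricalIndex."""

    def __init__(self, values, categories):
        self.categories = list(categories)
        self._cat_pos = {c: i for i, c in enumerate(self.categories)}
        # Values that are missing or not among the categories get code -1.
        self.codes = [self._cat_pos.get(v, -1) for v in values]
        self._code_set = set(self.codes)

    def __contains__(self, key):
        hash(key)

        # Missing values are present only if some position holds code -1.
        if key is None or (isinstance(key, float) and math.isnan(key)):
            return -1 in self._code_set

        # Translate the key into its category position.
        try:
            loc = self._cat_pos[key]
        except KeyError:
            return False

        # A category can be declared but unused, so check the codes as well.
        return loc in self._code_set


idx = ToyCategoricalIndex(["a", "b", "a"], categories=["a", "b", "c"])
print("a" in idx)   # True
print("c" in idx)   # False: declared category, but no value uses it
print("z" in idx)   # False: not a category at all
```

Each early return handles exactly one case, which is what makes the final `return` safe without an `else`.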

topper-123 · 2018-06-13T22:07:05Z

I've updated the PR.

I've set it to be part of 0.23.2, if that's alright.

jreback · 2018-06-14T10:38:27Z

thanks @topper-123

jorisvandenbossche · 2018-06-14T18:10:33Z

This is changing the implementation quite substantially, so let's move this to 0.24.0.txt?

jorisvandenbossche · 2018-06-26T21:31:57Z

Any comments on my comment above about keeping this for 0.24.0 ?

topper-123 · 2018-06-27T00:42:22Z

Quality-wise this is OK to go into 0.23.2 IMO: the PRs are really not that complex, it's much faster, and it doesn't change any APIs.

Also, my main motivation for writing this was speeding up slicing dataframes with a CategoricalIndex (see #20395), which previously was very slow (still is slow, but better than before, and now faster than fancy indexing, at least). I think a lot of people will appreciate this speedup.

jreback · 2018-06-27T00:48:33Z

we tagged for 0.23.2 (and note is there)
it would be slightly tricky to change as there is another related change - both in 0.23.2 or 0.24
either way is fine

jorisvandenbossche · 2018-06-27T11:52:51Z

The main reason that I am raising this is that __contains__ checking has quite some implications (which is of course also the reason you sped it up!), and I think it is rather easy to miss a small exotic corner case where the new implementation might differ.
To be clear, @topper-123, I am not questioning the quality of this PR! I just know from experience, also for us, that it is easy to miss an unintended API change (which we might not even decide to fix if it is debatable behaviour, but then that is still better left for 0.24.0). Since this is only about a performance improvement (and not a regression), I would play it safe and give it more time in 0.24.0.

it would be slightly tricky to change as there is another related change - both in 0.23.2 or 0.24

Yep, it would be both in 0.23.2 or both in 0.24.0 (the other I actually tagged first as 0.24.0, but that was changed before merging)

topper-123 · 2018-06-27T20:48:10Z

Ok, your call, I won't object.

jorisvandenbossche · 2018-07-02T23:25:25Z

OK, left it for 0.24 (moved the whatsnew already to v0.24.0.txt)

topper-123 force-pushed the categorical_contains branch from 19ba6ec to 30dc1a3 Compare June 7, 2018 21:39

topper-123 mentioned this pull request Jun 7, 2018

PERF: __contains__ method for Categorical #21022

Closed

topper-123 force-pushed the categorical_contains branch 2 times, most recently from 7878db0 to 35a68d7 Compare June 7, 2018 22:40

gfyoung added the Performance, Categorical, and Indexing labels Jun 8, 2018

gfyoung reviewed Jun 8, 2018

topper-123 force-pushed the categorical_contains branch 6 times, most recently from 0fea48b to 5c18b8a Compare June 9, 2018 14:58

gfyoung approved these changes Jun 9, 2018

jreback requested changes Jun 13, 2018

topper-123 force-pushed the categorical_contains branch 2 times, most recently from 6ed9dd6 to cfb8f6d Compare June 13, 2018 16:42

Add __contains__ to CategoricalIndex

f856075

topper-123 force-pushed the categorical_contains branch 3 times, most recently from e616770 to 2a87671 Compare June 13, 2018 17:53

make CategoricalIndex.__contains__ compatible with np<1.13

ccfba1b

topper-123 force-pushed the categorical_contains branch from 2a87671 to ccfba1b Compare June 13, 2018 18:00

jreback added this to the 0.23.2 milestone Jun 14, 2018

jreback approved these changes Jun 14, 2018

jreback merged commit bf1c3dc into pandas-dev:master Jun 14, 2018

topper-123 mentioned this pull request Jun 15, 2018

PERF: improve speed of nans in CategoricalIndex #21493

Merged

jreback added the Needs Backport label Jun 15, 2018

This was referenced Jun 15, 2018

PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395

Closed

PERF: avoid unnecessary recoding in CategoricalIndex._create_categorical #21506

Closed

PERF: add method Categorical.__contains__ #21508

Merged

topper-123 deleted the categorical_contains branch June 17, 2018 09:21

david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018

PERF: Add __contains__ to CategoricalIndex (pandas-dev#21369)

19b3598

jorisvandenbossche removed the Needs Backport label Jul 2, 2018

jorisvandenbossche modified the milestones: 0.23.2, 0.24.0 Jul 2, 2018

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

PERF: Add __contains__ to CategoricalIndex (pandas-dev#21369)

247e0f1
