PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 #21699
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -83,7 +83,11 @@ class CategoricalIndex(Index, accessor.PandasDelegate): | |
""" | ||
|
||
_typ = 'categoricalindex' | ||
_engine_type = libindex.Int64Engine | ||
|
||
@property | ||
def _engine_type(self): | ||
type_name = self.codes.dtype.name.capitalize() | ||
Comment: can you add a comment here
||
return getattr(libindex, "{}Engine".format(type_name)) | ||
_attributes = ['name'] | ||
|
||
def __new__(cls, data=None, categories=None, ordered=None, dtype=None, | ||
|
@@ -377,7 +381,7 @@ def argsort(self, *args, **kwargs): | |
def _engine(self): | ||
|
||
# we are going to look things up with the codes themselves | ||
return self._engine_type(lambda: self.codes.astype('i8'), len(self)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think you need any changes in cython if you simply so: its still not as nice as actually using type specific hashtables though (which this PR is not addressing) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I tried changing that. Still is slow, 14 ms. In index_class_helper.pxi.in, I also tried changing cdef _get_index_values(self):
return algos.ensure_{{dtype}}(self.vgetter()) to cdef _get_index_values(self):
return self.vgetter() But also slow, 14 ms. I agree that type specific hash tables would be nicer, but I've tried and I failed making it work. If someone could contribute those, I could change this PR to use type specific hash tables. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. then you have something else going on There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The numpy docs say about the
So, calling astype(..., copy=False) will only avoid returning a copy when the dtype of codes is int64, i.e. in practice never for Categoricals. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so try using |
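The point about `astype(..., copy=False)` in the thread above can be checked with a small sketch. The `codes` array here is a made-up stand-in for a Categorical's int8 codes, not data from the PR; only documented NumPy calls are used.

```python
import numpy as np

# Hypothetical stand-in for the int8 codes of a small Categorical.
codes = np.array([0, 1, 2, 1, 0], dtype=np.int8)

# Same dtype requested: copy=False lets NumPy hand back the input array itself.
same = codes.astype("int8", copy=False)

# Wider dtype requested: a new int64 buffer has to be allocated,
# so copy=False cannot avoid the copy.
widened = codes.astype("i8", copy=False)

print(same is codes)                     # no copy was made
print(np.shares_memory(widened, codes))  # the int64 result has its own memory
print(codes.nbytes, widened.nbytes)      # the int64 copy is 8x larger
```

In other words, `copy=False` only helps when the codes are already int64, which matches the observation that the cast stays expensive for small-dtype codes.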
||
return self._engine_type(lambda: self.codes, len(self)) | ||
|
||
# introspection | ||
@cache_readonly | ||
|
Comment: add for other UInt types (8, 16, 32) as well
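As a side note on the name-based lookup in `_engine_type`, here is a minimal sketch of how the engine name is derived from a dtype name. `fake_libindex` is a hypothetical stand-in namespace, not the real `libindex` module, and the `UInt64Engine` spelling for unsigned engines is an assumption. Categorical codes are signed integers, so the unsigned case is shown only to illustrate what `str.capitalize()` does with such names.

```python
from types import SimpleNamespace

# Hypothetical stand-in for libindex: attribute names mimic engine classes,
# values are plain strings so the sketch runs without pandas.
fake_libindex = SimpleNamespace(
    Int8Engine="Int8Engine",
    Int16Engine="Int16Engine",
    Int32Engine="Int32Engine",
    Int64Engine="Int64Engine",
    UInt64Engine="UInt64Engine",  # assumed spelling for an unsigned engine
)


def engine_name(dtype_name):
    # Mirrors the diff: "{}Engine".format(type_name) with a capitalized dtype name.
    return "{}Engine".format(dtype_name.capitalize())


def lookup(dtype_name):
    # Returns None instead of raising when no matching attribute exists.
    return getattr(fake_libindex, engine_name(dtype_name), None)


print(engine_name("int8"))   # signed names map cleanly
print(engine_name("uint8"))  # capitalize() lowercases everything after the first letter
print(lookup("int64"))
print(lookup("uint64"))      # no match under the assumed "UInt64Engine" spelling
```

If unsigned engines were ever looked up this way, an explicit mapping from dtype to engine class would sidestep the capitalization mismatch.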