Skip to content

PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 #21699

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
add Int8Index
  • Loading branch information
tp authored and topper-123 committed Sep 23, 2018
commit 91ee55dbb0bbfd598d97c50cd458681d5703816c
8 changes: 6 additions & 2 deletions pandas/_libs/index_class_helper.pxi.in
Original file line number Diff line number Diff line change
Expand Up @@ -12,9 +12,13 @@ WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in

# name, dtype, ctype
dtypes = [('Float64', 'float64', 'float64_t'),
('UInt64', 'uint64', 'uint64_t'),
('Int64', 'int64', 'int64_t'),
('Object', 'object', 'object')]
('Int32', 'int32', 'int32_t'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add for other UInt types (8,16,32) as well

('Int16', 'int16', 'int16_t'),
('Int8', 'int8', 'int8_t'),
('UInt64', 'uint64', 'uint64_t'),
('Object', 'object', 'object'),
]
}}

{{for name, dtype, ctype in dtypes}}
Expand Down
8 changes: 6 additions & 2 deletions pandas/core/indexes/category.py
Original file line number Diff line number Diff line change
Expand Up @@ -83,7 +83,11 @@ class CategoricalIndex(Index, accessor.PandasDelegate):
"""

_typ = 'categoricalindex'
_engine_type = libindex.Int64Engine

@property
def _engine_type(self):
type_name = self.codes.dtype.name.capitalize()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here

return getattr(libindex, "{}Engine".format(type_name))
_attributes = ['name']

def __new__(cls, data=None, categories=None, ordered=None, dtype=None,
Expand Down Expand Up @@ -377,7 +381,7 @@ def argsort(self, *args, **kwargs):
def _engine(self):

# we are going to look things up with the codes themselves
return self._engine_type(lambda: self.codes.astype('i8'), len(self))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need any changes in cython if you simply so: .astype('i8', copy=False), which is the cause of the perf issue, no?

its still not as nice as actually using type specific hashtables though (which this PR is not addressing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried changing that. Still is slow, 14 ms.

In index_class_helper.pxi.in, I also tried changing

    cdef _get_index_values(self):
        return algos.ensure_{{dtype}}(self.vgetter())

to

    cdef _get_index_values(self):
        return self.vgetter()

But also slow, 14 ms.

I agree that type specific hash tables would be nicer, but I've tried and I failed making it work. If someone could contribute those, I could change this PR to use type specific hash tables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then you have something else going on

Copy link
Contributor Author

@topper-123 topper-123 Aug 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numpy docs say about the copy parameter of array.astype:

By default, astype always returns a newly allocated array. If this is set
to false, and the dtype, order, and subok requirements are satisfied,
the input array is returned instead of a copy.

So, calling astype(..., copy=False) will only avoid returning a copy when the dtype of codes is int64, i.e. in practice never for Categoricals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so try using algos.ensure_int64(values) I find this extremely odd that you have to do the if inside cython.

return self._engine_type(lambda: self.codes, len(self))

# introspection
@cache_readonly
Expand Down