PERF: improve MultiIndex get_loc performance #16346


Merged

Changes from 1 commit
feedback
jorisvandenbossche committed May 16, 2017
commit 664d2b35bd0ec240c4d638ccf85f21761dcf8fee
2 changes: 2 additions & 0 deletions pandas/_libs/hashtable.pxd
@@ -38,6 +38,8 @@ cdef class MultiIndexHashTable(HashTable):

cpdef get_item(self, object val)
cpdef set_item(self, object key, Py_ssize_t val)
cdef inline void _check_for_collision(self, Py_ssize_t loc, object label)


cdef class StringHashTable(HashTable):
cdef kh_str_t *table
10 changes: 7 additions & 3 deletions pandas/_libs/hashtable_class_helper.pxi.in
@@ -4,7 +4,8 @@ Template for each `dtype` helper function for hashtable
WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in
"""

from pandas.core.dtypes.missing import array_equivalent
from lib cimport is_null_datetimelike


#----------------------------------------------------------------------
# VectorData
@@ -923,12 +924,15 @@ cdef class MultiIndexHashTable(HashTable):
"hash collision\nlocs:\n{}\n"
"result:\n{}\nmi:\n{}".format(alocs, result, mi))

def _check_for_collision(self, Py_ssize_t loc, object label):
cdef inline void _check_for_collision(self, Py_ssize_t loc, object label):
# validate that the loc maps to the actual value
# version of _check_for_collisions above for single label (tuple)

result = self.mi[loc]
if not array_equivalent(result, label):

if not all(l == r or (is_null_datetimelike(l)
and is_null_datetimelike(r))
for l, r in zip(result, label)):
raise AssertionError(
"hash collision\nloc:\n{}\n"
"result:\n{}\nmi:\n{}".format(loc, result, label))
75 changes: 73 additions & 2 deletions pandas/core/util/hashing.py
@@ -5,13 +5,16 @@

import numpy as np
from pandas._libs import hashing
from pandas.compat import string_and_binary_types, text_type
from pandas.core.dtypes.generic import (
ABCMultiIndex,
ABCIndexClass,
ABCSeries,
ABCDataFrame)
from pandas.core.dtypes.common import (
is_categorical_dtype, is_list_like)
from pandas.core.dtypes.missing import isnull


# 16 byte long hashing key
_default_hash_key = '0123456789123456'
@@ -179,9 +182,17 @@ def hash_tuple(val, encoding='utf8', hash_key=None):
hash

"""
hashes = (hash_array(np.array([v]), encoding=encoding, hash_key=hash_key,
categorize=False)
#def to_array(v):
# dtype, arr = infer_dtype_from_array([v])
# return np.asarray(arr, dtype=dtype)

#hashes = (hash_array(to_array(v), encoding=encoding, hash_key=hash_key,
# categorize=False)
# for v in val)

hashes = (_hash_scalar(v, encoding=encoding, hash_key=hash_key)
for v in val)

h = _combine_hash_arrays(hashes, len(val))[0]

return h
@@ -299,3 +310,63 @@ def hash_array(vals, encoding='utf8', hash_key=None, categorize=True):
vals *= np.uint64(0x94d049bb133111eb)
vals ^= vals >> 31
return vals


def _hash_scalar(val, encoding='utf8', hash_key=None):
"""
Contributor

this is duplicating lots of code. I'm sure there is a way to do this a bit more generically.

Member Author

Yes, I know, but this was mainly to test.
And as it turns out, this makes it a lot faster than using hash_array (the commented-out part in hash_tuple), so I am not sure how to solve that. In principle I can put the common parts in helper functions (e.g. the redistributing part, as sketched below), but for most of it that is not possible, as there are slight differences.
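
For reference, the "redistributing part" mentioned here is the bit-mixing finalizer that appears at the end of both hash_array and the scalar path; a minimal sketch of how it could be shared (the helper name is hypothetical):

def _redistribute_hash(vals):
    # spread the intermediate 64-bit hashes across the full uint64
    # space (the same splitmix64-style finalizer used by hash_array)
    vals ^= vals >> 30
    vals *= np.uint64(0xbf58476d1ce4e5b9)
    vals ^= vals >> 27
    vals *= np.uint64(0x94d049bb133111eb)
    vals ^= vals >> 31
    return vals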

Contributor

what does all of this buy you? (IOW, can you post updated timings?)

maintaining a separate code path for scalars will cause future issues. these will need to be kept in sync with the array hashing if any code changes are made. you can easily share code here, which would make this more palatable.

Member Author

In [4]: %timeit pd.core.util.hashing.hash_tuple2((999, np.nan, 'E'))
380 µs ± 60.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit pd.core.util.hashing.hash_tuple((999, np.nan, 'E'))
81.8 µs ± 3.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

hash_tuple2 uses hash_array (the commented-out version in the current branch); hash_tuple uses _hash_scalar.

Contributor

ok, it's prob reasonable, but I'm willing to sacrifice some perf to get some shared code (IOW, something between old and new is prob a good compromise vs repeating lots of code)

Member Author

OK, I pushed a new version that almost fully reuses the hash_array function for the actual hashing, and only has some specific logic before that to convert the scalar to a proper array.
This reduces a lot of the code duplication, and has only a minor perf impact.

Apart from that, it still has a separate code path to first convert the scalar to an array, which might be a bit brittle, and I agree is certainly not ideal for code maintenance, but using something more general (e.g. infer_dtype) gives a big perf penalty.

Hash scalar value

Returns
-------
1d uint64 numpy array of hash value, of length 1
"""

if hash_key is None:
hash_key = _default_hash_key

if isnull(val):
# this is to be consistent with the _hash_categorical implementation
return np.array([np.iinfo(np.uint64).max], dtype='u8')

Contributor

if you need to handle datetime w/tz directly (IOW, we basically ignore it), then I would:

if getattr(val, 'tzinfo', None) is not None:
    val = val.tz_localize(None)

I suppose an option to ignore tz is fine for infer_dtype_from_scalar, but if you add it I would rename, document and test it.
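
For context: tz_localize(None) drops the timezone while keeping the wall time, so a tz-aware Timestamp would go through the same code path as a naive one (illustrative):

In [1]: pd.Timestamp("2012-01-01", tz='Europe/Brussels').tz_localize(None)
Out[1]: Timestamp('2012-01-01 00:00:00')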

Member Author

yes, I can certainly do that check here as well.

It is maybe better to keep the custom logic here, as the keyword added to infer_dtype_from_scalar would not be used anywhere else.

Contributor

yeah, I think it is better locally

if isinstance(val, string_and_binary_types + (text_type,)):
vals = np.array([val], dtype=object)
string_like = True
else:
vals = np.array([val])
string_like = False

dtype = vals.dtype

#dtype, vals = infer_dtype_from_array([vals])
#if dtype == np.object_:
# vals = np.asarray(vals, dtype='object')
# dtype = vals.dtype

# we'll be working with everything as 64-bit values, so handle this
# 128-bit value early
if np.issubdtype(dtype, np.complex128):
return hash_array(vals.real) + 23 * hash_array(vals.imag)

# First, turn whatever array this is into unsigned 64-bit ints, if we can
# manage it.
elif dtype == np.bool_:
vals = vals.astype('u8')
elif issubclass(dtype.type, (np.datetime64, np.timedelta64)):
vals = vals.view('i8').astype('u8', copy=False)
elif issubclass(dtype.type, np.number) and dtype.itemsize <= 8:
vals = vals.view('u{}'.format(vals.dtype.itemsize)).astype('u8')
else:
if not string_like:
from pandas import Index
vals = Index(vals).values
Member Author

This is also a bit of an ugly part. I use Index(vals) to get the correct type coercion (e.g. for a Timestamp object, to ensure it does the same as hash_array). But I don't do this at the beginning, because it is much slower than needed for simple numerical values.
It would be nice to have a utility function to convert a list of values to an array with the dtype "how pandas wants it" (but maybe that exists and I just don't know about it).
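
A quick illustration of the coercion difference described above (assuming a plain Timestamp):

In [1]: np.array([pd.Timestamp('2012-01-01')]).dtype
Out[1]: dtype('O')

In [2]: pd.Index([pd.Timestamp('2012-01-01')]).values.dtype
Out[2]: dtype('<M8[ns]')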

Contributor

> It would be nice to have a utility function to convert a list of values to an array with the dtype

this is what the Series/Index constructors do. About the most complicated code that exists.

return hash_array(vals, hash_key=hash_key, encoding=encoding,
categorize=False)
vals = hashing.hash_object_array(vals, hash_key, encoding)

# Then, redistribute these 64-bit ints within the space of 64-bit ints
vals ^= vals >> 30
vals *= np.uint64(0xbf58476d1ce4e5b9)
vals ^= vals >> 27
vals *= np.uint64(0x94d049bb133111eb)
vals ^= vals >> 31
return vals
20 changes: 15 additions & 5 deletions pandas/tests/util/test_hashing.py
@@ -6,7 +6,7 @@

from pandas import DataFrame, Series, Index, MultiIndex
from pandas.util import hash_array, hash_pandas_object
from pandas.core.util.hashing import hash_tuples, hash_tuple
from pandas.core.util.hashing import hash_tuples, hash_tuple, _hash_scalar
import pandas.util.testing as tm


@@ -81,10 +81,20 @@ def test_hash_tuples(self):

def test_hash_tuple(self):
# test equivalence between hash_tuples and hash_tuple
tup = (1, 'one')
result = hash_tuple(tup)
expected = hash_tuples([tup])[0]
assert result == expected
for tup in [(1, 'one'), (1, np.nan)]:
Contributor

test with pd.NaT as well
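
A sketch of what the extended test could look like (the extra tuple is illustrative):

for tup in [(1, 'one'), (1, np.nan), (1.0, pd.NaT, 'A')]:
    assert hash_tuple(tup) == hash_tuples([tup])[0]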

result = hash_tuple(tup)
expected = hash_tuples([tup])[0]
assert result == expected

def test_hash_scalar(self):
for val in [1, 1.4, 'A', b'A', u'A', pd.Timestamp("2012-01-01"),
pd.Timestamp("2012-01-01", tz='Europe/Brussels'),
pd.Period('2012-01-01', freq='D'), pd.Timedelta('1 days'),
pd.Interval(0, 1), np.nan, pd.NaT, None]:
result = _hash_scalar(val)
expected = hash_array(np.array([val], dtype=object),
categorize=True)
assert result[0] == expected[0]

def test_hash_tuples_err(self):
