PERF: improve MultiIndex get_loc performance #16346
Changes from 1 commit
pandas/core/util/hashing.py:

```diff
@@ -5,13 +5,16 @@
 import numpy as np
 from pandas._libs import hashing
 from pandas.compat import string_and_binary_types, text_type
 from pandas.core.dtypes.generic import (
     ABCMultiIndex,
     ABCIndexClass,
     ABCSeries,
     ABCDataFrame)
 from pandas.core.dtypes.common import (
     is_categorical_dtype, is_list_like)
 from pandas.core.dtypes.missing import isnull


 # 16 byte long hashing key
 _default_hash_key = '0123456789123456'
```
```diff
@@ -179,9 +182,17 @@ def hash_tuple(val, encoding='utf8', hash_key=None):
     hash

     """
-    hashes = (hash_array(np.array([v]), encoding=encoding, hash_key=hash_key,
-                         categorize=False)
+    #def to_array(v):
+    #    dtype, arr = infer_dtype_from_array([v])
+    #    return np.asarray(arr, dtype=dtype)
+
+    #hashes = (hash_array(to_array(v), encoding=encoding, hash_key=hash_key,
+    #                     categorize=False)
+    #          for v in val)
+
+    hashes = (_hash_scalar(v, encoding=encoding, hash_key=hash_key)
               for v in val)

     h = _combine_hash_arrays(hashes, len(val))[0]

     return h
```
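As a quick sanity check of the new path, `hash_tuple` should agree element-wise with `hash_tuples` (this mirrors the test added below):

```python
from pandas.core.util.hashing import hash_tuple, hash_tuples

tup = (1, 'one')
assert hash_tuple(tup) == hash_tuples([tup])[0]
```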
```diff
@@ -299,3 +310,63 @@ def hash_array(vals, encoding='utf8', hash_key=None, categorize=True):
     vals *= np.uint64(0x94d049bb133111eb)
     vals ^= vals >> 31
     return vals
+
+
+def _hash_scalar(val, encoding='utf8', hash_key=None):
+    """
+    Hash scalar value
+
+    Returns
+    -------
+    1d uint64 numpy array of hash value, of length 1
+    """
+
+    if hash_key is None:
+        hash_key = _default_hash_key
+
+    if isnull(val):
+        # this is to be consistent with the _hash_categorical implementation
+        return np.array([np.iinfo(np.uint64).max], dtype='u8')
```
Reviewer: if you need to handle datetime w/tz directly (IOW, we basically ignore it), then I would.

Reviewer: I suppose an option to ignore tz is fine for …

Author: yes, I can certainly do that check here as well. That is maybe better, to keep the custom logic here, as the keyword added to …

Reviewer: I think better locally.
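A minimal sketch of the kind of local tz check being discussed, assuming one wants to ignore the tz by hashing the naive UTC value (`_strip_tz` is a hypothetical helper, not part of this commit):

```python
import pandas as pd

def _strip_tz(val):
    # hypothetical: normalize tz-aware datetimes to their naive UTC value
    if getattr(val, 'tzinfo', None) is not None:
        val = pd.Timestamp(val).tz_convert(None)  # drop tz after converting to UTC
    return val

ts = pd.Timestamp("2012-01-01", tz='Europe/Brussels')
print(_strip_tz(ts))  # Timestamp('2011-12-31 23:00:00')
```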
```diff
+    if isinstance(val, string_and_binary_types + (text_type,)):
+        vals = np.array([val], dtype=object)
+        string_like = True
+    else:
+        vals = np.array([val])
+        string_like = False
+
+    dtype = vals.dtype
+
+    #dtype, vals = infer_dtype_from_array([vals])
+    #if dtype == np.object_:
+    #    vals = np.asarray(vals, dtype='object')
+    #    dtype = vals.dtype
+
+    # we'll be working with everything as 64-bit values, so handle this
+    # 128-bit value early
+    if np.issubdtype(dtype, np.complex128):
+        return hash_array(vals.real) + 23 * hash_array(vals.imag)
+
+    # First, turn whatever array this is into unsigned 64-bit ints, if we can
+    # manage it.
+    elif isinstance(dtype, np.bool):
+        vals = vals.astype('u8')
+    elif issubclass(dtype.type, (np.datetime64, np.timedelta64)):
+        vals = vals.view('i8').astype('u8', copy=False)
+    elif issubclass(dtype.type, np.number) and dtype.itemsize <= 8:
+        vals = vals.view('u{}'.format(vals.dtype.itemsize)).astype('u8')
+    else:
+        if not string_like:
+            from pandas import Index
+            vals = Index(vals).values
```
Author: This is also a bit of an ugly part. I use …

Reviewer: this is what the Series/Index constructors do. About the most complicated code that exists.
```diff
+
+        return hash_array(vals, hash_key=hash_key, encoding=encoding,
+                          categorize=False)
+        vals = hashing.hash_object_array(vals, hash_key, encoding)
+
+    # Then, redistribute these 64-bit ints within the space of 64-bit ints
+    vals ^= vals >> 30
+    vals *= np.uint64(0xbf58476d1ce4e5b9)
+    vals ^= vals >> 27
+    vals *= np.uint64(0x94d049bb133111eb)
+    vals ^= vals >> 31
+    return vals
```
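The closing lines are the same bit-mixing finalizer used at the end of `hash_array` (the constants appear to be those of the splitmix64 finalizer). As a standalone sketch of what that step does (`mix64` is a hypothetical name):

```python
import numpy as np

def mix64(vals):
    # hypothetical standalone version of the final redistribution step;
    # uint64 array multiplications wrap modulo 2**64, which is intended here
    vals = vals.astype('u8', copy=True)
    vals ^= vals >> 30
    vals *= np.uint64(0xbf58476d1ce4e5b9)
    vals ^= vals >> 27
    vals *= np.uint64(0x94d049bb133111eb)
    vals ^= vals >> 31
    return vals

print(mix64(np.array([1, 2, 3])))  # three well-spread 64-bit hash values
```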
test_hashing.py:

```diff
@@ -6,7 +6,7 @@

 from pandas import DataFrame, Series, Index, MultiIndex
 from pandas.util import hash_array, hash_pandas_object
-from pandas.core.util.hashing import hash_tuples, hash_tuple
+from pandas.core.util.hashing import hash_tuples, hash_tuple, _hash_scalar
 import pandas.util.testing as tm
```

```diff
@@ -81,10 +81,20 @@ def test_hash_tuples(self):

     def test_hash_tuple(self):
         # test equivalence between hash_tuples and hash_tuple
-        tup = (1, 'one')
-        result = hash_tuple(tup)
-        expected = hash_tuples([tup])[0]
-        assert result == expected
+        for tup in [(1, 'one'), (1, np.nan)]:
```
Reviewer: test with pd.NaT as well
```diff
+            result = hash_tuple(tup)
+            expected = hash_tuples([tup])[0]
+            assert result == expected
+
+    def test_hash_scalar(self):
+        for val in [1, 1.4, 'A', b'A', u'A', pd.Timestamp("2012-01-01"),
+                    pd.Timestamp("2012-01-01", tz='Europe/Brussels'),
+                    pd.Period('2012-01-01', freq='D'), pd.Timedelta('1 days'),
+                    pd.Interval(0, 1), np.nan, pd.NaT, None]:
+            result = _hash_scalar(val)
+            expected = hash_array(np.array([val], dtype=object),
+                                  categorize=True)
+            assert result[0] == expected[0]

     def test_hash_tuples_err(self):
```
Reviewer: this is duplicating lots of code. I'm sure there is a way to do this a bit more generically.
Author: Yes, I know, but this was mainly to test. And from that, this makes it a lot faster than using the hash_array (the commented-out part in hash_tuple). So not sure how to solve that. In principle I can put the common parts in helper functions (e.g. the redistributing part), but for most of it that is not possible, as there are slight differences.
Reviewer: what does all of this buy you? (IOW, can you post updated timings?)

Maintaining a separate code path for scalars will cause future issues: these will need to be kept in sync with the array hashing if any code changes are made. You can easily share code here, which would make this more palatable.
Author: hash_tuple2 uses the hash_array (the commented-out version in the current branch); hash_tuple uses the _hash_scalar.
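The timing numbers themselves did not survive the page scrape; a sketch of how the two variants could be compared (`hash_tuple2` here is a hypothetical stand-in for the commented-out hash_array-based version; `dtype=object` is forced so the object-array hashing can handle the string element):

```python
import timeit
import numpy as np
from pandas.core.util.hashing import (hash_tuple, hash_array,
                                      _combine_hash_arrays)

def hash_tuple2(val, encoding='utf8', hash_key=None):
    # hypothetical stand-in: hash each item as a length-1 object array
    hashes = (hash_array(np.array([v], dtype=object), encoding=encoding,
                         hash_key=hash_key, categorize=False)
              for v in val)
    return _combine_hash_arrays(hashes, len(val))[0]

tup = (1, 'one')
print(timeit.timeit(lambda: hash_tuple(tup), number=10000))   # _hash_scalar path
print(timeit.timeit(lambda: hash_tuple2(tup), number=10000))  # hash_array path
```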
Reviewer: ok, it's probably reasonable, but I'm willing to sacrifice some perf to get some shared code (IOW, between old and new is probably a good compromise vs. repeating lots of code).
Author: OK, I pushed a new version that almost fully reuses the hash_array function for the actual hashing; it only has some specific logic before that to convert the scalar to a proper array. This removes a lot of the code duplication, and has only a minor perf impact.

Apart from that, it still has the separate code path to first convert the scalar to an array, which might be a bit brittle, and I agree is certainly not ideal for code maintenance, but using something more general (e.g. infer_dtype) gives a big perf penalty.
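For illustration, the shape of that new version is roughly as below (a sketch, assuming pandas' internal `infer_dtype_from_scalar` helper; not the exact committed code): convert the scalar to a correctly-typed length-1 array, then delegate the actual hashing to `hash_array`:

```python
import numpy as np
from pandas.core.dtypes.cast import infer_dtype_from_scalar
from pandas.core.util.hashing import hash_array

def hash_scalar_via_array(val, encoding='utf8', hash_key=None):
    # hypothetical: infer a concrete dtype for the scalar, wrap it in a
    # length-1 array, then reuse hash_array for the actual hashing
    dtype, val = infer_dtype_from_scalar(val)
    vals = np.array([val], dtype=dtype)
    return hash_array(vals, encoding=encoding, hash_key=hash_key,
                      categorize=False)
```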