PERF: improve MultiIndex get_loc performance #16346
Conversation
# version of _check_for_collisions above for single label (tuple)
result = self.mi[loc]
if not array_equivalent(result, label):
I want to compare 2 tuples here, but cannot just do result == label, as this would not compare equal when there are NaNs. For now I used array_equivalent, which does what I want (but still has some overhead). But do we have another utility to do that?
In [4]: %timeit array_equivalent((1, 2, np.nan), (1, 2, np.nan))
34.6 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: compare = lambda x, y: all(np.isnan(e) and np.isnan(f) or e == f for e, f in zip(x, y))
In [8]: compare((1,2, np.nan), (1, 2, np.nan))
Out[8]: True
In [9]: %timeit compare((1,2, np.nan), (1, 2, np.nan))
5.37 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Yes, much simpler (and faster)!
As long as we're optimizing, math.isnan is quite a bit faster for scalars:
In [3]: %timeit compare((1,2, np.nan), (1, 2, np.nan))
4.14 µs ± 66.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: import math
In [5]: compare2 = lambda x, y: all(math.isnan(e) and math.isnan(f) or e == f for e, f in zip(x, y))
In [6]: %timeit compare2((1,2, np.nan), (1, 2, np.nan))
1.34 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Actually, I think you need to use isnull; both isnan and math.isnan are not complete (IOW they won't, for example, handle NaT).
But since this is Cython, you can use lib.is_null_datetimelike (covers the bases of None, nan, NaT).
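To make the idea concrete, here is a minimal pure-Python sketch of such a null-aware tuple comparison; it uses the public pd.isnull as a stand-in for the internal lib.is_null_datetimelike, and the name compare_tuples is made up for illustration:

import numpy as np
import pandas as pd

def compare_tuples(x, y):
    # treat any two missing values (np.nan, None, pd.NaT) as equal,
    # otherwise fall back to ordinary equality
    return len(x) == len(y) and all(
        (pd.isnull(a) and pd.isnull(b)) or a == b
        for a, b in zip(x, y))

compare_tuples((1, 2, np.nan), (1, 2, np.nan))   # True
compare_tuples(('A', pd.NaT), ('A', pd.NaT))     # True
compare_tuples((1, 2), (1, 3))                   # False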
Actually, I wanted to test whether this comparison was working on a real use case with NaNs in the index, but this seems to not work anyway (regardless of this collision check):
In [11]: mi = pd.MultiIndex.from_product([['A', 'B'], [1, np.nan]])
In [12]: mi
Out[12]:
MultiIndex(levels=[['A', 'B'], [1]],
labels=[[0, 0, 1, 1], [0, -1, 0, -1]])
In [13]: mi.get_loc(('A', np.nan))
...
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()
KeyError: ('A', nan)
The above is with 0.19.2, so this was already failing with the PyObject hashtable
pandas/core/util/hashing.py
Outdated
@@ -264,7 +287,7 @@ def hash_array(vals, encoding='utf8', hash_key=None, categorize=True):
     try:
         vals = hashing.hash_object_array(vals, hash_key, encoding)
-    except TypeError:
+    except (TypeError, ValueError):
I needed to add this ValueError because I got the following error with the line above:
In [6]: pd.core.util.hashing.hash_array(np.array(['E']), categorize=False)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-6197ea2d3075> in <module>()
----> 1 pd.core.util.hashing.hash_array(np.array(['E']), categorize=False)
/home/joris/scipy/pandas/pandas/core/util/hashing.py in hash_array(vals, encoding, hash_key, categorize)
286
287 try:
--> 288 vals = hashing.hash_object_array(vals, hash_key, encoding)
289 except TypeError:
290 # we have mixed types
/home/joris/scipy/pandas/pandas/_libs/hashing.pyx in pandas._libs.hashing.hash_object_array (pandas/_libs/hashing.c:1608)()
ValueError: Does not understand character buffer dtype format string ('w')
So this gives a ValueError and not a TypeError, while the version in the except clause does work. But I don't know if it is OK to just catch that as well, or whether I should rather change hash_object_array itself.
The hashing doesn't handle fixed-length string arrays (it prob could, but we don't generally have them). If it's an object array it will work.
In [21]: np.array(['E'])
Out[21]:
array(['E'],
dtype='<U1')
In [22]: np.array(['E'], dtype=object)
Out[22]: array(['E'], dtype=object)
In [20]: pd.core.util.hashing.hash_array(np.array(['E'], dtype=object), categorize=False)
Out[20]: array([16179053688037232491], dtype=uint64)
Yes, that is related to how I coerce to an array above. When using np.array([val]), I don't get object arrays. But inferring the dtype with infer_dtype_from_array gives some overhead (probably not too much, though).
pandas/core/util/hashing.py
Outdated
    hash
    """
    hashes = (hash_array(np.array([v]), encoding=encoding, hash_key=hash_key,
Not sure whether np.array([v]) will always do the correct coercion of the type. In the hash_tuples version, this coercion to dtyped arrays happens in MultiIndex.from_tuples(vals).
More in general, a large part of the remaining time is spent in the hash_array function. A more specialized hasher for single scalars would probably further improve it.
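For illustration, a rough sketch of what such a scalar-specialized path could look like: wrap the value in a one-element array and reuse hash_array. This is not the _hash_scalar that ended up in the PR; the name hash_scalar_sketch and the string special-case are assumptions made for the example.

import numpy as np
from pandas.core.util.hashing import hash_array

def hash_scalar_sketch(val, encoding='utf8', hash_key=None):
    # wrap the scalar in a 1-element array and reuse the array hasher;
    # force object dtype for strings so numpy does not pick a
    # fixed-width unicode dtype such as '<U1'
    if isinstance(val, str):
        vals = np.array([val], dtype=object)
    else:
        vals = np.array([val])
    return hash_array(vals, encoding=encoding,
                      hash_key=hash_key, categorize=False)

hash_scalar_sketch('E')   # uint64 array of length 1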
For a single element np.array will work, but if you have multiple elements it will not; use this instead:
In [10]: from pandas.core.dtypes.cast import infer_dtype_from_array
In [12]: infer_dtype_from_array(['foo', np.nan])
Out[12]: (numpy.object_, ['foo', nan])
In [14]: dtype, arr = infer_dtype_from_array(['foo', np.nan])
In [15]: np.array(arr, dtype=dtype)
Out[15]: array(['foo', nan], dtype=object)
In [16]: np.array(['foo', np.nan])
Out[16]:
array(['foo', 'nan'],
      dtype='<U3')
yep
but if you have multiple elements it will not
It will always be a single element, I think (it is the individual elements of the single tuple that are put in an array).
numpy converts strings to a fixed-length dtype, which is wrong; use the routine to get it correct.
Codecov Report
@@ Coverage Diff @@
## master #16346 +/- ##
==========================================
- Coverage 90.38% 90.36% -0.02%
==========================================
Files 161 161
Lines 50916 50920 +4
==========================================
- Hits 46021 46015 -6
- Misses 4895 4905 +10
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #16346 +/- ##
==========================================
- Coverage 90.38% 90.37% -0.01%
==========================================
Files 161 161
Lines 50949 50963 +14
==========================================
+ Hits 46052 46060 +8
- Misses 4897 4903 +6
Continue to review full report at Codecov.
@@ -921,6 +923,16 @@ cdef class MultiIndexHashTable(HashTable):
                "hash collision\nlocs:\n{}\n"
                "result:\n{}\nmi:\n{}".format(alocs, result, mi))

    def _check_for_collision(self, Py_ssize_t loc, object label):
Make this cdef; it's hitting overhead because it has to go to Python and back to Cython.
def _hash_scalar(val, encoding='utf8', hash_key=None):
    """
This is duplicating lots of code. I'm sure there is a way to do this a bit more generically.
Yes, I know, but this was mainly to test. And from that, it turns out this makes it a lot faster than using hash_array (the commented part in hash_tuple). So not sure how to solve that. In principle I can put the common parts in helper functions (e.g. the redistributing part), but for most of it that is not possible as there are slight differences.
What does all of this buy you (IOW can you post updated timings)?
Maintaining a separate code path for scalars will cause future issues: these will need to be kept in sync with the array hashing if any code changes are made. You can easily share code here, which would make this more palatable.
In [4]: %timeit pd.core.util.hashing.hash_tuple2((999, np.nan, 'E'))
380 µs ± 60.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %timeit pd.core.util.hashing.hash_tuple((999, np.nan, 'E'))
81.8 µs ± 3.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
hash_tuple2 uses hash_array (the commented-out version in the current branch); hash_tuple uses hash_scalar.
OK, it's prob reasonable, but I'm willing to sacrifice some perf to get some shared code (IOW something between old and new is prob a good compromise vs. repeating lots of code).
OK, I pushed a new version that almost fully reuses the hash_array function for the actual hashing, and only has some specific logic before that to convert the scalar to a proper array.
This reduces a lot of the code duplication, and has only a minor perf impact.
Apart from that, it still has the separate code path to first convert the scalar to an array, which might be a bit brittle, and I agree is certainly not ideal for code maintenance, but using something more general (e.g. infer_dtype) gives a big perf penalty.
pandas/core/util/hashing.py
Outdated
else:
    if not string_like:
        from pandas import Index
        vals = Index(vals).values
This is also a bit of an ugly part. I use Index(vals) to get the correct type coercion (e.g. for a Timestamp object, to ensure it does the same as hash_array). But I don't do this at the beginning because it is much slower than needed for simple numerical values.
It would be nice to have a utility function to convert a list of values to an array with the dtype "how pandas wants it" (but maybe that exists and I just don't know about it).
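As a small illustration of the coercion difference (behaviour as I recall it, dtypes shown as comments):

import numpy as np
import pandas as pd

vals = [pd.Timestamp('2017-05-16'), np.nan]

np.array(vals).dtype          # object: the Timestamp stays a Python object
pd.Index(vals).values.dtype   # datetime64[ns]: the coercion hash_array expects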
It would be nice to have a utility function to convert a list of values to an array with the dtype
this is what Series/Index constructors do. About the most complicated code that exists.
pandas/core/util/hashing.py
Outdated
else:
    vals = np.array([val])

if vals.dtype == np.object_:
You should just call infer_dtype_from_scalar instead of reinventing the wheel.
Ah, you can only just call that if you know it exists :-)
But thanks for the pointer, that is exactly what I was hoping already existed.
I made a slight adaptation (extra keyword) to get the behaviour I need here, but I can also put that in the hash_scalar method instead of in infer_dtype_from_scalar (but there it is a bit easier to do).
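For anyone else who did not know it existed, infer_dtype_from_scalar returns a (dtype, value) pair, roughly like this (illustrative; exact reprs can differ by platform and version):

from pandas.core.dtypes.cast import infer_dtype_from_scalar

infer_dtype_from_scalar(1)       # (numpy.int64, 1) on 64-bit platforms
infer_dtype_from_scalar(1.5)     # (numpy.float64, 1.5)
infer_dtype_from_scalar('foo')   # (numpy.object_, 'foo')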
pandas/tests/util/test_hashing.py
Outdated
@@ -79,6 +79,23 @@ def test_hash_tuples(self):
        result = hash_tuples(tups[0])
        assert result == expected[0]

    def test_hash_tuple(self):
        # test equivalence between hash_tuples and hash_tuple
        for tup in [(1, 'one'), (1, np.nan)]:
Test with pd.NaT as well.
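For example, the loop could be extended along these lines (a sketch of the idea, not necessarily the exact test that was added; it assumes the hash_tuple/hash_tuples helpers from this PR):

import numpy as np
import pandas as pd
from pandas.core.util.hashing import hash_tuple, hash_tuples

# each tuple should hash the same via the scalar path and the array path,
# including missing values such as np.nan and pd.NaT
for tup in [(1, 'one'), (1, np.nan), (1.0, pd.NaT, 'A')]:
    assert hash_tuple(tup) == hash_tuples([tup])[0]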
pandas/core/dtypes/cast.py
Outdated
@@ -333,7 +333,7 @@ def maybe_promote(dtype, fill_value=np.nan):
     return dtype, fill_value


-def infer_dtype_from_scalar(val, pandas_dtype=False):
+def infer_dtype_from_scalar(val, pandas_dtype=False, use_datetimetz=True):
You don't need this extra parameter; instead you can pass pandas_dtype=True. What is the issue?
pandas_dtype=True does not return a np.datetime64 but our DatetimeTZDtype, and then Periods also get converted, which I don't need either.
So the combination I need, i.e. tz-aware timestamps converted to their numpy dtype instead of a pandas extension type or object, but Periods kept as objects, is not possible with the current options. This is a bit of a strange combination, but that's the consequence of how those are returned from the Index values (datetime64 without tz, but Periods as objects), which is how they are hashed.
But indeed, if we add this, ignore_tz is probably a better name.
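To illustrate the Index behaviour being referred to (as it was at the time of this PR; results shown as comments):

import pandas as pd

pd.Index([pd.Timestamp('2017-05-16', tz='US/Eastern')]).values.dtype
# datetime64[ns] -- the timezone is dropped, leaving naive UTC values

pd.Index([pd.Period('2017-05', freq='M')]).values.dtype
# object -- Periods stay Python objects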
if isnull(val):
    # this is to be consistent with the _hash_categorical implementation
    return np.array([np.iinfo(np.uint64).max], dtype='u8')
If you need to handle datetime w/tz directly (IOW, we basically ignore it), then I would do:
if getattr(val, 'tzinfo', None) is not None:
    val = val.tz_localize(None)
I suppose an option to ignore tz is fine for infer_dtype_from_scalar, but if you add it I would rename, document and test it.
Yes, I can certainly do that check here as well. It is maybe better to keep the custom logic here, as the keyword added to infer_dtype_from_scalar would not be used anywhere else.
Yeah, I think it's better locally.
Ran the benchmarks with the latest version of this PR. Against master:
Against 0.19.2 (which combines the slowdown/improvement of the several PRs), it also gives an improvement now for that benchmark:
# for tz-aware datetimes, we need the underlying naive UTC value and
# not the tz aware object or pd extension type (as
# infer_dtype_from_scalar would do)
if not isinstance(val, tslib.Timestamp):
see #16372 (for later)
@jorisvandenbossche lgtm. merge when ready (for 0.20.2) is good.
@jreback thanks for the review!
nice fix @jorisvandenbossche !
* upstream/master: (48 commits)
  BUG: Categorical comparison with unordered (pandas-dev#16339)
  ENH: Adding 'protocol' parameter to 'to_pickle'.
  PERF: improve MultiIndex get_loc performance (pandas-dev#16346)
  TST: remove pandas-datareader xfail as 0.4.0 works (pandas-dev#16374)
  TST: followup to pandas-dev#16364, catch errstate warnings (pandas-dev#16373)
  DOC: new oauth token
  TST: Add test for clip-na (pandas-dev#16369)
  ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)
  MAINT: Add .iml to .gitignore (pandas-dev#16368)
  BUG/API: Categorical constructor scalar categories (pandas-dev#16340)
  ENH: Provide dict object for to_dict() pandas-dev#16122 (pandas-dev#16220)
  PERF: improved clip performance (pandas-dev#16364)
  DOC: try new token for docs
  DOC: try with new secure token
  DOC: add developer section to the docs
  DEPS: Drop Python 3.4 support (pandas-dev#16303)
  DOC: remove credential helper
  DOC: force fetch on build docs
  DOC: redo dev docs access token
  DOC: add dataframe construction in merge_asof example (pandas-dev#16348)
  ...
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples

* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples
(cherry picked from commit 34ebad8)
While I was timing the MultiIndex get_loc in #16324, I saw some 'quick wins' with some further profiling.
Rationale:
* _check_for_collisions. This function was a generic one for one or multiple labels. I added a slightly adapted version specialized for a single tuple.
* hash_tuples. Again, added a specialized version for a single tuple.

@jreback I have some questions (will add them as comments), as this is not my specialty, but a good opportunity to get more familiar with the inner guts :-)
Some timings using
master:
PR:
So something between a 5x and 15x improvement (big variability between individual timings)