Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: improve MultiIndex get_loc performance #16346

Merged

Conversation

jorisvandenbossche
Copy link
Member

While I was timing the MultiIndex get_loc in #16324, I saw some 'quick wins' with some further profiling.

Rationale:

  • big part of the time was spent in checking for a hash collision: _check_for_collisions. This function was a generic one for one or multiple labels. I added a slightly adapted version specialized for a single tuple
  • another important factor was the hash creation with hash_tuples. Again, added specialized version for a single tuple.

@jreback I have some questions (will add them as comments), as this is not my specialty, but good opportunity to get more familiar with the inner guts :-)

Some timings using

import string
mi_large = pd.MultiIndex.from_product(
    [np.arange(1000),
     np.arange(20), list(string.ascii_letters)],
    names=['one', 'two', 'three'])

master:

In [2]: %time mi_large.get_loc((999, 19, 'E'))   # <--- first slower cold one
CPU times: user 140 ms, sys: 8 ms, total: 148 ms
Wall time: 145 ms
Out[2]: 1039978

In [3]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 4.33 ms
Out[3]: 1039978

In [4]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 4.18 ms
Out[4]: 1039978

In [5]: %timeit mi_large.get_loc((999, 19, 'E'))
100 loops, best of 3: 2.56 ms per loop

In [6]: %%timeit
   ...: for _ in range(1000):
   ...:     mi_large.get_loc((999, 19, 'E'))
   ...: 
1 loop, best of 3: 2.62 s per loop

PR:

In [3]: %time mi_large.get_loc((999, 19, 'E'))   # <--- first slower cold one
CPU times: user 128 ms, sys: 24 ms, total: 152 ms
Wall time: 152 ms
Out[3]: 1039978

In [4]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 716 µs
Out[4]: 1039978

In [5]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 371 µs
Out[5]: 1039978

In [6]: %timeit mi_large.get_loc((999, 19, 'E'))
1000 loops, best of 3: 189 µs per loop

In [15]: %%timeit
    ...: for _ in range(1000):
    ...:     mi_large.get_loc((999, 19, 'E'))
    ...: 
1 loop, best of 3: 188 ms per loop

So something between a 5x and 15x improvement (big variability between individual timings)

@jorisvandenbossche jorisvandenbossche added the Performance Memory or execution speed performance label May 12, 2017
# version of _check_for_collisions above for single label (tuple)

result = self.mi[loc]
if not array_equivalent(result, label):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to compare 2 tuples here, but cannot just do result == equal as this would not compare equal when there are NaNs.
For now I used array_equivalent which does what I want (but still has some overhead). But do we have other utility to do that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [4]: %timeit array_equivalent((1, 2, np.nan), (1, 2, np.nan))
34.6 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: compare = lambda x, y: all(np.isnan(e) and np.isnan(f) or e == f for e, f in zip(x, y))

In [8]: compare((1,2, np.nan), (1, 2, np.nan))
Out[8]: True

In [9]: %timeit compare((1,2, np.nan), (1, 2, np.nan))
5.37 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, much simpler (and faster)!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we're optimizing, math.isnan is quite a bit faster for scalars:

In [3]: %timeit compare((1,2, np.nan), (1, 2, np.nan))
4.14 µs ± 66.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: import math

In [5]: compare2 = lambda x, y: all(math.isnan(e) and math.isnan(f) or e == f for e, f in zip(x, y))

In [6]: %timeit compare2((1,2, np.nan), (1, 2, np.nan))
1.34 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I think you need to use isnull; both isnan and math.isnan are not complete (IOW they won't for example handle NaT)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but since this is cython, you use lib.is_null_datetimelike (covers the bases of None, nan, NaT)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I wanted to test whether this comparison was working on a real usecase with NaNs in the index, but this seems to not work anyway (irregardless of this collision check):

In [11]: mi = pd.MultiIndex.from_product([['A', 'B'], [1, np.nan]])

In [12]: mi
Out[12]: 
MultiIndex(levels=[['A', 'B'], [1]],
           labels=[[0, 0, 1, 1], [0, -1, 0, -1]])

In [13]: mi.get_loc(('A', np.nan))
...
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()

KeyError: ('A', nan)

The above is with 0.19.2, so this was already failing with the PyObject hashtable

@@ -264,7 +287,7 @@ def hash_array(vals, encoding='utf8', hash_key=None, categorize=True):

try:
vals = hashing.hash_object_array(vals, hash_key, encoding)
except TypeError:
except (TypeError, ValueError):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to add this ValueError because I got the following error with the line above:

In [6]: pd.core.util.hashing.hash_array(np.array(['E']), categorize=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-6197ea2d3075> in <module>()
----> 1 pd.core.util.hashing.hash_array(np.array(['E']), categorize=False)

/home/joris/scipy/pandas/pandas/core/util/hashing.py in hash_array(vals, encoding, hash_key, categorize)
    286 
    287         try:
--> 288             vals = hashing.hash_object_array(vals, hash_key, encoding)
    289         except TypeError:
    290             # we have mixed types

/home/joris/scipy/pandas/pandas/_libs/hashing.pyx in pandas._libs.hashing.hash_object_array (pandas/_libs/hashing.c:1608)()

ValueError: Does not understand character buffer dtype format string ('w')

so this gives a ValueError and not a TypeError, while the version in the except does work. But I don't know it is OK to just catch that as well, or whether I should rather change the hash_object_array array

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the hashing doesn't handle fixed-length arrays (it prob could, but we don't generally have them). if its an object array it will work.

In [21]: np.array(['E'])
Out[21]: 
array(['E'], 
      dtype='<U1')

In [22]: np.array(['E'], dtype=object)
Out[22]: array(['E'], dtype=object)

In [20]: pd.core.util.hashing.hash_array(np.array(['E'], dtype=object), categorize=False)
Out[20]: array([16179053688037232491], dtype=uint64)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that is related to how I coerce to an array above. When using np.array([val]), I don't get object arrays. But inferring the dtype with infer_dtype_from_array gives some overhead (but probably not too much)

hash

"""
hashes = (hash_array(np.array([v]), encoding=encoding, hash_key=hash_key,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether np.array([v]) will always do the correct coercion of the type. In the hash_tuples version, this coercion to dtyped arrays happens in MultiIndex.from_tuples(vals)

More in general, large part of the remaining time is spent in the hash_array function. A more specialized hasher for single scalars would probably further improve it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for a single element np.array will work, but if you have multiple elements it will not, use this insteadIn [10]: from pandas.core.dtypes.cast import infer_dtype_from_array

from pandas.core.dtypes.cast import infer_dtype_from_array

 [12]: infer_dtype_from_array(['foo', np.nan])
Out[12]: (numpy.object_, ['foo', nan])

In [14]: dtype, arr = infer_dtype_from_array(['foo', np.nan])

In [15]: np.array(arr, dtype=dtype)
Out[15]: array(['foo', nan], dtype=object)

In [16]: np.array(['foo', np.nan])
Out[16]: 
array(['foo', 'nan'], 
      dtype='<U3')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if you have multiple elements it will not

it will always be a single element I think (it are the individual elements of the single tuple that are put in an array)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy converts strings to fixed length which is wrong

use routine to have it corrrect

@jorisvandenbossche jorisvandenbossche changed the title Perf mi get loc hash PERF: improve MultiIndex get_loc performance May 12, 2017
@codecov
Copy link

codecov bot commented May 12, 2017

Codecov Report

Merging #16346 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16346      +/-   ##
==========================================
- Coverage   90.38%   90.36%   -0.02%     
==========================================
  Files         161      161              
  Lines       50916    50920       +4     
==========================================
- Hits        46021    46015       -6     
- Misses       4895     4905      +10
Flag Coverage Δ
#multiple 88.14% <100%> (-0.01%) ⬇️
#single 40.22% <14.28%> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/util/hashing.py 90.67% <100%> (+0.32%) ⬆️
pandas/core/indexes/multi.py 96.56% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/common.py 90.68% <0%> (-0.35%) ⬇️
pandas/core/frame.py 97.68% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4bdbcb6...e88b658. Read the comment docs.

@codecov
Copy link

codecov bot commented May 12, 2017

Codecov Report

Merging #16346 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16346      +/-   ##
==========================================
- Coverage   90.38%   90.37%   -0.01%     
==========================================
  Files         161      161              
  Lines       50949    50963      +14     
==========================================
+ Hits        46052    46060       +8     
- Misses       4897     4903       +6
Flag Coverage Δ
#multiple 88.15% <100%> (ø) ⬆️
#single 40.2% <26.31%> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 96.56% <100%> (ø) ⬆️
pandas/core/util/hashing.py 92.96% <100%> (+2.61%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.69% <0%> (-0.1%) ⬇️
pandas/core/common.py 91.39% <0%> (+0.33%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d3bef8...8acc9e8. Read the comment docs.

@@ -921,6 +923,16 @@ cdef class MultiIndexHashTable(HashTable):
"hash collision\nlocs:\n{}\n"
"result:\n{}\nmi:\n{}".format(alocs, result, mi))

def _check_for_collision(self, Py_ssize_t loc, object label):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this cdef, its hitting overhead because it has to go to python and back to cython



def _hash_scalar(val, encoding='utf8', hash_key=None):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is duplicating lots of code. I sure there is a way to do this a bit more generically.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know, but this was mainly to test.
And from that it, this makes it a lot faster than using the hash_array (the commented part in hash_tuple). So not sure how to solve that. In principle I can put the common parts in helper functions (eg the redistributing part), but for most of it it is not possible as there are slight differences.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does all of this buy you? (IOW can you post updated timings)?

maintaining a separate code path for scalars will cause future issues. these will need to be kept in sync (with the array hashing), if any code changes are made. you can easily share code here which would make this more palatable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [4]: %timeit pd.core.util.hashing.hash_tuple2((999, np.nan, 'E'))
380 µs ± 60.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit pd.core.util.hashing.hash_tuple((999, np.nan, 'E'))
81.8 µs ± 3.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

hash_tuple2 uses the hash_array (the commented out version in the current branch), the hash_tuple uses the hash_scalar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, its prob reasonable, but willing to sacrifice some perf to get some shared code (IOW between old and new prob a good compromise to repeating lots of code)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I pushed a new version that almost fully reuses the hash_array function for the actual hashing, only has some specific logic before that to convert the scalar to a proper array.
This reduces a lot of the code duplication, and has only minor perf impact.

Apart from that, it still has the separate code path to first convert the scalar to an array, which might be a bit brittle, and I agree certainly not ideal for code maintenance, but using something more general (eg infer_dtype) gives a big perf penalty.

else:
if not string_like:
from pandas import Index
vals = Index(vals).values
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a bit an ugly part. I use Index(vals) to get the correct type coercion (eg for a Timestamp object, to ensure it does the same as hash_array). But I don't do this in the beginning because this is much slower than needed for simple numerical values.
It would be nice to have a utility function to convert a list of values to an array with the dtype "how pandas wants it" (but maybe that exists and I just don't know about it)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a utility function to convert a list of values to an array with the dtype

this is what Series/Index constructors do. About the most complicated code that exists.

@jorisvandenbossche jorisvandenbossche force-pushed the perf-mi-get_loc-hash branch 3 times, most recently from fcaa5bb to afa8775 Compare May 15, 2017 17:37
else:
vals = np.array([val])

if vals.dtype == np.object_:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should just call infer_dtype_from_scalar instead of reinveting the wheel

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you can only just call that if you know it exists :-)

But thanks for the pointer, that is exactly I was hoping already existed.

I made a slight adaptation (extra keyword) to get the behaviour I need here, but I can also put that in he hash_scalar method instead of in infer_dtype_from_scalar (but, there it is a bit easier to do).

@@ -79,6 +79,23 @@ def test_hash_tuples(self):
result = hash_tuples(tups[0])
assert result == expected[0]

def test_hash_tuple(self):
# test equivalence between hash_tuples and hash_tuple
for tup in [(1, 'one'), (1, np.nan)]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test with pd.Nat as well

@@ -333,7 +333,7 @@ def maybe_promote(dtype, fill_value=np.nan):
return dtype, fill_value


def infer_dtype_from_scalar(val, pandas_dtype=False):
def infer_dtype_from_scalar(val, pandas_dtype=False, use_datetimetz=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need this extra parameter, instead you can pass pandas_dtype=True . what is the issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas_dtype=True does not return a np.datetime64 but our DatetimeTZDtype, and further then also Periods get converted, which I also don't need.
So the combination I need, i.e. tz-aware timestamps to its numpy dtype instead of as pandas extension type or as object, but keep Periods as objects, is not possible with the current options. This is a bit a strange combination, but that's the consequence of how those are returned from the Index values (which is as datetime64 without tz but Periods as objects), which is how they are hashed.

But indeed, if we add this, ignore_tz is probably a better name

if isnull(val):
# this is to be consistent with the _hash_categorical implementation
return np.array([np.iinfo(np.uint64).max], dtype='u8')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you need to handle datetime w/tz directly (IOW, we basically ignore it), then I would.

if getattr(val, 'tzinfo', None) is not None:
    val = val.tz_localize(None)

I suppose an option to ignore tz is fine for infer_dtype_from_scalar, but if you add it I would rename, document and test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I can certainly do that check here as well.

That is maybe better to keep the custom logic here, as the keyword added to infer_dtype_from_scalar would not be used anywhere else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

he i think better locally

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented May 16, 2017

Ran the benchmarks with the latest version of this PR:

Against master:

$ asv continuous upstream/master HEAD -b time_multiindex
...
     before      after     ratio
  [0ea0f25b] [c29dab95]
-     2.73s   298.28ms      0.11  indexing.MultiIndexing.time_multiindex_large_get_loc_warm
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Against 0.19.2 (which combines slowdown/improvement of the several PRs), also gives an improvement now for the that benchmark:

$ asv continuous v0.19.2 HEAD -b time_multiindex
...
    before     after       ratio
  [6f525eec] [c29dab95]
-  744.29ms   334.65ms      0.45  indexing.MultiIndexing.time_multiindex_large_get_loc_warm
-  716.03ms   175.14ms      0.24  indexing.MultiIndexing.time_multiindex_large_get_loc
-  736.39ms   166.29ms      0.23  indexing.MultiIndexing.time_multiindex_get_indexer
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

# for tz-aware datetimes, we need the underlying naive UTC value and
# not the tz aware object or pd extension type (as
# infer_dtype_from_scalar would do)
if not isinstance(val, tslib.Timestamp):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #16372 (for later)

@jreback jreback added this to the 0.20.2 milestone May 16, 2017
@jreback
Copy link
Contributor

jreback commented May 16, 2017

@jorisvandenbossche lgtm. merge when ready (for 0.20.2) is good.

@jreback jreback closed this May 16, 2017
@jreback jreback reopened this May 16, 2017
@jorisvandenbossche jorisvandenbossche merged commit 34ebad8 into pandas-dev:master May 17, 2017
@jorisvandenbossche
Copy link
Member Author

@jreback thanks for the review!

@jreback
Copy link
Contributor

jreback commented May 17, 2017

nice fix @jorisvandenbossche !

pawroman added a commit to pawroman/pandas that referenced this pull request May 18, 2017
* upstream/master: (48 commits)
  BUG: Categorical comparison with unordered (pandas-dev#16339)
  ENH: Adding 'protocol' parameter to 'to_pickle'.
  PERF: improve MultiIndex get_loc performance (pandas-dev#16346)
  TST: remove pandas-datareader xfail as 0.4.0 works (pandas-dev#16374)
  TST: followup to pandas-dev#16364, catch errstate warnings (pandas-dev#16373)
  DOC: new oauth token
  TST: Add test for clip-na (pandas-dev#16369)
  ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)
  MAINT: Add .iml to .gitignore (pandas-dev#16368)
  BUG/API: Categorical constructor scalar categories (pandas-dev#16340)
  ENH: Provide dict object for to_dict() pandas-dev#16122 (pandas-dev#16220)
  PERF: improved clip performance (pandas-dev#16364)
  DOC: try new token for docs
  DOC: try with new secure token
  DOC: add developer section to the docs
  DEPS: Drop Python 3.4 support (pandas-dev#16303)
  DOC: remove credential helper
  DOC: force fetch on build docs
  DOC: redo dev docs access token
  DOC: add dataframe construction in merge_asof example (pandas-dev#16348)
  ...
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request May 29, 2017
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples
(cherry picked from commit 34ebad8)
TomAugspurger pushed a commit that referenced this pull request May 30, 2017
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples
(cherry picked from commit 34ebad8)
stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples
@jorisvandenbossche jorisvandenbossche deleted the perf-mi-get_loc-hash branch November 27, 2017 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants