Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: avoid unnecessary method call in get_indexer_non_unique() on MultiIndex #55811

Merged

Conversation

morotti
Copy link
Contributor

@morotti morotti commented Nov 3, 2023

Hello,

There was a regression that made reindex 60x slower, in pandas v0.23 to v1.4, that was finally fixed by this PR last year
discussion #23735
commit https://github.com/pandas-dev/pandas/pull/46235/files
a benchmark was added in a later commit https://github.com/pandas-dev/pandas/pull/47221/files

That resolved the issue in get_indexer().
I was going through pandas source code and I noticed there is a second code path in get_indexer_non_unique() with the same problem.
This PR fixes the remaining code path.

I'm not super familiar with pandas, I'm not sure what might hit the default indexer vs the non unique indexer.
It would be great if you can think of something and add a benchmark.
It looks like the exact same bug and nearly the same code to me, it might just be a 60x performance improvement too 😄

@jreback
@lukemanley
@jbrockmendel

Cheers.

@morotti morotti force-pushed the perf-reindex-non-unique-multiindex branch from b0b4b49 to ebfff8e Compare November 3, 2023 13:42
@mroeschke mroeschke requested a review from lukemanley November 3, 2023 17:04
@mroeschke mroeschke added Performance Memory or execution speed performance MultiIndex labels Nov 3, 2023
@morotti
Copy link
Contributor Author

morotti commented Nov 6, 2023

Thank you,

I've done some further analysis comparing to the previous PR that fixed the first code path.
There was a batch of PRs that were merged around the same time and fixed the performance issue in 1.4, all together there was a 100x slowdown on reindex (and get_indexer) but that doesn't seem to affect get_indexer_non_unique.

I think the PR is good to merge to avoid the extra function call, though it hardly makes any difference on performance.

on a side note, there appear to be a performance regression in reindex between 1.5.3 and 2.0.3 by as much as 10%
it's quite small and close to the noise between runs.

reindex benchmark

pandas 0.22 = 0.020 seconds
pandas 1.0 = 1.431 seconds
pandas 1.3 = 2.164 seconds
pandas 1.5 = 0.020 seconds
pandas 1.5 with PR = 0.020 seconds
pandas 2.0 = 0.023 seconds
pandas 2.0 with PR  0.023 seconds
get_index_non_unique benchmark
pandas 0.22 = 0.550 seconds
pandas 1.0 = 0.071 seconds
pandas 1.3 = 0.074 seconds
pandas 1.5 = 0.063 seconds
pandas 1.5 with PR = 0.063 seconds
pandas 2.0 = 0.064 seconds
pandas 2.0 with PR = 0.064 seconds

(ignore the 1ms difference that is noise between runs)

# benchmark for reindex and get_indexer
import pandas as pd
import time
import numpy as np


if __name__ == '__main__':
    n_days = 300
    dr = pd.date_range(end="20181118", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(1440)])

    v = np.random.randn(len(mi))
    mask = np.random.rand(len(v)) < .3
    v[mask] = np.nan
    s = pd.Series(v, index=mi)
    s = s.sort_index()

    s2 = s.dropna()

    start = time.time()

    s2.reindex(index=s.index)

    end = time.time()
    print("pandas version: %s" % pd.__version__)
    print("reindex took %s seconds" % (end - start))
# benchmark for non-unique
import pandas as pd
import time
import numpy as np


if __name__ == '__main__':
    N = 1_000_000

    mi1 = pd.MultiIndex.from_arrays([np.ones(N), np.ones(N)])
    mi2 = pd.MultiIndex.from_tuples([(1, 1)])

    start = time.time()

    mi1.get_indexer_non_unique(mi2)

    end = time.time()
    print("pandas version: %s" % pd.__version__)
    print("reindex took %s seconds" % (end - start))

@lukemanley
Copy link
Member

Thanks. Can you remove the whatsnew note in that case and maybe retitle this PR to something like: "CLN: avoid unnecessary method call in ..."

@lukemanley lukemanley added Clean and removed Performance Memory or execution speed performance labels Nov 6, 2023
@morotti morotti force-pushed the perf-reindex-non-unique-multiindex branch from ebfff8e to 7ca470d Compare November 7, 2023 20:14
@morotti morotti changed the title PERF: yet another 60x improvements to reindex on MultiIndex PERF: avoid unnecessary method call in get_indexer_non_unique() on MultiIndex Nov 7, 2023
@morotti
Copy link
Contributor Author

morotti commented Nov 7, 2023

adjusted commits and PR titles

@mroeschke mroeschke added this to the 2.2 milestone Nov 7, 2023
@mroeschke mroeschke merged commit 51f3d03 into pandas-dev:main Nov 7, 2023
@mroeschke
Copy link
Member

Thanks @morotti

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64
3 participants