ENH/PERF: dispatch is_monotonic_increasing / decreasing ? #56619

Open
lukemanley opened this issue Dec 25, 2023 · 4 comments
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance

Comments

@lukemanley
Member

Is it worth dispatching is_monotonic_increasing / is_monotonic_decreasing for EAs?

The Cython implementation is early-stopping, but the benefit disappears if the data needs to be copied into an object array, as in the example below:

import pandas as pd
import pyarrow.compute as pc

values = [f"val_{i:07}" for i in range(1_000_000)]
ser = pd.Series(values, dtype="string[pyarrow_numpy]")

%timeit ser.is_monotonic_increasing
# 219 ms ± 20.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pc.all(pc.greater_equal(ser.array._pa_array[1:], ser.array._pa_array[:-1]))
# 19.2 ms ± 585 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Random-order example, where the Cython implementation stops early:

ser2 = ser.sample(frac=1.0)

%timeit ser2.is_monotonic_increasing
# 152 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pc.all(pc.greater_equal(ser2.array._pa_array[1:], ser2.array._pa_array[:-1]))
# 15 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@lukemanley lukemanley added Performance Memory or execution speed performance ExtensionArray Extending pandas with custom dtypes or arrays. labels Dec 25, 2023
@rhshadrach rhshadrach added the Needs Discussion Requires discussion from core team before further action label Dec 26, 2023
@phofl
Member

phofl commented Dec 27, 2023

+1 for anything that's non-numeric

How does this look if we convert to numpy without copying? I think similar but want to be sure

@jbrockmendel
Member

+1. e.g. pyarrow EAs would be able to cache it. And ExtensionEngine has a hack going through _rank.

@lukemanley
Member Author

lukemanley commented Dec 30, 2023

> pyarrow EAs would be able to cache it.

@jbrockmendel - that's an interesting idea. I assume the same could be done for is_unique. I suspect the current mutability (e.g. setitem) would complicate the caching a bit?

@jbrockmendel
Member

yah calling __setitem__ would have to invalidate the cache. though i imagine that a lot of the relevant info is cached on the pa.ChunkedArray object
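
A rough sketch of the cache-and-invalidate idea discussed above, in plain Python. Everything here is hypothetical (the class `CachedMonotonicArray` and its attributes are illustrative only and do not reflect how pandas or pyarrow store data); it just shows the flag being computed lazily and dropped on `__setitem__`.

```python
class CachedMonotonicArray:
    """Hypothetical wrapper illustrating a cached monotonicity flag."""

    def __init__(self, values):
        self._values = list(values)
        self._is_mono_inc = None  # None means "not computed yet"

    def __getitem__(self, idx):
        return self._values[idx]

    def __setitem__(self, idx, value):
        self._values[idx] = value
        # Any in-place write may change the ordering, so drop the cached result.
        self._is_mono_inc = None

    @property
    def is_monotonic_increasing(self):
        if self._is_mono_inc is None:
            vals = self._values
            # all() over a generator stops at the first out-of-order pair.
            self._is_mono_inc = all(a <= b for a, b in zip(vals, vals[1:]))
        return self._is_mono_inc
```

The same pattern would presumably apply to a cached is_unique flag, with the same invalidation point.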
