
PERF: cache sorted data in GroupBy? #51077

Open · jbrockmendel opened this issue Jan 31, 2023 · 3 comments
Labels: Enhancement, Groupby, Internals, Performance

jbrockmendel (Member) commented Jan 31, 2023

When we do a groupby transform/reduce that requires operating group-by-group, we construct a sorted (DataFrame|Series) so that we can iterate over it efficiently. That construction is cached within a DataSplitter class, but the splitter itself is not cached. IIUC we can get some mileage by caching the DataSplitter, at the possible cost of having a copy hang around longer than we might want.
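For illustration, a rough sketch of what caching the splitter (and, through it, the sorted copy) could look like. The names are modeled on the wording above (sort_idx, sorted_data, splitter), but the classes and signatures here are hypothetical, not pandas' actual internals:

```python
from functools import cached_property

import numpy as np
import pandas as pd


class SplitterSketch:
    """Holds the object being grouped plus the argsort of the group codes,
    and builds the sorted copy lazily (hypothetical stand-in for DataSplitter)."""

    def __init__(self, obj: pd.DataFrame, group_codes: np.ndarray) -> None:
        self.obj = obj
        self.group_codes = group_codes

    @cached_property
    def sort_idx(self) -> np.ndarray:
        # Stable argsort of the group codes (analogous to _sort_idx).
        return np.argsort(self.group_codes, kind="stable")

    @cached_property
    def sorted_data(self) -> pd.DataFrame:
        # The sorted copy we would rather not rebuild on every operation.
        return self.obj.take(self.sort_idx)


class GrouperSketch:
    """Caching the splitter itself means the sorted copy survives across
    transform/reduce calls, at the cost of keeping it alive longer."""

    def __init__(self, obj: pd.DataFrame, group_codes: np.ndarray) -> None:
        self.obj = obj
        self.group_codes = group_codes

    @cached_property
    def splitter(self) -> SplitterSketch:
        return SplitterSketch(self.obj, self.group_codes)
```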

Also we have a separate construct-a-sorted-object path in _numba_prep that might be able to re-use some code.

Final thought: we could check in DataSplitter.sorted_data whether _sort_idx is monotonic, in which case the (DataFrame|Series) is already sorted and we don't need to make a copy.
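A minimal sketch of that short-circuit, assuming _sort_idx is a permutation produced by argsorting the group codes (so "monotonic" means it is the identity indexer and the object is already in group order):

```python
import numpy as np
import pandas as pd


def sorted_data_sketch(obj: pd.DataFrame, sort_idx: np.ndarray) -> pd.DataFrame:
    # sort_idx is a permutation, so monotonic non-decreasing means it is the
    # identity indexer: obj is already in group order and the copy can be skipped.
    if (np.diff(sort_idx) >= 0).all():
        return obj
    return obj.take(sort_idx)
```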

jbrockmendel added the Bug and Needs Triage labels on Jan 31, 2023
rhshadrach added the Enhancement, Groupby, Performance, and Internals labels and removed the Needs Triage and Bug labels on Jan 31, 2023
jbrockmendel (Member, Author) commented:

If we iron out selected_obj vs obj_with_exclusions, we might be able to go further and cache sorted_data. There are a few other arguments that can be passed to get_splitter, but I think we can plausibly get that down to just one.
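Purely as an illustration of that point, once get_splitter is keyed only by which object is being split (selected_obj vs obj_with_exclusions), the sorted frame itself could be memoized per object. The id()-based keying below is a naive stand-in for whatever invalidation would actually be wanted, and the names are hypothetical:

```python
import numpy as np
import pandas as pd


class SortedDataCacheSketch:
    """Memoize one sorted copy per object being split, keyed naively by id()."""

    def __init__(self, group_codes: np.ndarray) -> None:
        self.sort_idx = np.argsort(group_codes, kind="stable")
        self._sorted: dict[int, pd.DataFrame] = {}

    def sorted_data(self, obj: pd.DataFrame) -> pd.DataFrame:
        key = id(obj)
        if key not in self._sorted:
            self._sorted[key] = obj.take(self.sort_idx)
        return self._sorted[key]
```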

lithomas1 (Member) commented:

@jbrockmendel

Tangentially related, but do you think we might be able to avoid materializing the entire sorted DataFrame in DataSplitter.__iter__?

IIUC, since DataSplitter takes in the argsorted indices, instead of materializing the whole sorted frame with take in __iter__, could we slice the argsort indices with each group's start/end and do a take from the DataFrame with just those indices?
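A self-contained sketch of that per-group take, assuming non-negative integer group codes; the boundary bookkeeping mirrors the start/end offsets the splitter already computes, and none of this is pandas' actual implementation:

```python
import numpy as np
import pandas as pd


def iter_groups_sketch(obj: pd.DataFrame, group_codes: np.ndarray, ngroups: int):
    """Yield one sub-frame per group without materializing the fully sorted
    frame up front."""
    # Stable argsort so rows within a group keep their original order.
    sort_idx = np.argsort(group_codes, kind="stable")
    # Rows for group i occupy sort_idx[bounds[i]:bounds[i + 1]].
    counts = np.bincount(group_codes, minlength=ngroups)
    bounds = np.concatenate(([0], np.cumsum(counts)))
    for i in range(ngroups):
        start, end = bounds[i], bounds[i + 1]
        # Take only this group's rows straight from the original frame.
        yield obj.take(sort_idx[start:end])
```

Whether many small takes beat one up-front take probably depends on group sizes, since each take call has its own overhead.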

I could take a closer look at this if you think it's the right approach.

jbrockmendel (Member, Author) commented:

Worth a shot.

BTW I think we can avoid a take altogether in already-sorted cases by checking if sort_idx is monotonic.
