PERF: cache sorted data in GroupBy? #51077
Labels: Enhancement, Groupby, Internals (related to non-user accessible pandas implementation), Performance (memory or execution speed performance)
When we do a groupby transform or reduction that has to operate group by group, we construct a sorted DataFrame or Series so that we can iterate over it efficiently. That sorted object is cached within a DataSplitter instance, but the splitter itself is not cached. IIUC we could avoid repeating the sort by caching the DataSplitter, at the possible cost of having a sorted copy hang around longer than we might want.
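To make the caching idea concrete, here is a minimal toy sketch in plain Python. `ToySplitter` and `ToyGroupBy` are hypothetical stand-ins invented for illustration, not the pandas classes, and the real `DataSplitter` API may look quite different. The point is only that caching the splitter on the groupby object lets repeated operations share one sort.

```python
from functools import cached_property


class ToySplitter:
    """Hypothetical stand-in for DataSplitter (not the pandas class)."""

    def __init__(self, values, codes):
        self.values = values
        self.codes = codes
        self.sort_count = 0  # counts how often the sort actually runs

    @cached_property
    def sorted_data(self):
        # Stable argsort of the group codes, then reorder the values.
        self.sort_count += 1
        order = sorted(range(len(self.codes)), key=self.codes.__getitem__)
        return [self.values[i] for i in order]


class ToyGroupBy:
    """Hypothetical stand-in for a GroupBy object."""

    def __init__(self, values, codes):
        self.values = values
        self.codes = codes

    @cached_property
    def splitter(self):
        # Caching the splitter means its cached sorted_data is reused
        # across operations instead of being rebuilt each time.
        return ToySplitter(self.values, self.codes)


gb = ToyGroupBy(["a", "b", "c", "d"], [1, 0, 1, 0])
first = gb.splitter.sorted_data
second = gb.splitter.sorted_data
```

The trade-off mentioned above shows up here too: the sorted copy stays alive for as long as `gb` does.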
Also, we have a separate construct-a-sorted-object path in _numba_prep that might be able to reuse some of this code.
Final thought: we could check in DataSplitter.sorted_data whether _sort_idx is monotonic increasing, in which case the DataFrame or Series is already sorted by group and we don't need to make a copy.
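A rough sketch of that check, again in plain Python with a hypothetical helper name (`sorted_view_or_copy` is not a pandas function). It assumes the sort index is a permutation of `range(len(values))`, so a strictly increasing index can only be the identity permutation and the reorder can be skipped.

```python
def sorted_view_or_copy(values, sort_idx):
    """Return (data, copied). Hypothetical helper, not pandas API.

    Assumes sort_idx is a permutation of range(len(values)).
    """
    # A strictly increasing permutation is the identity, so the data
    # is already in group order and can be returned without copying.
    already_sorted = all(a < b for a, b in zip(sort_idx, sort_idx[1:]))
    if already_sorted:
        return values, False
    return [values[i] for i in sort_idx], True


values = [10, 20, 30]
same, copied_same = sorted_view_or_copy(values, [0, 1, 2])
reordered, copied_reordered = sorted_view_or_copy(values, [2, 0, 1])
```

The check itself is a linear pass over the index, which should be cheap compared with materializing a reordered copy of the data.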