⚡️ Speed up method Categorical.equals by 1,129%
#103
📄 1,129% (11.29x) speedup for `Categorical.equals` in `pandas/core/arrays/categorical.py`
⏱️ Runtime: 155 microseconds → 12.6 microseconds (best of 17 runs)
📝 Explanation and details
The optimized code achieves a roughly 12x speedup by eliminating expensive hash computations and avoiding unnecessary object allocations in the `equals` method.

Key optimizations:
- Early reference equality check: Added `if self is other: return True` to return immediately for identical objects, a common case in pandas workflows where the same Categorical is compared to itself.
- Avoided expensive hash computation: The original code called `hash(self.dtype) == hash(other.dtype)`, which is computationally expensive. The optimized version:
  - checks dtype identity first (`self_dtype is other_dtype`).
  - compares the `ordered` fields and uses `Index.equals()` for the categories comparison.
- Inlined recoding logic: Instead of calling `self._encode_with_my_categories(other)`, which creates a temporary Categorical object, the optimized version calls `recode_for_categories()` directly and compares codes, eliminating the object allocation overhead.
- Optimized `_categories_match_up_to_permutation`: Similarly avoids hash computation by doing direct field comparisons first.

Performance characteristics: These optimizations are particularly effective for cases such as self-comparison (`cat.equals(cat)`).

The line profiler shows that the original `hash()` call took 196ms (100% of time), while the optimized version's field comparisons take only 27ms (70% of total time), with the overall method running roughly 12x faster.

✅ Correctness verification report:
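As a rough illustration of the comparison order described in the explanation, here is a minimal pure-Python sketch. It is not the pandas implementation: `ToyDtype`, `ToyCategorical`, `recode`, and `toy_equals` are hypothetical stand-ins invented for this example, and tuples of strings take the place of a pandas `Index`. It only shows the idea of trying cheap checks (object identity, dtype identity, field comparison) before the more expensive recoding step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToyDtype:
    """Hypothetical stand-in for a categorical dtype (not pandas' CategoricalDtype)."""
    categories: Tuple[str, ...]  # assumed unique, as categories are in pandas
    ordered: bool = False


@dataclass
class ToyCategorical:
    """Hypothetical stand-in: integer codes index into dtype.categories; -1 is missing."""
    codes: List[int]
    dtype: ToyDtype


def recode(codes, old_categories, new_categories):
    """Translate codes that index old_categories into codes that index new_categories."""
    position = {cat: i for i, cat in enumerate(new_categories)}
    return [-1 if c == -1 else position[old_categories[c]] for c in codes]


def toy_equals(left, right):
    # 1. Cheapest check first: the very same object is trivially equal.
    if left is right:
        return True
    if not isinstance(right, ToyCategorical):
        return False

    ld, rd = left.dtype, right.dtype
    # 2. Skip all field comparisons when both sides share one dtype object.
    if ld is not rd:
        # 3. Compare fields directly instead of hashing the dtypes.
        if ld.ordered != rd.ordered:
            return False
        if ld.ordered:
            # Ordered categoricals must agree on category order as well.
            if ld.categories != rd.categories:
                return False
        elif sorted(ld.categories) != sorted(rd.categories):
            # Unordered categoricals may differ only by a permutation.
            return False

    # 4. Same category order: the codes can be compared as they are.
    if ld.categories == rd.categories:
        return left.codes == right.codes
    # 5. Otherwise recode the right-hand codes without building a temporary object.
    return left.codes == recode(right.codes, rd.categories, ld.categories)


a = ToyCategorical([0, 1, 0], ToyDtype(("x", "y")))
b = ToyCategorical([1, 0, 1], ToyDtype(("y", "x")))  # same values, permuted categories
c = ToyCategorical([0, 1, 0], ToyDtype(("x", "y"), ordered=True))
```

In this sketch `toy_equals(a, a)` returns at step 1, `toy_equals(a, b)` reaches the recoding step, and `toy_equals(a, c)` stops at the `ordered` comparison. The real method operates on NumPy code arrays and pandas `Index` objects, so the relative costs there will differ from this toy version.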
⚙️ Existing Unit Tests and Runtime
`arrays/categorical/test_operators.py::TestCategoricalOps.test_compare_unordered_different_order`

To edit these changes, `git checkout codeflash/optimize-Categorical.equals-mhby1z89` and push.