⚡️ Speed up function correlation by 31,656%
#180
📄 31,656% (316.56x) speedup for `correlation` in `src/statistics/descriptive.py`
⏱️ Runtime: 1.39 seconds → 4.38 milliseconds (best of 250 runs)
📝 Explanation and details
The optimized code achieves a remarkable 316x speedup by replacing inefficient row-by-row DataFrame access with vectorized NumPy operations.
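For contrast, here is a hypothetical sketch of the row-by-row pattern being replaced (the PR's original source isn't shown here, so the function name `correlation_slow` and its exact structure are illustrative only):

```python
# Hypothetical sketch of the slow pattern: one df.iloc lookup per cell
# inside nested loops, with per-pair stats computed via Python sum().
import pandas as pd

def correlation_slow(df: pd.DataFrame) -> dict:
    cols = list(df.select_dtypes("number").columns)
    result = {}
    for a in cols:
        for b in cols:
            xs, ys = [], []
            for k in range(len(df)):
                # Per-cell .iloc access: the line profiler attributed
                # 78.7% of total runtime to calls like these.
                x, y = df.iloc[k][a], df.iloc[k][b]
                if not (pd.isna(x) or pd.isna(y)):
                    xs.append(x)
                    ys.append(y)
            n = len(xs)
            if n == 0:
                result[(a, b)] = float("nan")
                continue
            mx, my = sum(xs) / n, sum(ys) / n
            cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
            vx = sum((x - mx) ** 2 for x in xs) / n
            vy = sum((y - my) ** 2 for y in ys) / n
            result[(a, b)] = cov / (vx * vy) ** 0.5 if vx and vy else float("nan")
    return result
```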
Key optimizations (a combined sketch follows the list):

- **Pre-extraction of data arrays:** Instead of repeatedly calling `df.iloc[k][col]` for each row (which is extremely slow), the code extracts all numeric columns as NumPy arrays upfront using `df[col].to_numpy()`. This eliminates the major bottleneck visible in the line profiler, where `df.iloc` calls consumed 78.7% of execution time.
- **Vectorized NaN detection:** Rather than checking `pd.isna()` for each individual cell in nested loops, it pre-computes boolean masks using `np.isnan()` for entire columns, then uses logical operations (`~(isnan_i | isnan_j)`) to find valid row pairs.
- **Boolean masking for data selection:** Uses NumPy's boolean indexing (`arr_i[valid_mask]`) to extract only the valid data points for each column pair, eliminating the need to build Python lists element by element.
- **Batch statistical calculations:** All statistical computations (mean, variance, covariance) now use `np.sum()` on arrays instead of Python's `sum()` on lists, leveraging NumPy's optimized C implementations.

The line profiler shows the original code spent most of its time in DataFrame access operations, while the optimized version spreads computation more evenly across NumPy operations. This optimization is particularly effective for the test cases involving large DataFrames (1,000+ rows), where vectorized operations show their greatest advantage over element-wise Python loops.
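Putting the four techniques together, a minimal sketch of the vectorized approach might look like this, assuming the function computes a pairwise Pearson correlation table over numeric columns; the real `correlation` in `descriptive.py` may differ in naming and output shape:

```python
import numpy as np
import pandas as pd

def correlation_fast(df: pd.DataFrame) -> dict:
    cols = list(df.select_dtypes("number").columns)
    # 1. Pre-extraction: pull each column out as a float array once,
    #    instead of touching df.iloc inside the loops.
    arrays = {c: df[c].to_numpy(dtype=float) for c in cols}
    # 2. Vectorized NaN detection: one boolean mask per column, computed once.
    nan_masks = {c: np.isnan(arrays[c]) for c in cols}
    result = {}
    for a in cols:
        for b in cols:
            # Rows where both columns are valid: ~(isnan_a | isnan_b).
            valid = ~(nan_masks[a] | nan_masks[b])
            # 3. Boolean masking selects all valid pairs in one shot.
            x, y = arrays[a][valid], arrays[b][valid]
            n = x.size
            if n == 0:
                result[(a, b)] = float("nan")
                continue
            # 4. Batch statistics: np.sum over whole arrays replaces
            #    Python-level sum() over lists.
            mx, my = np.sum(x) / n, np.sum(y) / n
            cov = np.sum((x - mx) * (y - my)) / n
            vx = np.sum((x - mx) ** 2) / n
            vy = np.sum((y - my) ** 2) / n
            # Zero-variance columns yield NaN, mirroring the edge-case
            # handling the explanation says was preserved.
            result[(a, b)] = (cov / np.sqrt(vx * vy)
                              if vx > 0 and vy > 0 else float("nan"))
    return result
```

Converting each column and its NaN mask exactly once, outside the pair loop, is what removes the per-cell interpreter overhead: the remaining per-pair work runs entirely inside NumPy's C loops.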
The correlation computation logic and handling of edge cases (NaNs, zero variance) remain identical, ensuring full behavioral compatibility.
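If the semantics are indeed pairwise Pearson correlation with NaN exclusion (an assumption, not confirmed by this PR text), a sketch like the one above can be sanity-checked against pandas' built-in `DataFrame.corr()`, which computes the same quantity:

```python
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [2.0, 1.0, 3.0, 5.0]})
fast = correlation_fast(df)
# Population vs. sample normalization cancels in the ratio, so the
# Pearson r values should agree to floating-point precision.
assert abs(fast[("a", "b")] - df.corr().loc["a", "b"]) < 1e-12
```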
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-correlation-midsvju6` and push.