Commit d5f16e8
authored
Optimize correlation
The optimized code achieves a remarkable **316x speedup** by replacing inefficient row-by-row DataFrame access with vectorized NumPy operations.
**Key optimizations:**
1. **Pre-extraction of data arrays**: Instead of repeatedly calling `df.iloc[k][col]` for each row (which is extremely slow), the code extracts all numeric columns as NumPy arrays upfront using `df[col].to_numpy()`. This eliminates the major bottleneck visible in the line profiler where `df.iloc` calls consumed 78.7% of execution time.
2. **Vectorized NaN detection**: Rather than checking `pd.isna()` for each individual cell in nested loops, it pre-computes boolean masks using `np.isnan()` for entire columns, then uses logical operations (`~(isnan_i | isnan_j)`) to find valid row pairs.
3. **Boolean masking for data selection**: Uses NumPy's boolean indexing (`arr_i[valid_mask]`) to extract only the valid data points for each column pair, eliminating the need to build Python lists element by element.
4. **Batch statistical calculations**: All statistical computations (mean, variance, covariance) now use `np.sum()` on arrays instead of Python's `sum()` on lists, leveraging NumPy's optimized C implementations.
The line profiler shows the original code spent most time in DataFrame access operations, while the optimized version spreads computation more evenly across NumPy operations. This optimization is particularly effective for the test cases involving large DataFrames (1000+ rows), where vectorized operations show their greatest advantage over element-wise Python loops.
The correlation computation logic and handling of edge cases (NaNs, zero variance) remain identical, ensuring full behavioral compatibility.1 parent e776522 commit d5f16e8
1 file changed
+24
-20
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
47 | 47 | | |
48 | 48 | | |
49 | 49 | | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
50 | 55 | | |
51 | 56 | | |
| 57 | + | |
| 58 | + | |
52 | 59 | | |
53 | 60 | | |
54 | | - | |
55 | | - | |
56 | | - | |
57 | | - | |
58 | | - | |
59 | | - | |
60 | | - | |
61 | | - | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
62 | 66 | | |
63 | 67 | | |
64 | | - | |
65 | | - | |
66 | | - | |
67 | | - | |
68 | | - | |
69 | | - | |
70 | | - | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
71 | 78 | | |
72 | 79 | | |
73 | | - | |
74 | | - | |
75 | | - | |
76 | | - | |
77 | | - | |
| 80 | + | |
| 81 | + | |
78 | 82 | | |
79 | 83 | | |
0 commit comments