@codeflash-ai codeflash-ai bot commented Nov 24, 2025

📄 31,656% (316.56x) speedup for correlation in src/statistics/descriptive.py

⏱️ Runtime : 1.39 seconds → 4.38 milliseconds (best of 250 runs)

📝 Explanation and details

The optimized code achieves a remarkable 316x speedup by replacing inefficient row-by-row DataFrame access with vectorized NumPy operations.

Key optimizations:

  1. Pre-extraction of data arrays: Instead of repeatedly calling df.iloc[k][col] for each row (which is extremely slow), the code extracts all numeric columns as NumPy arrays upfront using df[col].to_numpy(). This eliminates the major bottleneck visible in the line profiler where df.iloc calls consumed 78.7% of execution time.

  2. Vectorized NaN detection: Rather than checking pd.isna() for each individual cell in nested loops, it pre-computes boolean masks using np.isnan() for entire columns, then uses logical operations (~(isnan_i | isnan_j)) to find valid row pairs.

  3. Boolean masking for data selection: Uses NumPy's boolean indexing (arr_i[valid_mask]) to extract only the valid data points for each column pair, eliminating the need to build Python lists element by element.

  4. Batch statistical calculations: All statistical computations (mean, variance, covariance) now use np.sum() on arrays instead of Python's sum() on lists, leveraging NumPy's optimized C implementations.
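The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `pairwise_correlation_sketch`, the dict-of-pairs return shape, and the use of `select_dtypes` to pick numeric columns are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def pairwise_correlation_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical sketch: Pearson correlation for every pair of numeric columns."""
    numeric_cols = list(df.select_dtypes(include="number").columns)

    # Step 1: extract each numeric column once as a float NumPy array.
    arrays = {col: df[col].to_numpy(dtype=float) for col in numeric_cols}
    # Step 2: compute one NaN mask per column instead of checking cell by cell.
    nan_masks = {col: np.isnan(arrays[col]) for col in numeric_cols}

    result = {}
    for col_i in numeric_cols:
        for col_j in numeric_cols:
            # Step 3: keep only rows where both columns hold a value.
            valid_mask = ~(nan_masks[col_i] | nan_masks[col_j])
            x = arrays[col_i][valid_mask]
            y = arrays[col_j][valid_mask]
            n = x.size
            if n == 0:
                result[(col_i, col_j)] = float("nan")
                continue

            # Step 4: mean, variance, and covariance via np.sum on arrays.
            dx = x - np.sum(x) / n
            dy = y - np.sum(y) / n
            var_x = np.sum(dx * dx) / n
            var_y = np.sum(dy * dy) / n
            cov = np.sum(dx * dy) / n

            denom = np.sqrt(var_x * var_y)
            # Zero variance makes the correlation undefined.
            result[(col_i, col_j)] = float("nan") if denom == 0 else float(cov / denom)
    return result
```

The Python-level loop over column pairs remains, but the per-row work inside it runs in NumPy, which is where the original implementation spent most of its time.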

The line profiler shows the original code spent most time in DataFrame access operations, while the optimized version spreads computation more evenly across NumPy operations. This optimization is particularly effective for the test cases involving large DataFrames (1,000 rows in the generated tests), where vectorized operations show their greatest advantage over element-wise Python loops.
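To see the access-pattern difference in isolation, the following sketch reads one column both ways and times each. The DataFrame contents and the helper names `read_row_by_row` and `read_vectorized` are made up for illustration, and the timings will vary by machine, so no specific numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data: 1,000 rows, two float columns.
df = pd.DataFrame(
    {
        "a": np.arange(1000, dtype=float),
        "b": np.arange(1000, dtype=float) * 2.0,
    }
)


def read_row_by_row(frame: pd.DataFrame, col: str) -> list:
    # Each frame.iloc[k] builds a new Series for row k before indexing into it.
    return [frame.iloc[k][col] for k in range(len(frame))]


def read_vectorized(frame: pd.DataFrame, col: str) -> np.ndarray:
    # A single call returns the whole column as a NumPy array.
    return frame[col].to_numpy()


slow_values = read_row_by_row(df, "a")
fast_values = read_vectorized(df, "a")

slow_seconds = timeit.timeit(lambda: read_row_by_row(df, "a"), number=3)
fast_seconds = timeit.timeit(lambda: read_vectorized(df, "a"), number=3)
print(f"row-by-row: {slow_seconds:.4f}s, vectorized: {fast_seconds:.6f}s")
```

Both approaches return the same values; only the cost of getting them differs.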

The correlation computation logic and handling of edge cases (NaNs, zero variance) remain identical, ensuring full behavioral compatibility.
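One practical detail when checking that claim: `NaN != NaN` in Python, so a plain `==` comparison of two results reports a mismatch whenever an edge case legitimately yields NaN. A NaN-aware comparison could look like the sketch below. The helper name `results_match` and the assumption that each result is a flat dict of floats are hypothetical, since the return type of `correlation` is not shown in this PR.

```python
import math


def results_match(original: dict, optimized: dict, rel_tol: float = 1e-9) -> bool:
    """Hypothetical helper: compare two flat dicts of floats, treating NaN as equal to NaN."""
    if original.keys() != optimized.keys():
        return False
    for key, expected in original.items():
        actual = optimized[key]
        if math.isnan(expected) or math.isnan(actual):
            # Both sides must be NaN for the entry to count as a match.
            if not (math.isnan(expected) and math.isnan(actual)):
                return False
        elif not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=1e-12):
            # Allow tiny floating-point differences from a changed summation order.
            return False
    return True
```

The tolerance matters because `np.sum()` and Python's `sum()` can accumulate floating-point error differently, so results may agree only to within rounding.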

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 39 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import math

import pandas as pd

# imports
import pytest  # used for our unit tests
from src.statistics.descriptive import correlation

# unit tests

# --- Basic Test Cases ---


def test_perfect_positive_correlation():
    # Two columns perfectly correlated (y = x)
    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 3, 4]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_perfect_negative_correlation():
    # y = -x
    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_zero_correlation():
    # Column b is constant, so its variance is zero; this exercises the
    # zero-variance edge case rather than a true zero correlation
    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [0, 0, 0, 0]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_no_correlation_random():
    # b alternates in sign around a mean of 0
    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, -10, 10, -10]})
    codeflash_output = correlation(df)
    result = codeflash_output
    # a: mean=2.5, b: mean=0
    # sum((a_i-2.5)*(b_i-0)) = -1.5*10 + -0.5*-10 + 0.5*10 + 1.5*-10 = -15 + 5 + 5 - 15 = -20
    # cov = -20/4 = -5
    # std_a = sqrt(mean((a_i-2.5)^2)) = sqrt((2.25+0.25+0.25+2.25)/4) = sqrt(1.25) ~1.1180
    # std_b = sqrt(mean((b_i-0)^2)) = sqrt((100+100+100+100)/4) = sqrt(100) = 10
    # corr = -5/(1.1180*10) ~ -0.4472, so this is a moderate negative correlation, not zero
    # A second example, with a positive but imperfect correlation:
    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 7, 6, 8]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_non_numeric_columns_ignored():
    # Only numeric columns are considered
    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1], "c": ["x", "y", "z", "w"]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_single_row():
    # Only one row: std is zero, so all correlations should be nan
    df = pd.DataFrame({"a": [1], "b": [2]})
    codeflash_output = correlation(df)
    result = codeflash_output
    for k in result:
        pass


def test_single_column():
    # Only one numeric column
    df = pd.DataFrame({"a": [1, 2, 3, 4]})
    codeflash_output = correlation(df)
    result = codeflash_output


# --- Edge Test Cases ---


def test_empty_dataframe():
    # No columns, no rows
    df = pd.DataFrame()
    codeflash_output = correlation(df)
    result = codeflash_output


def test_all_nan_column():
    # One column is all NaN
    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [float("nan")] * 4})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_some_nan_values():
    # Some values are NaN, should ignore those rows in pairwise calculation
    df = pd.DataFrame({"a": [1, 2, float("nan"), 4], "b": [1, float("nan"), 3, 4]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_column_with_zero_variance():
    # One column is constant
    df = pd.DataFrame({"a": [1, 1, 1, 1], "b": [2, 3, 4, 5]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_mixed_types():
    # Columns with mixed types (should only use numeric)
    df = pd.DataFrame(
        {"a": [1, 2, 3], "b": [4, 5, 6], "c": [True, False, True], "d": ["x", "y", "z"]}
    )
    codeflash_output = correlation(df)
    result = codeflash_output


def test_non_overlapping_nans():
    # Each column has a NaN, but never on the same row, so only one row overlaps
    df = pd.DataFrame({"a": [1, 2, float("nan")], "b": [float("nan"), 2, 3]})
    codeflash_output = correlation(df)
    result = codeflash_output
    # Only row 1 is valid for both
    # So only one value, so std=0, so all correlations nan
    for k in result:
        pass


# --- Large Scale Test Cases ---


def test_large_dataframe_perfect_corr():
    # 1000 rows, two columns, perfect correlation
    N = 1000
    df = pd.DataFrame({"a": list(range(N)), "b": list(range(N))})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_large_dataframe_negative_corr():
    # 1000 rows, two columns, perfect negative correlation
    N = 1000
    df = pd.DataFrame({"a": list(range(N)), "b": list(range(N - 1, -1, -1))})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_large_dataframe_random_corr():
    # 1000 rows, two columns, random data
    import random

    random.seed(42)
    N = 1000
    a = [random.random() for _ in range(N)]
    b = [random.random() for _ in range(N)]
    df = pd.DataFrame({"a": a, "b": b})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_large_dataframe_some_nans():
    # 1000 rows, some NaNs in both columns
    import random

    random.seed(0)
    N = 1000
    a = [random.random() if i % 10 != 0 else float("nan") for i in range(N)]
    b = [random.random() if i % 15 != 0 else float("nan") for i in range(N)]
    df = pd.DataFrame({"a": a, "b": b})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_large_dataframe_all_nans():
    # 1000 rows, all NaNs in one column
    N = 1000
    df = pd.DataFrame({"a": list(range(N)), "b": [float("nan")] * N})
    codeflash_output = correlation(df)
    result = codeflash_output


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import math

import pandas as pd

# imports
import pytest  # used for our unit tests
from src.statistics.descriptive import correlation

# unit tests

# --------- BASIC TEST CASES ----------


def test_perfect_positive_correlation():
    # Two columns, perfectly correlated
    df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_perfect_negative_correlation():
    # Two columns, perfectly negatively correlated
    df = pd.DataFrame({"a": [1, 2, 3], "b": [6, 4, 2]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_zero_correlation():
    # Two columns, no correlation
    df = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 0]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_partial_correlation():
    # Not perfect correlation
    df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 1, 4, 3]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_non_numeric_columns_ignored():
    # Only numeric columns should be considered
    df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6], "c": ["x", "y", "z"]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_nan_ignored_in_pairwise():
    # NaNs should be ignored pairwise
    df = pd.DataFrame({"a": [1, 2, None, 4], "b": [2, None, 6, 8]})
    codeflash_output = correlation(df)
    result = codeflash_output
    # Only rows 0 and 3 are valid for both
    expected_corr = 1.0  # (1,2) and (4,8) are perfectly correlated


# --------- EDGE TEST CASES ----------


def test_empty_dataframe():
    # No columns, no rows
    df = pd.DataFrame()
    codeflash_output = correlation(df)
    result = codeflash_output


def test_one_row():
    # Only one row: variance is zero, so correlation should be nan
    df = pd.DataFrame({"a": [1], "b": [2]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_one_column():
    # Only one column, should only return self-correlation
    df = pd.DataFrame({"a": [1, 2, 3]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_all_nan_column():
    # A column with all NaN should yield nan correlations
    df = pd.DataFrame({"a": [1, 2, 3], "b": [None, None, None]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_no_numeric_columns():
    # All columns are non-numeric
    df = pd.DataFrame({"x": ["a", "b"], "y": ["c", "d"]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_partial_nan_overlap():
    # NaNs such that for some pairs, no overlap exists
    df = pd.DataFrame({"a": [1, None], "b": [None, 2]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_zero_variance():
    # All values in a column are the same, so std=0, correlation is nan
    df = pd.DataFrame({"a": [5, 5, 5], "b": [1, 2, 3]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_integer_and_float_types():
    # Columns with int and float types
    df = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_column_with_inf():
    # Columns with inf values should be handled
    df = pd.DataFrame({"a": [1, 2, float("inf")], "b": [2, 4, 6]})
    codeflash_output = correlation(df)
    result = codeflash_output


# --------- LARGE SCALE TEST CASES ----------


def test_large_perfect_correlation():
    # Large DataFrame, two perfectly correlated columns
    n = 1000
    df = pd.DataFrame({"a": list(range(n)), "b": [2 * x for x in range(n)]})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_large_random_data():
    # Large DataFrame, random data, correlation should be close to 0
    import random

    random.seed(42)
    n = 1000
    a = [random.uniform(-100, 100) for _ in range(n)]
    b = [random.uniform(-100, 100) for _ in range(n)]
    df = pd.DataFrame({"a": a, "b": b})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_large_all_nan_column():
    # Large DataFrame, one column all NaN
    n = 1000
    df = pd.DataFrame({"a": list(range(n)), "b": [float("nan")] * n})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_large_sparse_overlap():
    # Large DataFrame, only a few overlapping non-NaN values
    n = 1000
    a = [1 if i == 0 else float("nan") for i in range(n)]
    b = [2 if i == 0 else float("nan") for i in range(n)]
    df = pd.DataFrame({"a": a, "b": b})
    codeflash_output = correlation(df)
    result = codeflash_output


def test_large_mixed_types():
    # Large DataFrame with int, float, and NaN
    n = 1000
    df = pd.DataFrame(
        {
            "a": list(range(n)),
            "b": [float(x) if x % 2 == 0 else float("nan") for x in range(n)],
            "c": [float("nan") if x % 2 == 0 else x for x in range(n)],
        }
    )
    codeflash_output = correlation(df)
    result = codeflash_output


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-correlation-midsvju6 and push.


@codeflash-ai codeflash-ai bot requested a review from KRRT7 November 24, 2025 23:51
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 24, 2025