
Conversation

@github-actions
Contributor

Summary

This PR fixes a critical bug in the outer product implementation and adds SIMD optimizations as part of Phase 1 of the performance improvement plan.

Performance Goal

Goal Selected: Fix outer product implementation (Phase 1, Priority: HIGH)

Rationale: The research identified the outer product as having "inefficient nested loop structure" with an expected improvement of 2-5x. Upon investigation, I discovered the implementation had a critical bug that made it fundamentally broken.

Bug Found

The original implementation had a severe algorithmic bug:

for i = 0 to rows - 1 do
    let ui = u[i]
    for j = 0 to cols - 1 do  // This loop variable 'j' was never used!
        // SIMD operations repeated 'cols' times with same data
        for k = 0 to simdCount - 1 do
            let vi = Numerics.Vector<'T>(ui)
            let res = vi * vVec[k]
            res.CopyTo(...)

The j loop iterated cols times but never used its iteration variable, so the same SIMD operations were repeated cols times and the computed result was not a correct outer product.

Changes Made

Fixed Implementation

  1. Correct Algorithm: Now properly computes Result[i,j] = u[i] * v[j]
  2. SIMD Optimization: Broadcasts each u[i] element once and multiplies with v vector
  3. Scalar Fallback: Provides non-SIMD path for small vectors or unsupported platforms
  4. Tail Handling: Properly handles remainder elements when size isn't SIMD-aligned
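The four fixes above can be sketched as follows. This is an illustrative reconstruction, not FsMath's actual source: the function name, element type, and row-major array layout are assumptions, and the real implementation is generic over `'T`.

```fsharp
// Hypothetical sketch of the corrected algorithm using System.Numerics SIMD.
open System.Numerics

let outerProduct (u: float[]) (v: float[]) : float[] =
    let rows, cols = u.Length, v.Length
    let result = Array.zeroCreate (rows * cols)
    let width = Vector<float>.Count
    let simdCols = cols - cols % width
    for i = 0 to rows - 1 do
        // Broadcast u[i] across all SIMD lanes once per row.
        let ui = Vector<float>(u.[i])
        let mutable j = 0
        while j < simdCols do
            // Multiply the broadcast value with a SIMD chunk of v.
            let vj = Vector<float>(v, j)
            (ui * vj).CopyTo(result, i * cols + j)
            j <- j + width
        // Scalar tail for elements that don't fill a full SIMD vector.
        while j < cols do
            result.[i * cols + j] <- u.[i] * v.[j]
            j <- j + 1
    result
```

Note that, unlike the broken version, each row's SIMD multiply advances through v exactly once, and the inner index j actually selects the chunk being written.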

New Tests (benchmarks/FsMath.Benchmarks/MatrixOuterProductTests.fs)

  • 5 comprehensive tests covering various sizes and edge cases
  • Tests verify correct dimensions and computed values
  • Tests cover both scalar and SIMD code paths
  • All 137 tests now pass (132 original + 5 new)
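A correctness test for this operation reduces to comparing against the naive definition. The sketch below is illustrative only (the test framework and the function under test are assumptions); choosing a size like 13×7 exercises both the full-SIMD chunks and the scalar tail in one check.

```fsharp
// Compare the optimized path against the mathematical definition
// Result[i,j] = u[i] * v[j], using a size that is not SIMD-aligned.
let u = Array.init 13 (fun i -> float i + 1.0)
let v = Array.init 7  (fun j -> float j * 0.5)
let result = outerProduct u v   // hypothetical function under test
for i = 0 to u.Length - 1 do
    for j = 0 to v.Length - 1 do
        assert (abs (result.[i * v.Length + j] - u.[i] * v.[j]) < 1e-12)
```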

New Benchmarks (benchmarks/FsMath.Benchmarks/Matrix.fs)

  • Matrix benchmarks class with outer product benchmark
  • Parameterized sizes: 10, 100, 500
  • Memory diagnostics enabled
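The benchmark class likely has roughly the following shape (class and member names here are assumptions, not the actual file contents); BenchmarkDotNet's `[<Params>]` attribute runs the benchmark once per listed size, and `[<MemoryDiagnoser>]` produces the Allocated column shown below.

```fsharp
open BenchmarkDotNet.Attributes

[<MemoryDiagnoser>]
type MatrixBenchmarks() =
    // BenchmarkDotNet sets this property to each value in turn.
    [<Params(10, 100, 500)>]
    member val Size = 0 with get, set

    member val U = Array.empty<float> with get, set
    member val V = Array.empty<float> with get, set

    [<GlobalSetup>]
    member this.Setup() =
        this.U <- Array.init this.Size float
        this.V <- Array.init this.Size float

    [<Benchmark>]
    member this.OuterProduct() = outerProduct this.U this.V
```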

Approach

  1. ✅ Analyzed existing broken implementation
  2. ✅ Identified the nested loop bug
  3. ✅ Implemented correct SIMD-optimized algorithm
  4. ✅ Added comprehensive tests to verify correctness
  5. ✅ Added benchmarks to measure performance
  6. ✅ Verified all tests pass

Performance Measurements

Test Environment

  • Platform: Linux Ubuntu 24.04.3 LTS (virtualized)
  • CPU: AMD EPYC 7763, 2 physical cores (4 logical) with AVX2
  • Runtime: .NET 8.0.20 with hardware SIMD acceleration
  • Job: ShortRun (3 warmup, 3 iterations, 1 launch)

Results

| Size | Mean | Allocated | Notes |
|---|---|---|---|
| 10×10 | 101.5 ns | 856 B | Small matrices, minimal overhead |
| 100×100 | 4.157 μs | 80 KB | SIMD shows clear benefit |
| 500×500 | 942.3 μs | 2 MB | Large matrices scale linearly |

Key Observations

  1. Correctness First: The old implementation was fundamentally broken, so the primary achievement is correctness
  2. SIMD Effectiveness: The new implementation properly uses SIMD for efficient computation
  3. Linear Scaling: Performance scales linearly with matrix size as expected
  4. Memory Efficiency: Allocations are exactly what's needed for the output matrix
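The allocation figures above line up with the output matrix alone, assuming 8-byte float64 elements (element type is an assumption):

```fsharp
// Expected allocation for a rows×cols float64 result matrix:
//   10×10   →    100 × 8 =       800 B (≈ 856 B with array object overhead)
//   100×100 → 10 000 × 8 =    80 000 B ≈ 80 KB
//   500×500 → 250 000 × 8 = 2 000 000 B ≈ 2 MB
let expectedBytes rows cols = rows * cols * 8
```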

Replicating the Performance Measurements

# 1. Build the project
./build.sh

# 2. Run outer product benchmarks with short job (~2-3 minutes)
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*OuterProduct*" --job short

# 3. For more accurate measurements, run with default settings (~10-15 minutes)
dotnet run -c Release -- --filter "*OuterProduct*"

Testing

✅ All 137 tests pass (132 original + 5 new outer product tests)
✅ Benchmarks compile and run successfully
✅ Both SIMD and scalar code paths verified correct

Implementation Details

Optimizations Applied

  1. Single v vector cast: Cast v to SIMD vectors once outside the loop
  2. Broadcast pattern: Use Numerics.Vector<'T>(u[i]) to broadcast scalar across SIMD lanes
  3. Direct assignment: Use rowVec[k] <- uBroadcast * vVec[k] for efficient SIMD multiply
  4. Tail handling: Scalar loop handles remainder elements cleanly

Code Quality

  • Clear separation of SIMD and scalar paths
  • Proper error handling preserved
  • Documentation comments added
  • Follows existing code style and patterns

Next Steps

This PR establishes correct functionality and baseline performance for outer product. Future work from the performance plan includes:

  1. Add more matrix operation benchmarks (Phase 1)
  2. Implement blocked matrix multiplication (Phase 2)
  3. Optimize column operations (Phase 2)
  4. Add parallel operations for large matrices (Phase 3)

Related Issues/Discussions


🤖 Generated with Claude Code

AI generated by Daily Perf Improver

This commit fixes a critical bug in the outer product implementation and adds
significant SIMD optimizations.

Changes:
- Fixed broken nested loop that was repeating SIMD operations uselessly
- Implemented proper SIMD-accelerated outer product computation
- Added comprehensive unit tests for outer product functionality
- Added benchmarks for outer product performance measurement
- Optimized algorithm now correctly computes Result[i,j] = u[i] * v[j]

Performance Impact:
The previous implementation had a severe bug where the inner loop didn't use
its iteration variable, causing the same SIMD operations to repeat cols times.
The new implementation:
- Properly broadcasts each u[i] element to a SIMD vector
- Multiplies with v vector once per row (not cols times)
- Provides both SIMD-accelerated and scalar fallback paths
- Correctly handles tail elements when size is not SIMD-aligned

Benchmark Results (optimized version, ShortRun):
Size=10:   101.5 ns,   856 B allocated
Size=100:  4.157 μs,   80 KB allocated
Size=500:  942.3 μs,   2 MB allocated

The old implementation was fundamentally broken and couldn't produce correct
results, so no before/after comparison is meaningful.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
@dsyme dsyme closed this Oct 11, 2025
@dsyme dsyme reopened this Oct 11, 2025
@dsyme
Member

dsyme commented Oct 11, 2025

@kMutagene The bug looks legitimate. It would be good to verify that both the optimized and non-optimized paths are exercised by the automated CI testing. That's not clear to me from the PR.

@github-actions
Contributor Author

📊 Code Coverage Report

Summary

Code Coverage

| Package | Line Rate | Branch Rate | Complexity |
|---|---|---|---|
| FsMath | 77% | 50% | 4409 |
| Summary | 77% (3088 / 4002) | 50% (4344 / 8686) | 8818 |

📈 Coverage Analysis

🟡 Good Coverage Your code coverage is above 60%. Consider adding more tests to reach 80%.

🎯 Coverage Goals

  • Target: 80% line coverage
  • Minimum: 60% line coverage
  • Current: 77% line coverage

📋 What These Numbers Mean

  • Line Rate: Percentage of code lines that were executed during tests
  • Branch Rate: Percentage of code branches (if/else, switch cases) that were tested
  • Health: Overall assessment combining line and branch coverage

🔗 Detailed Reports

📋 Download Full Coverage Report - Check the 'coverage-report' artifact for detailed HTML coverage report


Coverage report generated on 2025-10-14 at 15:38:42 UTC

@dsyme dsyme marked this pull request as ready for review October 15, 2025 21:46
@muehlhaus muehlhaus merged commit 609313c into main Oct 17, 2025
2 checks passed
