
Conversation

@github-actions (Contributor)

Summary

This PR adds comprehensive benchmarking coverage for matrix operations as part of Phase 1 (Quick Wins) of the performance improvement plan, establishing baseline performance metrics for all core matrix operations.

Performance Goal

Goal Selected: Add comprehensive matrix operation benchmarks (Phase 1, Priority: HIGH)

Rationale: The research plan identified that while vector operations had benchmarks, matrix operations had no benchmarking coverage. This PR fills that critical gap by adding 14 comprehensive benchmarks covering:

  • Element-wise operations
  • Scalar operations
  • Matrix multiplication
  • Matrix-vector operations
  • Transpose
  • Row/column access
  • Broadcast operations

Changes Made

New Benchmarks Added

All benchmarks test three matrix sizes (10×10, 50×50, 100×100) and use MemoryDiagnoser to track allocations. A minimal sketch of the benchmark class shape appears after the list below.

Element-wise Operations:

  1. ElementWiseAdd - SIMD-accelerated element-wise addition
  2. ElementWiseSubtract - SIMD-accelerated element-wise subtraction
  3. ElementWiseMultiply - SIMD-accelerated Hadamard product
  4. ElementWiseDivide - SIMD-accelerated element-wise division

Scalar Operations:

  5. ScalarAdd - Add a scalar to all matrix elements
  6. ScalarMultiply - Multiply all matrix elements by a scalar

Matrix Multiplication:

  7. MatrixMultiply - Standard matrix-matrix multiplication (matmul)

Matrix-Vector Operations:

  8. MatrixVectorMultiply - Matrix × vector (SIMD-optimized)
  9. VectorMatrixMultiply - Row vector × matrix (SIMD-optimized)

Structure Operations:

  10. Transpose - Block-based transpose (16×16 blocks)

Access Patterns:

  11. GetRow - Extract a single row (contiguous memory)
  12. GetCol - Extract a single column (strided access)

Broadcast Operations:

  13. AddRowVector - Add a row vector to all matrix rows (SIMD)
  14. AddColVector - Add a column vector to all matrix columns (SIMD)
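
To make the pattern concrete, here is a minimal sketch of how such a benchmark class is shaped with BenchmarkDotNet. The FsMath calls (`Matrix.init`, the overloaded `+` operator) are assumptions for illustration, not necessarily the exact API in src/FsMath/Matrix.fs:

```fsharp
// Minimal sketch, assuming FsMath exposes Matrix<float> with a Matrix.init
// constructor and an overloaded (+) operator; illustrative only.
open BenchmarkDotNet.Attributes
open FsMath

[<MemoryDiagnoser>]
type MatrixBenchmarks() =

    // BenchmarkDotNet runs each benchmark once per size: 14 × 3 = 42 cases.
    [<Params(10, 50, 100)>]
    member val Size = 0 with get, set

    member val A = Unchecked.defaultof<Matrix<float>> with get, set
    member val B = Unchecked.defaultof<Matrix<float>> with get, set

    [<GlobalSetup>]
    member this.Setup() =
        this.A <- Matrix.init this.Size this.Size (fun i j -> float (i + j))
        this.B <- Matrix.init this.Size this.Size (fun i j -> float (i * j + 1))

    // Returning the result prevents dead-code elimination of the operation.
    [<Benchmark>]
    member this.ElementWiseAdd() = this.A + this.B
```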

Files Modified

  • benchmarks/FsMath.Benchmarks/Matrix.fs - New benchmark class
  • benchmarks/FsMath.Benchmarks/FsMath.Benchmarks.fsproj - Added Matrix.fs to compilation
  • benchmarks/FsMath.Benchmarks/Program.fs - Registered MatrixBenchmarks class

Approach

  1. ✅ Analyzed existing Matrix operations in src/FsMath/Matrix.fs
  2. ✅ Identified all public matrix operations to benchmark
  3. ✅ Created comprehensive benchmark suite following VectorBenchmarks pattern
  4. ✅ Used appropriate sizes (10, 50, 100) to capture scaling behavior
  5. ✅ Verified compilation and benchmark discovery
  6. ✅ Ran complete benchmark suite with --job short
  7. ✅ Collected and analyzed baseline performance metrics

Performance Measurements

Test Environment

  • Platform: Linux Ubuntu 24.04.3 LTS (virtualized)
  • CPU: AMD EPYC 7763, 2 physical cores (4 logical) with AVX2 SIMD
  • Runtime: .NET 8.0.20 with hardware intrinsics (AVX2, AES, BMI1, BMI2, FMA, LZCNT, PCLMUL, POPCNT)
  • Job: ShortRun (3 warmup iterations, 3 measurement iterations, 1 launch)

Results Summary by Operation Type

Element-wise Operations (10x10)

All element-wise operations show excellent SIMD performance with ~70 ns latency:

  • Add: 71.3 ns, 856 B allocated
  • Subtract: 70.1 ns, 856 B allocated
  • Multiply: 70.5 ns, 856 B allocated
  • Divide: 77.1 ns, 856 B allocated (slightly slower due to division complexity)

Scalar Operations (10x10)

Scalar operations are slightly faster than the element-wise ones, likely because they read a single operand matrix instead of two:

  • Add: 64.4 ns, 856 B
  • Multiply: 63.3 ns, 856 B

Matrix Multiplication Scaling

Shows the expected O(n³) scaling (doubling n from 50 to 100 should cost roughly 8×; the measured ratio is 224 μs / 32.4 μs ≈ 6.9×, close to the cubic prediction):

  • 10×10: 725 ns (1.9 KB)
  • 50×50: 32.4 μs (40.6 KB)
  • 100×100: 224 μs (160.9 KB)

Matrix-Vector Operations (100x100)

  • Matrix × vector: 1,994 ns (824 B) - O(n²)
  • Vector × matrix: 9,208 ns (824 B) - ~4.6× slower due to strided column access

Access Pattern Comparison (100x100)

  • GetRow: 47.9 ns - very fast (contiguous memory)
  • GetCol: 105.9 ns - 2.2× slower (strided access; see the sketch below)
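
The gap follows directly from row-major storage: a row is one contiguous block copy, while a column touches one element per row at a stride of the row length. A hypothetical sketch of the two access patterns (assuming a flat float[] backing array, not the library's actual code):

```fsharp
// Hypothetical row/column extraction over a row-major float[] backing array.
let getRow (m: float[]) (cols: int) (i: int) : float[] =
    let row = Array.zeroCreate<float> cols
    System.Array.Copy(m, i * cols, row, 0, cols)   // one contiguous block copy
    row

let getCol (m: float[]) (rows: int) (cols: int) (j: int) : float[] =
    let col = Array.zeroCreate<float> rows
    for i in 0 .. rows - 1 do
        col.[i] <- m.[i * cols + j]                // stride of `cols` per element
    col
```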

Detailed Results Table

| Operation              | 10×10   | 50×50    | 100×100   |
|------------------------|---------|----------|-----------|
| Element-wise Add       | 71.3 ns | 1,437 ns | 5,052 ns  |
| Element-wise Subtract  | 70.1 ns | 1,388 ns | 4,991 ns  |
| Element-wise Multiply  | 70.5 ns | 1,433 ns | 5,009 ns  |
| Element-wise Divide    | 77.1 ns | 1,560 ns | 5,943 ns  |
| Scalar Add             | 64.4 ns | 1,185 ns | 4,222 ns  |
| Scalar Multiply        | 63.3 ns | 1,174 ns | 4,407 ns  |
| Matrix Multiply        | 725 ns  | 32.4 μs  | 224 μs    |
| Matrix × Vector        | 57.2 ns | 558 ns   | 1,994 ns  |
| Vector × Matrix        | 84.3 ns | 1,958 ns | 9,208 ns  |
| Transpose              | 195 ns  | 4,103 ns | 12,617 ns |
| GetRow                 | 12.6 ns | 28.3 ns  | 47.9 ns   |
| GetCol                 | 16.1 ns | 56.6 ns  | 105.9 ns  |
| Add Row Vector         | 96.6 ns | 1,301 ns | 4,452 ns  |
| Add Col Vector         | 92.7 ns | 1,229 ns | 4,098 ns  |

Key Observations

  1. SIMD Effectiveness: Element-wise operations show excellent SIMD utilization with minimal overhead
  2. Linear Scaling: Most operations scale linearly with the number of elements, i.e. O(n²) time for n×n matrices
  3. Memory Layout Impact: Row access is ~2× faster than column access due to row-major storage
  4. MatMul Performance: Matrix multiplication shows expected cubic scaling; could be optimized in Phase 2 with blocked GEMM
  5. Vector × Matrix Asymmetry: Vector × matrix is up to ~4.6× slower than matrix × vector (at 100×100) due to strided column access patterns
  6. Allocation Patterns: All operations allocate exactly what's needed for output (no excess allocations)

Performance Bottlenecks Identified

From these benchmarks, we can identify Phase 2 optimization opportunities:

  1. Matrix Multiplication (100×100: 224 μs) - Candidate for a blocked/tiled GEMM algorithm (sketched after this list)
  2. Vector × Matrix (100×100: 9.2 μs) - Could benefit from transpose optimization or gather/scatter patterns
  3. GetCol (100×100: 106 ns) - Column extraction could use SIMD gather operations
  4. Transpose (100×100: 12.6 μs) - Already uses 16×16 blocking, but could be tuned
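
For the first of these, a minimal sketch of the blocked GEMM idea, assuming row-major float[] storage (blockSize and layout are illustrative assumptions, not the planned Matrix.fs implementation):

```fsharp
// Blocked (tiled) matrix multiply: c = a * b for n×n row-major matrices.
// Tiling keeps the working set of a, b, and c cache-resident.
let matmulBlocked (a: float[]) (b: float[]) (n: int) (blockSize: int) : float[] =
    let c = Array.zeroCreate<float> (n * n)
    for ii in 0 .. blockSize .. n - 1 do
        for kk in 0 .. blockSize .. n - 1 do
            for jj in 0 .. blockSize .. n - 1 do
                for i in ii .. min (ii + blockSize) n - 1 do
                    for k in kk .. min (kk + blockSize) n - 1 do
                        let aik = a.[i * n + k]
                        // The inner j-loop walks contiguous memory in both
                        // b and c, which is SIMD- and cache-friendly.
                        for j in jj .. min (jj + blockSize) n - 1 do
                            c.[i * n + j] <- c.[i * n + j] + aik * b.[k * n + j]
    c
```

A tuned blockSize (often 32-64 for doubles) would be chosen from the benchmark data in Phase 2.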

Replicating the Performance Measurements

To replicate these benchmarks:

# 1. Build the project
./build.sh

# 2. Run matrix benchmarks with short job (~5 minutes)
dotnet run -c Release --project benchmarks/FsMath.Benchmarks/FsMath.Benchmarks.fsproj -- --filter "*MatrixBenchmarks*" --job short

# 3. For more accurate measurements, run with default settings (~20-30 minutes)
dotnet run -c Release --project benchmarks/FsMath.Benchmarks/FsMath.Benchmarks.fsproj -- --filter "*MatrixBenchmarks*"

# 4. To run ALL benchmarks (vector + matrix):
dotnet run -c Release --project benchmarks/FsMath.Benchmarks/FsMath.Benchmarks.fsproj -- --job short

Results will be saved to BenchmarkDotNet.Artifacts/results/ in multiple formats (GitHub MD, HTML, CSV).

Testing

✅ All benchmarks compile successfully
✅ All 14 matrix benchmarks × 3 sizes = 42 benchmarks discovered
✅ All benchmarks execute without errors
✅ Existing tests still pass (132 tests)
✅ No performance report files included in commit

Next Steps

This PR establishes comprehensive baseline measurements for matrix operations. Based on these measurements, future work from the performance plan includes:

Phase 1 (remaining):

  • Document performance characteristics across all operations

Phase 2 (algorithmic improvements):

  1. Implement blocked/tiled matrix multiplication (expected 1.5-3× improvement for 100×100+)
  2. Optimize column operations with SIMD gather/scatter
  3. Improve vector × matrix performance (target: match matrix × vector)
  4. Tune transpose block size based on cache hierarchy

Phase 3 (advanced optimizations):

  • Add parallel options for large matrix operations
  • Cache-aware tuning based on benchmark data
  • Specialized routines for symmetric/triangular matrices

Related Issues/Discussions

Commands Used

# Created branch
git checkout -b perf/matrix-operation-benchmarks

# Built project
./build.sh

# Listed benchmarks
dotnet run -c Release --project benchmarks/FsMath.Benchmarks/FsMath.Benchmarks.fsproj -- --list flat

# Ran benchmarks
dotnet run -c Release --project benchmarks/FsMath.Benchmarks/FsMath.Benchmarks.fsproj -- --filter "*MatrixBenchmarks*" --job short

🤖 Generated with Claude Code

AI generated by Daily Perf Improver

This commit adds extensive benchmarking coverage for matrix operations
as part of Phase 1 of the performance improvement plan.

Changes:
- Add Matrix.fs benchmark file with 14 comprehensive benchmarks
- Benchmark element-wise operations (add, subtract, multiply, divide)
- Benchmark scalar operations (add, multiply)
- Benchmark matrix multiplication (matmul)
- Benchmark matrix-vector operations (both directions)
- Benchmark transpose operation
- Benchmark row/column access patterns
- Benchmark broadcast operations (addRowVector, addColVector)
- Test with sizes: 10x10, 50x50, 100x100

Benchmarks use BenchmarkDotNet with MemoryDiagnoser to track allocations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@dsyme closed this Oct 11, 2025
@dsyme reopened this Oct 11, 2025
github-actions bot added a commit that referenced this pull request Oct 12, 2025
This commit significantly improves the performance of row vector × matrix
multiplication by reorganizing the computation to exploit row-major storage
and SIMD acceleration.

## Key Changes

- Rewrote `Matrix.multiplyRowVector` to use a weighted sum of matrix rows
- Original: column-wise accumulation with strided memory access
- Optimized: row-wise accumulation with contiguous memory and SIMD

## Performance Improvements

Compared to baseline (from PR #20):

| Size    | Before    | After     | Improvement |
|---------|-----------|-----------|-------------|
| 10×10   | 84.3 ns   | 55.2 ns   | 34.5% faster |
| 50×50   | 1,958 ns  | 622.6 ns  | 68.2% faster |
| 100×100 | 9,208 ns  | 1,905 ns  | 79.3% faster |

The optimization achieves a 3.1-4.8× speedup for larger matrices (per the table above) by:
1. Eliminating strided column access patterns
2. Enabling SIMD vectorization on contiguous row data
3. Broadcasting vector weights efficiently across SIMD lanes
4. Skipping zero weights to reduce unnecessary computation

## Implementation Details

The new implementation computes: result = v[0]*row0 + v[1]*row1 + ... + v[n-1]*row(n-1) (see the sketch after the list below)

This approach:
- Accesses matrix rows contiguously (cache-friendly)
- Broadcasts each weight v[i] to all SIMD lanes
- Accumulates weighted rows directly into the result vector
- Falls back to the original scalar implementation for small matrices
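
A minimal sketch of this weighted-row-sum formulation, assuming row-major float[] storage and System.Numerics.Vector for SIMD (the actual Matrix.multiplyRowVector may differ in details):

```fsharp
open System.Numerics

// result = v[0]*row0 + v[1]*row1 + ... + v[rows-1]*row(rows-1)
let multiplyRowVector (v: float[]) (m: float[]) (rows: int) (cols: int) : float[] =
    let result = Array.zeroCreate<float> cols
    let lanes = Vector<float>.Count
    for i in 0 .. rows - 1 do
        let w = v.[i]
        if w <> 0.0 then                        // skip zero weights
            let wv = Vector<float>(w)           // broadcast weight to all lanes
            let rowBase = i * cols
            let mutable j = 0
            while j <= cols - lanes do          // SIMD over the contiguous row
                let acc = Vector<float>(result, j)
                let row = Vector<float>(m, rowBase + j)
                (acc + wv * row).CopyTo(result, j)
                j <- j + lanes
            while j < cols do                   // scalar tail
                result.[j] <- result.[j] + w * m.[rowBase + j]
                j <- j + 1
    result
```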

## Testing

- All 132 existing tests pass
- Benchmark infrastructure added (Matrix.fs benchmarks)
- Memory allocations unchanged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@dsyme marked this pull request as ready for review October 12, 2025 12:56
@dsyme merged commit 7dcbf9b into main Oct 12, 2025
2 checks passed