sgemm: reuse loaded vector in AVX dot product calculation #17648
This change optimizes the AVX-based sgemm (single-precision general matrix multiplication) kernel by introducing a local `__m256i` variable, `avec`, to cache the result of `load(A + lda * (ii + i) + l)`. Previously, this memory load was performed redundantly four times per iteration, once inside each of the `updot` calls for `Cv[0][i]` through `Cv[3][i]`. Loading the vector once and reusing it eliminates the redundant memory accesses, reducing memory latency and improving instruction-level parallelism. This is a common subexpression elimination (CSE) optimization, which matters for performance in the tight loops of vectorized kernels.
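A minimal, self-contained sketch of the pattern is below. The helper names (`load`, `updot`), the `int8_t` element type, the 4×4 tile shape, and the accumulator math are stand-ins assumed for illustration; only the hoisting of the `A` load into `avec` reflects the actual change.

```cpp
#include <immintrin.h> // AVX2
#include <stdint.h>

// Hypothetical stand-in for the kernel's vector load; the signature is an assumption.
static inline __m256i load(const int8_t *p) {
    return _mm256_loadu_si256((const __m256i *)p);
}

// Placeholder dot-product update: multiply-add u and s into the accumulator *c.
static inline void updot(__m256i u, __m256i s, __m256i *c) {
    *c = _mm256_add_epi32(*c, _mm256_madd_epi16(u, s));
}

void tile(const int8_t *A, const int8_t *B, __m256i Cv[4][4],
          int64_t lda, int64_t ldb, int64_t ii, int64_t jj, int64_t k) {
    for (int64_t l = 0; l < k; ++l)
        for (int64_t i = 0; i < 4; ++i) {
            // Previously the A vector was reloaded inside each of the four
            // updot calls; caching it in avec performs the load once and
            // keeps the value in a register across all four updates.
            __m256i avec = load(A + lda * (ii + i) + l);
            for (int64_t j = 0; j < 4; ++j)
                updot(load(B + ldb * (jj + j) + l), avec, &Cv[j][i]);
        }
}
```

A compiler can sometimes perform this CSE on its own when it can prove the loads do not alias, but hoisting the load explicitly makes the reuse unconditional and keeps the inner loop's memory traffic down.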
Co-Authored-By: Gemini 2.5 Pro (References and desc commit changes)