Conversation

@GermanAizek (Contributor) commented Dec 1, 2025

This change optimizes the AVX-based `sgemm` (single-precision general matrix multiplication) kernel by introducing a local `__m256i` variable, `avec`, to cache the result of `load(A + lda * (ii + i) + l)`. Previously, this memory load was issued four times per iteration, once inside each of the `updot` calls for `Cv[0][i]` through `Cv[3][i]`.

By loading the vector once and reusing it, the code eliminates the redundant memory accesses, reducing load pressure in the inner loop. This is a common subexpression elimination (CSE) optimization, which matters in the tight loops of vectorized kernels.
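To make the pattern concrete, here is a minimal sketch of the hoisting under stated assumptions: `load`, `updot`, and `inner_step` are hypothetical stand-ins shaped like the kernel's helpers rather than the real implementations, the accumulator tile is four rows wide as in the description, and AVX2 intrinsics are used for the stand-in dot product.

```cpp
#include <immintrin.h>

// Hypothetical stand-in for the kernel's load(): an unaligned 256-bit load.
static inline __m256i load(const char *p) {
    return _mm256_loadu_si256(reinterpret_cast<const __m256i *>(p));
}

// Hypothetical stand-in for updot(): pairwise u8*i8 dot products, widened to
// 32-bit, converted to float, and accumulated into c (requires AVX2).
static inline __m256 updot(__m256i u, __m256i s, __m256 c) {
    __m256i dot = _mm256_madd_epi16(_mm256_set1_epi16(1),
                                    _mm256_maddubs_epi16(u, s));
    return _mm256_add_ps(c, _mm256_cvtepi32_ps(dot));
}

// Before: the shared A-block load is written out once per updot call.
//   Cv[0] = updot(load(A + off), load(B[0] + off), Cv[0]);
//   Cv[1] = updot(load(A + off), load(B[1] + off), Cv[1]);  // redundant load
//   ... and likewise for Cv[2] and Cv[3].
//
// After: hoist the shared load into a local and reuse it four times.
void inner_step(const char *A, const char *const B[4], __m256 Cv[4], long off) {
    const __m256i avec = load(A + off);  // one load instead of four
    for (int j = 0; j < 4; ++j)
        Cv[j] = updot(avec, load(B[j] + off), Cv[j]);
}
```

Whether a compiler already folds the four loads into one depends on what it can prove about aliasing; hoisting the value into a local makes the reuse explicit either way.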

References:
*   [Common Subexpression Elimination - Wikipedia](https://en.wikipedia.org/wiki/Common_subexpression_elimination)
*   [Optimizing with Intel AVX2 - Intel Developer Zone](https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-with-intel-avx2.html)
*   [SIMD performance: data alignment and memory access - Daniel Lemire's Blog](https://lemire.me/blog/2012/05/31/simd-performance-data-alignment-and-memory-access/)
*   [Loop Optimization in Compiler Design - GeeksforGeeks](https://www.geeksforgeeks.org/loop-optimization-in-compiler-design/)
*   [Performance Optimization - CPU Caches and Memory Hierarchy - Princeton University](https://www.cs.princeton.edu/courses/archive/fall09/cos333/lectures/17_perf.pdf)

Co-Authored-By: Gemini 2.5 Pro (references and commit description)
@github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) Dec 1, 2025
@pwilkin added the `vibe-coded` label (Created with heavy use of LLM assistants, requires human verification) Dec 1, 2025