sgemm: reuse loaded vector in AVX dot product calculation #17648
This change optimizes the AVX-based sgemm (single-precision general matrix multiplication) kernel by introducing a local `__m256i` variable, `avec`, to cache the result of `load(A + lda * (ii + i) + l)`. Previously, this memory load was performed redundantly four times per iteration, once inside each of the `updot` calls for `Cv[0][i]` through `Cv[3][i]`. Loading the vector once and reusing it eliminates the redundant memory accesses, reducing memory latency and improving instruction-level parallelism. This is a common subexpression elimination (CSE) optimization, which matters for performance in the tight loops of vectorized kernels.
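A minimal, self-contained sketch of the pattern is below. The helper names (`load`, `updot`), the `int8_t` element type, the 4×4 tile shape, and the accumulator math are stand-ins assumed for illustration; only the hoisting of the `A` load into `avec` reflects the actual change.

```cpp
#include <immintrin.h> // AVX2
#include <stdint.h>

// Hypothetical stand-in for the kernel's vector load; the signature is an assumption.
static inline __m256i load(const int8_t *p) {
    return _mm256_loadu_si256((const __m256i *)p);
}

// Placeholder dot-product update: multiply-add u and s into the accumulator *c.
static inline void updot(__m256i u, __m256i s, __m256i *c) {
    *c = _mm256_add_epi32(*c, _mm256_madd_epi16(u, s));
}

void tile(const int8_t *A, const int8_t *B, __m256i Cv[4][4],
          int64_t lda, int64_t ldb, int64_t ii, int64_t jj, int64_t k) {
    for (int64_t l = 0; l < k; ++l)
        for (int64_t i = 0; i < 4; ++i) {
            // Previously the A vector was reloaded inside each of the four
            // updot calls; caching it in avec performs the load once and
            // keeps the value in a register across all four updates.
            __m256i avec = load(A + lda * (ii + i) + l);
            for (int64_t j = 0; j < 4; ++j)
                updot(load(B + ldb * (jj + j) + l), avec, &Cv[j][i]);
        }
}
```

A compiler can sometimes perform this CSE on its own when it can prove the loads do not alias, but hoisting the load explicitly makes the reuse unconditional and keeps the inner loop's memory traffic down.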
Co-Authored-By: Gemini 2.5 Pro (References and desc commit changes)