-
Notifications
You must be signed in to change notification settings - Fork 252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AVX512 accelerated version resulting in a 4x speed improvement over AVX2 #91
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
galoisAvx512_amd64.go
Outdated
|
||
// Construct block of matrix coefficients for 2 outputs rows in parallel | ||
func setupMatrix82(matrixRows [][]byte, inputOffset, outputOffset int, matrix *[(16 + 16) * dimIn * dimOut82]byte) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Extra newline at beginning of functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
galoisAvx512_amd64.go
Outdated
} | ||
remain := len(in[0]) - done | ||
|
||
if remain > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's just return, so we can kill some indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Thanks for the comment. However the ~7th instruction or so starts loading the matrix coefficients into the ZMM registers so it already pretty quickly starts accessing them. I can move this instruction up but I doubt whether it would make much real world difference. |
@fwessels Thanks for the changes. I will merge them as is, since the remaining stuff is just sugar coating. I was thinking of adding a dummy call in New around here - that would start the warmup before the matrices are calculated, which will probably take more cycles than is needed. |
Yes, as a dummy call around position that could help. And thanks for the merge. |
…VX2 (klauspost#91) The performance on AVX512 has been accelerated for Intel CPUs. This gives speedups on a per-core basis of up to 4x compared to AVX2 as can be seen in the following table: ``` $ benchcmp avx2.txt avx512.txt benchmark AVX2 MB/s AVX512 MB/s speedup BenchmarkEncode8x8x1M-72 1681.35 4125.64 2.45x BenchmarkEncode8x4x8M-72 1529.36 5507.97 3.60x BenchmarkEncode8x8x8M-72 791.16 2952.29 3.73x BenchmarkEncode8x8x32M-72 573.26 2168.61 3.78x BenchmarkEncode12x4x12M-72 1234.41 4912.37 3.98x BenchmarkEncode16x4x16M-72 1189.59 5138.01 4.32x BenchmarkEncode24x8x24M-72 690.68 2583.70 3.74x BenchmarkEncode24x8x48M-72 674.20 2643.31 3.92x ```
No description provided.