Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVX512 accelerated version resulting in a 4x speed improvement over AVX2 #91

Merged
merged 3 commits into from
Feb 10, 2019

Conversation

fwessels
Copy link
Contributor

@fwessels fwessels commented Feb 8, 2019

No description provided.

Copy link
Owner

@klauspost klauspost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive speed! No big problems, just some small tweaks.

I just came across this:

avx512-warmup

I guess it couldn't hurt to insert a dummy instruction when 'New' is called to warm up the registers. It will not be measurable in benchmarks since first run will do the warmup, but could affect real world use.

galoisAvx512_amd64.go Outdated Show resolved Hide resolved
galoisAvx512_amd64.go Outdated Show resolved Hide resolved

// Construct block of matrix coefficients for 2 outputs rows in parallel
func setupMatrix82(matrixRows [][]byte, inputOffset, outputOffset int, matrix *[(16 + 16) * dimIn * dimOut82]byte) {

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Extra newline at beginning of functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

galoisAvx512_amd64.go Outdated Show resolved Hide resolved
galoisAvx512_amd64.go Outdated Show resolved Hide resolved
galoisAvx512_amd64.go Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
}
remain := len(in[0]) - done

if remain > 0 {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's just return, so we can kill some indentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

galois_amd64.go Outdated Show resolved Hide resolved
@fwessels
Copy link
Contributor Author

Impressive speed! No big problems, just some small tweaks.

I just came across this:

avx512-warmup

I guess it couldn't hurt to insert a dummy instruction when 'New' is called to warm up the registers. It will not be measurable in benchmarks since first run will do the warmup, but could affect real world use.

Thanks for the comment. However the ~7th instruction or so starts loading the matrix coefficients into the ZMM registers so it already pretty quickly starts accessing them. I can move this instruction up but I doubt whether it would make much real world difference.

@klauspost
Copy link
Owner

@fwessels Thanks for the changes. I will merge them as is, since the remaining stuff is just sugar coating.

I was thinking of adding a dummy call in New around here - that would start the warmup before the matrices are calculated, which will probably take more cycles than is needed.

@klauspost klauspost merged commit 79aee05 into klauspost:master Feb 10, 2019
@fwessels
Copy link
Contributor Author

I was thinking of adding a dummy call in New around here - that would start the warmup before the matrices are calculated, which will probably take more cycles than is needed.

Yes, as a dummy call around position that could help. And thanks for the merge.

dssysolyatin pushed a commit to GetObok/reedsolomon that referenced this pull request Jun 20, 2019
…VX2 (klauspost#91)

The performance on AVX512 has been accelerated for Intel CPUs. This gives speedups on a per-core basis of up to 4x compared to AVX2 as can be seen in the following table:

```
$ benchcmp avx2.txt avx512.txt
benchmark                      AVX2 MB/s    AVX512 MB/s   speedup
BenchmarkEncode8x8x1M-72       1681.35      4125.64       2.45x
BenchmarkEncode8x4x8M-72       1529.36      5507.97       3.60x
BenchmarkEncode8x8x8M-72        791.16      2952.29       3.73x
BenchmarkEncode8x8x32M-72       573.26      2168.61       3.78x
BenchmarkEncode12x4x12M-72     1234.41      4912.37       3.98x
BenchmarkEncode16x4x16M-72     1189.59      5138.01       4.32x
BenchmarkEncode24x8x24M-72      690.68      2583.70       3.74x
BenchmarkEncode24x8x48M-72      674.20      2643.31       3.92x
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants