performance on amd 7950x ... #6
Comments
I even tried the AOCC compiler, since support for Zen 4 is limited in gcc-12, but I ended up with similar results.
Zen 4 based CPUs perform poorly because AMD's implementation of I've also run the benchmark on my 7700X CPU and also get extremely poor results for Zen 4. Thankfully, replacing calls to
Thanks for the explanation. It is good that the instruction can be emulated and that the results can then be similar to the speedup seen on Intel.
The slow performance of the AVX512 version of x86-simd-sort on Zen 4 is most probably explained in the AMD manuals, which can be found at: https://www.amd.com/en/search/documentation/hub.html#q=software%20optimization%20guide%20for%20the%20amd%20microarchitecture&f-amd_document_type=Software%20Optimization%20Guides The Software Optimization Guide for the AMD Zen4 Microarchitecture has the following remark in chapter "2.11.2 Code recommendations":
The Software Optimization Guide for the AMD Zen5 Microarchitecture doesn't have any remark about COMPRESS instructions. Could you add some code that disables the AVX512 version on Zen 4, but keeps it enabled on Zen 5 and future Zen architectures?
@tarsa Will this fix work for you? You would need to compile this library with a macro enabled.
@r-devulap I'm thinking more about adding a Zen 4 AVX-512 exclusion to the runtime dispatch, i.e. probably here: https://github.com/intel/x86-simd-sort/blob/59e298d8c9d1bee2cded744b9adbe31107ee220c/lib/x86simdsort.cpp In short, the idea is: if the CPU has AVX-512 but is Zen 4, use AVX2. Otherwise, on Zen 5 and future ones, use AVX-512. That would have fewer negative performance consequences than solutions like openjdk/jdk#16124
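To make the dispatch idea concrete, here is a minimal sketch of the selection logic. It is not code from x86-simd-sort: `SortImpl`, `CpuInfo`, `cpu_family`, `cpu_model`, `is_zen4` and `choose_impl` are all hypothetical names. The Zen 4 model ranges (family 0x19, models 0x10-0x1F and 0x60-0xAF) are an assumption based on how the Linux kernel classifies these parts and should be checked against AMD documentation before use.

```cpp
#include <cstdint>

// All names below are hypothetical; none are taken from x86-simd-sort.
enum class SortImpl { Scalar, Avx2, Avx512 };

struct CpuInfo {
    bool is_amd;          // vendor string is "AuthenticAMD"
    std::uint32_t family; // display family decoded from CPUID leaf 1 EAX
    std::uint32_t model;  // display model decoded from CPUID leaf 1 EAX
    bool has_avx2;
    bool has_avx512;      // every AVX-512 subset the sort kernels need
};

// Display family: base family, plus the extended family when the base is 0xF.
constexpr std::uint32_t cpu_family(std::uint32_t eax) {
    return ((eax >> 8) & 0xF) == 0xF
        ? 0xF + ((eax >> 20) & 0xFF)
        : ((eax >> 8) & 0xF);
}

// Display model, following the AMD rule: the extended model supplies the
// high nibble when the base family is 0xF (true for all Zen parts).
// Intel additionally applies the extended model for family 6; that case
// is not needed for the Zen 4 check and is left out of this sketch.
constexpr std::uint32_t cpu_model(std::uint32_t eax) {
    return ((eax >> 8) & 0xF) == 0xF
        ? ((((eax >> 16) & 0xF) << 4) | ((eax >> 4) & 0xF))
        : ((eax >> 4) & 0xF);
}

// ASSUMPTION: Zen 4 is family 0x19 with models 0x10-0x1F or 0x60-0xAF.
constexpr bool is_zen4(bool is_amd, std::uint32_t family, std::uint32_t model) {
    return is_amd && family == 0x19 &&
           ((model >= 0x10 && model <= 0x1F) || (model >= 0x60 && model <= 0xAF));
}

// AVX-512 everywhere it exists except Zen 4, which falls back to AVX2.
constexpr SortImpl choose_impl(const CpuInfo& cpu) {
    return (cpu.has_avx512 && !is_zen4(cpu.is_amd, cpu.family, cpu.model))
        ? SortImpl::Avx512
        : (cpu.has_avx2 ? SortImpl::Avx2 : SortImpl::Scalar);
}
```

With GCC or Clang, the leaf 1 EAX value could be read with `__get_cpuid` from `<cpuid.h>` and the vendor string from leaf 0. Keeping the decision in a pure function like `choose_impl` makes it testable without access to the hardware in question.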
But AVX512 is faster?? Why would you use AVX2?
The patch from https://web.archive.org/web/20230319232625/https://github.com/natmaurice/x86-simd-sort/commit/41d03b2d8f3b62a2ee6a3a97a8da7f193a407026 is actually incomplete. It improves things if we only care about Zen 4, but the goal is to have a single library that is well optimized for a wide range of popular CPUs. If you define SW_VCOMPRESS, you gain performance on Zen 4 but lose performance on Zen 5 and other AVX-512 capable CPUs. If you don't define SW_VCOMPRESS, you get the situation from this thread's original post. Therefore, to get the most optimized version in all cases, the code would need to be compiled twice, once with SW_VCOMPRESS defined and once without, and the dispatch would then need to choose between the two copies. That would increase maintenance costs, since the AVX-512 versions would need twice the testing, and it would only improve performance for Zen 4, which is already an old generation of CPUs. I'm not sure it's worth the effort compared to the (at least simple sounding) fix described in #6 (comment)
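For readers unfamiliar with the operation being discussed: a masked compress-store (for example the `_mm512_mask_compressstoreu_epi32` intrinsic) writes only the lanes selected by a mask, packed contiguously, to memory. Below is a scalar model of that behaviour for sixteen 32-bit lanes. It only illustrates the semantics; it is not the code from the linked patch, and `compress_store_model` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of a masked compress-store over sixteen 32-bit lanes.
// Lanes whose mask bit is set are written to dst back to back, in lane
// order; memory past the last written element is left untouched.
// Returns the number of elements written (the popcount of the mask).
inline std::size_t compress_store_model(std::int32_t* dst,
                                        std::uint16_t mask,
                                        const std::int32_t* src) {
    std::size_t out = 0;
    for (unsigned lane = 0; lane < 16; ++lane) {
        if (mask & (1u << lane)) {
            dst[out++] = src[lane];
        }
    }
    return out;
}
```

A software replacement along the lines of SW_VCOMPRESS has to produce this same result by other means, which is why the two variants would each need their own testing.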
@tarsa I am fine with that approach. We could start by disabling AVX-512 on Zen 4 and eventually get to enabling it with SW_VCOMPRESS defined. I am happy to review a pull request if someone can work on it and provide benchmarks. I am traveling for the next 6 weeks, so I might not get to it immediately.
Hello,
I ran the benchmark on a 7950X CPU. In some tests it is up to 2.3x faster, but in other tests it is much slower (around 0.3x) compared to classical sorting. Is AMD's implementation of AVX-512 not so powerful (and your code not suitable for Zen 4), or is it something else?
Thanks,
Jan