Add optimized crc32 for Power 8+ processors #478
base: develop
This commit adds an optimized version of the crc32 function based on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/. The code has been relicensed to the zlib license. This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>. It uses vector instructions to speed up the CRC32 algorithm. Decompression times improved by 30% in tests. Based on Daniel Black's work for the original zlib (madler/zlib#478).
6ca5013 to 6d4c1fd
Z_IFUNC(crc32_z) {
#ifdef Z_POWER8
    if (__builtin_cpu_supports("arch_2_07"))
As discussed off-PR, I think this needs to check for VCRYPTO.
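A check along these lines could be sketched as follows. This is only an illustration, not code from the PR: the helper name `power8_vcrypto_available` is hypothetical, and the guard assumes GCC on 64-bit Power, since the PR description notes that clang did not support this builtin on Power at the time. The feature strings `"arch_2_07"` and `"vcrypto"` are the ones GCC documents for `__builtin_cpu_supports` on PowerPC.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of the PR): reports whether both the
 * ISA 2.07 base and the vector crypto facility are advertised by the CPU.
 * On any other compiler or architecture it conservatively returns false,
 * so callers fall back to the generic code path. */
static bool power8_vcrypto_available(void)
{
#if defined(__powerpc64__) && defined(__GNUC__) && !defined(__clang__)
    return __builtin_cpu_supports("arch_2_07") &&
           __builtin_cpu_supports("vcrypto");
#else
    return false;
#endif
}
```

Checking both features keeps the optimized path disabled on systems that report ISA 2.07 without the vector crypto instructions.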
Hi @mscastanho, we are just rebasing to the new zlib release on Fedora and RHEL, and this PR is part of our downstream patches. Are you by any chance working on fixing these conflicts? If not, we are not sure we can backport this anymore due to the complexity of this PR. Looking forward to your reply.
@ljavorsk I tried a local rebase and couldn't see any conflicts. Based on the Fedora zlib repo, I suspect you're using the old patch from #335. In that case, I'd suggest replacing it with this PR, which has some other fixes.
Optimized functions for Power will make use of GNU indirect functions, an extension to support different implementations of the same function, which can be selected during runtime. This will be used to provide optimized functions for different processor versions. Since this is a GNU extension, we placed the definition of the Z_IFUNC macro under `contrib/gcc`. This can be reused by other archs as well. Author: Matheus Castanho <msc@linux.ibm.com> Author: Rogerio Alves <rcardoso@linux.ibm.com>
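For readers unfamiliar with GNU indirect functions, the mechanism described above can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the PR's `Z_IFUNC` macro: the names `example_crc32`, `resolve_example_crc32`, `crc32_generic` and `crc32_power8_standin` are hypothetical, the "optimized" variant is only a stand-in that calls the generic code, and the sketch needs GCC (or a compatible compiler) on an ELF target with ifunc support, such as Linux with glibc.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable bit-at-a-time CRC-32 (reflected polynomial 0xEDB88320),
 * following the zlib calling convention: pass 0 as the initial crc. */
static uint32_t crc32_generic(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical stand-in for a vectorized implementation. A real one would
 * use Power8 vector instructions; here it just reuses the generic code. */
static uint32_t crc32_power8_standin(uint32_t crc, const unsigned char *buf, size_t len)
{
    return crc32_generic(crc, buf, len);
}

typedef uint32_t (*crc32_fn)(uint32_t, const unsigned char *, size_t);

/* The resolver runs once, when the dynamic linker binds example_crc32,
 * and returns the implementation to use from then on. It should avoid
 * calling into libc, hence the compiler builtin for feature detection. */
static crc32_fn resolve_example_crc32(void)
{
#if defined(__powerpc64__) && defined(__GNUC__) && !defined(__clang__)
    if (__builtin_cpu_supports("arch_2_07"))
        return crc32_power8_standin;
#endif
    return crc32_generic;
}

/* Callers see an ordinary function; the symbol is bound via the resolver. */
uint32_t example_crc32(uint32_t crc, const unsigned char *buf, size_t len)
    __attribute__((ifunc("resolve_example_crc32")));
```

Calling `example_crc32(0, (const unsigned char *)"123456789", 9)` should yield the standard CRC-32 check value `0xCBF43926` regardless of which implementation the resolver picked.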
This commit adds an optimized version of the crc32 function based on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/. This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>. It uses vector instructions to speed up the CRC32 algorithm.
Clang 7 changed the behavior of vec_xxpermdi in order to match GCC's behavior. After this change, code that used to work on Clang 6 stopped working on Clang >= 7. Tested on Clang 6, 7, 8 and 9. Reference: https://bugs.llvm.org/show_bug.cgi?id=38192 Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
Rebased on top of current
Thanks @mscastanho. So you're saying this is the same patch, but improved? I ask because the Fedora patch that you've linked was part of a customer request made by your IBM colleagues.
@ljavorsk Exactly. In terms of optimization, it's the same code. What changed is that it's now also enabled when building with cmake, the ifunc usage was improved, and there are a few other minor fixes (complete list in the PR description). Should we open a new request to replace that with this new patch then?
@mscastanho Opening a new request is a great idea; it would help us track it, and we can also have some description there for future reference. Either way, thank you for the explanation.
Hi @mscastanho, could you please also rebase the patch on top of the new zlib-1.2.13 version? Thank you so much :)
@RajalakshmiSR, considering that @mscastanho does not have access to POWER servers anymore, could you give a hand with this, please?
To follow.
@mmatti-sw Can you help to rebase?
I have created #750; the old patches are now rebased onto v1.2.13.
Hi @madler and zlib community,
This is an updated version of @grooverdan 's PR #335. It includes the following changes:
crc32_test.c
when compiling with -Wall

Regarding clang, it is not safe to call `getauxval()` (and other glibc functions) from inside an ifunc resolver, so we ended up using `__builtin_cpu_supports` for feature detection. This builtin is not currently supported by clang on Power. When using clang, the code will compile fine, but without the optimizations. Still, some of the workarounds for clang throughout the code were left as they seem harmless for other compilers.

To measure the performance improvement, we used Chromium's zlib_bench.cc with input files from jsnell/zlib-bench.
The results below show decompression throughput in MB/s using gzip wrapper, for all compression levels:
Input files: pngpixels, jpeg, executable, html