Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optimized crc32 for Power 8+ processors #478

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

mscastanho
Copy link

Hi @madler and zlib community,

This is an updated version of @grooverdan 's PR #335. It includes the following changes:

  • Rebase against develop branch
  • Use new ifunc infrastructure from other Power-related PRs
  • Fix warnings on crc32_test.c when compiling with -Wall
  • Support for CMake
  • Match zlib's use of unsigned long on crc32
  • Dropped clang support (see below)

Regarding clang, it is not safe to call getauxval() (and other glibc functions) from inside an ifunc resolver, so we ended up using __builtin_cpu_supports for feature detection. This builtin is not currently supported by clang on Power. When using clang, the code will compile fine, but without the optimizations. Still, some of the workarounds for clang throughout the code were left as they seem harmless for other compilers.

To measure the performance improvement, we used Chromium's zlib_bench.cc with input files from jsnell/zlib-bench.

The results below show decompression throughput in MB/s using gzip wrapper, for all compression levels:

  • pngpixels

    comp lvl default crc32-power gain
    1 290.5 375.4 +29.23%
    2 311.5 411.7 +32.17%
    3 349.5 482.3 +38.00%
    4 254.2 317.1 +24.74%
    5 273.9 349.4 +27.56%
    6 313.7 417.6 +33.12%
    7 330.8 448.3 +35.52%
    8 365.5 514.7 +40.82%
    9 383.6 551.7 +43.82%
  • jpeg

    comp lvl default crc32-power gain
    1 406.3 619.3 +52.42%
    2 361.0 520.5 +44.18%
    3 369.7 539.7 +45.98%
    4 742.4 1938.7 +161.14%
    5 740.2 1961.9 +165.05%
    6 740.3 1950.7 +163.50%
    7 737.9 1956.8 +165.18%
    8 731.2 1936.4 +164.82%
    9 743.2 1961.5 +163.93%
  • executable

    comp lvl default crc32-power gain
    1 210.4 251.7 +19.63%
    2 217.0 261.2 +20.37%
    3 224.6 272.7 +21.42%
    4 212.6 255.7 +20.27%
    5 216.9 261.8 +20.70%
    6 222.8 270.5 +21.41%
    7 224.5 272.5 +21.38%
    8 227.1 277.1 +22.02%
    9 227.7 277.4 +21.83%
  • html

    comp lvl default crc32-power gain
    1 237.1 288.9 +21.85%
    2 245.9 302.0 +22.81%
    3 256.4 318.0 +24.02%
    4 247.3 306.0 +23.74%
    5 245.5 302.9 +23.38%
    6 250.1 309.9 +23.91%
    7 250.2 309.9 +23.86%
    8 250.6 310.3 +23.82%
    9 250.4 309.9 +23.76%

mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Jun 16, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Jun 17, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Jun 18, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Jun 18, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Jun 22, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Jun 22, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Jun 24, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Jun 25, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Jul 15, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Aug 10, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Aug 10, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Aug 11, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
mscastanho added a commit to mscastanho/zlib-ng that referenced this pull request Aug 11, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).
Dead2 pushed a commit to zlib-ng/zlib-ng that referenced this pull request Aug 12, 2021
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm. Decompression
times were improved by +30% on tests.

Based on Daniel Black's work for the original zlib (madler/zlib#478).

Z_IFUNC(crc32_z) {
#ifdef Z_POWER8
if (__builtin_cpu_supports("arch_2_07"))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed off-PR, I think this needs to check for VCRYPTO.

@ljavorsk
Copy link

ljavorsk commented Apr 6, 2022

Hi @mscastanho, we are just rebasing the new release from zlib on Fedora and RHEL and this PR is part of our downstream patches.
However, after the new release was released, it brings a lot of new conflicts.

Is there any way you are working on fixing these conflicts?

If not we are not sure we can backport this anymore due to the complicity of this PR.

Looking forward to your reply.
Thank you for your contribution :)

@mscastanho
Copy link
Author

@ljavorsk I tried a local rebase and couldn't see any conflicts. Based on the Fedora zlib repo I suspect you're using the old patch from #335. In which case I'd suggest to replace it with this PR, which has some other fixes.

Rogerio Alves and others added 3 commits April 6, 2022 09:50
Optimized functions for Power will make use of GNU indirect functions,
an extension to support different implementations of the same function,
which can be selected during runtime. This will be used to provide
optimized functions for different processor versions.

Since this is a GNU extension, we placed the definition of the Z_IFUNC
macro under `contrib/gcc`. This can be reused by other archs as well.

Author: Matheus Castanho <msc@linux.ibm.com>
Author: Rogerio Alves <rcardoso@linux.ibm.com>
This commit adds an optimized version for the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>

It makes use of vector instructions to speed up CRC32 algorithm.
Clang 7 changed the behavior of vec_xxpermdi in order to match GCC's
behavior.  After this change, code that used to work on Clang 6 stopped
to work on Clang >= 7.

Tested on Clang 6, 7, 8 and 9.

Reference: https://bugs.llvm.org/show_bug.cgi?id=38192

Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
@mscastanho
Copy link
Author

Rebased on top of current develop.

@ljavorsk
Copy link

ljavorsk commented Apr 7, 2022

Thanks @mscastanho. So you say this is the same but improved patch?
Or does it have some other differences we should know about?

Because the Fedora patch that you've linked was part of a customer request done by your IBM colleagues.
You can see it here: https://bugzilla.redhat.com/show_bug.cgi?id=1666798

@mscastanho
Copy link
Author

So you say this is the same but improved patch?

@ljavorsk Exactly. In terms of optimization, it's the same code. What changed is now it's also enabled when building with cmake, the ifunc usage was improved and there are a few other minor fixes (complete list on the PR description).

Should we open a new request to replace that with this new patch then?

@ljavorsk
Copy link

ljavorsk commented Apr 7, 2022

@mscastanho That's a great idea to open a new request, it would help us track it and we can also have some description there for future reference.

Either way, thank you for the explanation.

@ljavorsk
Copy link

Hi @mscastanho ,

Could you please also rebase the patch on top of the new zlib-1.2.13 version?

Thank you so much :)

@tuliom
Copy link

tuliom commented Oct 19, 2022

Could you please also rebase the patch on top of the new zlib-1.2.13 version?

@RajalakshmiSR, considering that @mscastanho does not have access to POWER servers anymore, could you give a hand with this, please?
This is important for Fedora and other distros.

@Neustradamus
Copy link

To follow.

@RajalakshmiSR
Copy link

@mmatti-sw Can you help to rebase?

@mmatti-sw
Copy link

I have created #750 the old patches are now rebased to v1.2.13.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants