Add optimized slide_hash for Power processors #457

Open: wants to merge 2 commits into base: develop

mscastanho

Hi,

During performance tests, we noticed that slide_hash consumes a considerable share of CPU time during compression on Power processors. This PR introduces an optimized version that uses VSX vector instructions. The main difference is that it slides 8 elements at a time, instead of one at a time as the standard code does.
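To illustrate the idea, here is a minimal sketch, not the code from this PR. It assumes 16-bit hash entries and a table length that is a multiple of 8; the names `slide_scalar` and `slide_by8` are made up for this example. The VSX path relies on a saturating subtract (`vec_subs`), which clamps entries smaller than the window size to zero just like the scalar branch does. On non-VSX builds the same 8-wide loop is emulated in plain C.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__VSX__)
#include <altivec.h>
#endif

/* Scalar reference: slides one entry per iteration.
 * Entries smaller than wsize are clamped to 0. */
static void slide_scalar(uint16_t *table, size_t n, uint16_t wsize)
{
    for (size_t i = 0; i < n; i++)
        table[i] = (uint16_t)(table[i] >= wsize ? table[i] - wsize : 0);
}

/* Slides eight 16-bit entries per iteration.
 * Assumption for this sketch: n is a multiple of 8. */
static void slide_by8(uint16_t *table, size_t n, uint16_t wsize)
{
#if defined(__VSX__)
    /* Broadcast wsize into all eight lanes of a 128-bit vector. */
    const __vector unsigned short vw = vec_splats((unsigned short)wsize);
    for (size_t i = 0; i < n; i += 8) {
        __vector unsigned short v =
            vec_vsx_ld(0, (unsigned short *)&table[i]);
        v = vec_subs(v, vw);            /* saturating subtract: floor at 0 */
        vec_vsx_st(v, 0, (unsigned short *)&table[i]);
    }
#else
    /* Portable emulation of the same 8-lane saturating subtract. */
    for (size_t i = 0; i < n; i += 8) {
        for (size_t lane = 0; lane < 8; lane++) {
            uint16_t v = table[i + lane];
            table[i + lane] = (uint16_t)(v >= wsize ? v - wsize : 0);
        }
    }
#endif
}
```

Both functions should produce identical tables for the same input; the vector version simply needs one eighth as many loop iterations.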

The implementation uses the GNU indirect function (ifunc) feature to select the function version at runtime, on the first call. Later calls go directly to the selected function, so the same binary can be used on all Power processor versions. The ifunc helper code is not specific to Power and can be reused by other architectures, so it was placed under contrib/gcc.
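For readers unfamiliar with ifunc, the following is a rough sketch of the mechanism, not the helper code from this PR. All names (`example_slide`, `resolve_example_slide`, `cpu_has_vsx`) are hypothetical, and the "VSX" variant is only a placeholder that forwards to the default code. It assumes GCC or Clang on an ELF platform whose C library supports ifunc (for example glibc); elsewhere it falls back to an ordinary wrapper.

```c
#include <stddef.h>

typedef void (*slide_fn)(unsigned short *table, size_t n,
                         unsigned short wsize);

/* Default implementation: one entry per iteration. */
static void slide_default(unsigned short *table, size_t n,
                          unsigned short wsize)
{
    for (size_t i = 0; i < n; i++)
        table[i] = (unsigned short)(table[i] >= wsize ? table[i] - wsize : 0);
}

/* Placeholder for a VSX implementation; a real one would use
 * vector instructions. Here it just reuses the default code. */
static void slide_vsx_placeholder(unsigned short *table, size_t n,
                                  unsigned short wsize)
{
    slide_default(table, n, wsize);
}

/* Hypothetical feature check. On Power with GCC this queries the
 * hardware capabilities; on other targets it reports no VSX. */
static int cpu_has_vsx(void)
{
#if defined(__powerpc64__) && defined(__GNUC__)
    return __builtin_cpu_supports("vsx");
#else
    return 0;
#endif
}

/* Resolver: returns the implementation to bind to the symbol. */
static slide_fn resolve_example_slide(void)
{
    return cpu_has_vsx() ? slide_vsx_placeholder : slide_default;
}

#if defined(__GNUC__) && defined(__ELF__)
/* The dynamic loader runs the resolver once and binds the symbol. */
void example_slide(unsigned short *table, size_t n, unsigned short wsize)
    __attribute__((ifunc("resolve_example_slide")));
#else
/* Fallback without ifunc support: resolve on every call. */
void example_slide(unsigned short *table, size_t n, unsigned short wsize)
{
    resolve_example_slide()(table, n, wsize);
}
#endif
```

Callers simply call `example_slide(table, n, wsize)`; they never see which variant was selected.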

I tried to make as few changes as possible to top-level files (deflate.c) and to place most Power-specific code under contrib/power.

To measure the performance improvement, we used Chromium's zlib_bench.cc with input files from jsnell/zlib-bench.

The results below show compression throughput in MB/s using raw deflate, for each compression level (1 to 9):

  • jpeg

    comp lvl   default (MB/s)   optimized (MB/s)   gain
    1          20.4             27.4               +34.31%
    2          20.2             26.4               +30.69%
    3          20.2             27.1               +34.16%
    4          20.3             27.3               +34.48%
    5          20.3             27.3               +34.48%
    6          20.3             27.3               +34.48%
    7          20.3             27.3               +34.48%
    8          20.3             27.3               +34.48%
    9          20.3             27.3               +34.48%
  • pngpixels

    comp lvl   default (MB/s)   optimized (MB/s)   gain
    1          67.0             98.6               +47.16%
    2          58.7             79.8               +35.95%
    3          38.8             46.7               +20.36%
    4          42.1             48.8               +15.91%
    5          26.6             29.2               +9.77%
    6          13.8             14.5               +5.07%
    7          8.9              9.2                +3.37%
    8          2.8              2.8                +0.00%
    9          1.3              1.3                +0.00%
  • executable

    comp lvl   default (MB/s)   optimized (MB/s)   gain
    1          41.3             57.6               +39.47%
    2          37.9             50.9               +34.30%
    3          29.0             36.1               +24.48%
    4          28.4             34.8               +22.54%
    5          20.2             23.2               +14.85%
    6          12.5             13.7               +9.60%
    7          9.5              10.1               +6.32%
    8          5.4              5.6                +3.70%
    9          4.1              4.2                +2.44%
  • html

    comp lvl   default (MB/s)   optimized (MB/s)   gain
    1          43.1             59.3               +37.59%
    2          38.6             50.7               +31.35%
    3          27.8             33.8               +21.58%
    4          28.3             33.1               +16.96%
    5          18.1             20.1               +11.05%
    6          12.2             13.0               +6.56%
    7          10.6             11.2               +5.66%
    8          8.0              8.4                +5.00%
    9          7.9              8.3                +5.06%
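
For reference, the gain column is consistent with the relative throughput increase over the default build. A small helper (the name `gain_percent` is made up for this note) reproduces the figures:

```c
/* Relative throughput gain in percent, given the default and
 * optimized throughput in the same unit (MB/s in the tables above). */
static double gain_percent(double default_mbps, double optimized_mbps)
{
    return (optimized_mbps / default_mbps - 1.0) * 100.0;
}
```

For example, `gain_percent(20.4, 27.4)` is roughly 34.31, matching the jpeg level 1 row.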

@mscastanho

Force-pushed to add changes to feature detection in configure.

Optimized functions for Power will make use of GNU indirect functions,
an extension to support different implementations of the same function,
which can be selected during runtime. This will be used to provide
optimized functions for different processor versions.

Since this is a GNU extension, we placed the definition of the Z_IFUNC
macro under `contrib/gcc`. This can be reused by other archs as well.

Author: Matheus Castanho <msc@linux.ibm.com>
Author: Rogerio Alves <rcardoso@linux.ibm.com>

Considerable time is spent in deflate.c:slide_hash() during
deflate. This commit introduces a new slide_hash function that
uses VSX vector instructions to slide 8 hash elements at a time,
instead of just one as the standard code does.

The choice between the optimized and default versions is made only
on the first call to the function, enabling a fallback to standard
behavior if the host processor does not support VSX instructions,
so the same binary can be used for multiple Power processor
versions.

Author: Matheus Castanho <msc@linux.ibm.com>