Add optimized longest_match for Power processors #459

Open: wants to merge 2 commits into base: develop

mscastanho

Hello again,

This optimization uses VSX vector (SIMD) instructions to match multiple bytes at once while searching for the longest match. A 16-byte vector load and comparison adds only a small overhead compared to the scalar equivalents, so the optimized longest_match tries to match as many bytes as possible with each comparison.
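
As a rough illustration of the idea, here is a portable C sketch that compares 16 bytes per step and only falls back to byte-by-byte comparison to locate the first mismatch. It is not the code from this PR: `match_length` and `CHUNK` are hypothetical names, and `memcmp` on a 16-byte block stands in for the VSX vector load and compare.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 16  /* bytes compared per step, mirroring one 16-byte vector */

/* Hypothetical helper: length of the common prefix of a and b,
 * limited to max_len bytes. */
static size_t match_length(const unsigned char *a, const unsigned char *b,
                           size_t max_len)
{
    size_t len = 0;

    /* Fast path: skip whole 16-byte chunks while they are identical. */
    while (len + CHUNK <= max_len && memcmp(a + len, b + len, CHUNK) == 0)
        len += CHUNK;

    /* Slow path: find the first differing byte in the mismatching
     * chunk, or finish a tail shorter than one chunk. */
    while (len < max_len && a[len] == b[len])
        len++;

    return len;
}
```

In a vectorized version the slow path can be replaced by a single instruction that counts the equal bytes in the comparison result, which is where most of the saving would be expected to come from.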

This PR shares one commit with #457 and #458; that commit can be dropped from this PR if either of those is merged first. The PR also uses GNU indirect functions to select which version of longest_match (optimized or default) to run. The selection happens at run time, on the first call to longest_match.

To test the performance improvement, we used Chromium's zlib_bench.cc with input files from jsnell/zlib-bench.

The results below show compression throughput in MB/s using raw deflate, for all compression levels. The gain column is the relative throughput increase of the optimized version over the default:

  • pngpixels

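    | comp lvl | default (MB/s) | optimized (MB/s) | gain |
    |---|---|---|---|
    | 1 | 67.5 | 73.0 | +8.15% |
    | 2 | 59.0 | 65.3 | +10.68% |
    | 3 | 38.8 | 45.2 | +16.49% |
    | 4 | 42.0 | 46.0 | +9.52% |
    | 5 | 26.7 | 31.6 | +18.35% |
    | 6 | 13.8 | 16.5 | +19.57% |
    | 7 | 8.9 | 10.6 | +19.10% |
    | 8 | 2.8 | 3.4 | +21.43% |
    | 9 | 1.3 | 1.5 | +15.38% |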
  • jpeg

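    | comp lvl | default (MB/s) | optimized (MB/s) | gain |
    |---|---|---|---|
    | 1 | 20.0 | 20.5 | +2.50% |
    | 2 | 20.2 | 20.3 | +0.50% |
    | 3 | 20.2 | 20.3 | +0.50% |
    | 4 | 20.3 | 20.4 | +0.49% |
    | 5 | 20.3 | 20.4 | +0.49% |
    | 6 | 20.3 | 20.4 | +0.49% |
    | 7 | 20.3 | 20.4 | +0.49% |
    | 8 | 19.9 | 20.4 | +2.51% |
    | 9 | 20.3 | 20.4 | +0.49% |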
  • executable

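    | comp lvl | default (MB/s) | optimized (MB/s) | gain |
    |---|---|---|---|
    | 1 | 41.2 | 43.1 | +4.61% |
    | 2 | 37.8 | 39.2 | +3.70% |
    | 3 | 28.9 | 29.9 | +3.46% |
    | 4 | 28.3 | 28.9 | +2.12% |
    | 5 | 20.2 | 21.4 | +5.94% |
    | 6 | 12.5 | 13.1 | +4.80% |
    | 7 | 9.5 | 9.9 | +4.21% |
    | 8 | 5.4 | 5.6 | +3.70% |
    | 9 | 4.1 | 4.2 | +2.44% |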
  • html

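    | comp lvl | default (MB/s) | optimized (MB/s) | gain |
    |---|---|---|---|
    | 1 | 43.0 | 46.2 | +7.44% |
    | 2 | 38.5 | 42.2 | +9.61% |
    | 3 | 27.8 | 30.8 | +10.79% |
    | 4 | 28.3 | 30.8 | +8.83% |
    | 5 | 18.1 | 20.1 | +11.05% |
    | 6 | 12.2 | 13.2 | +8.20% |
    | 7 | 10.6 | 11.4 | +7.55% |
    | 8 | 8.0 | 8.7 | +8.75% |
    | 9 | 7.9 | 8.6 | +8.86% |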

@mscastanho
Author

Force-pushed to add changes to the feature detection in configure.

Optimized functions for Power will make use of GNU indirect functions,
an extension to support different implementations of the same function,
which can be selected during runtime. This will be used to provide
optimized functions for different processor versions.

Since this is a GNU extension, we placed the definition of the Z_IFUNC
macro under `contrib/gcc`. This can be reused by other archs as well.
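
For readers unfamiliar with the mechanism, the following is a minimal sketch of a GNU indirect function. It does not reproduce the `Z_IFUNC` macro, whose definition is not shown here. All names (`sum_bytes`, `resolve_sum_bytes`, `prefer_optimized`) are hypothetical, the resolver hard-codes its choice instead of probing the processor, and the `#else` branch is only an assumed fallback for toolchains without ifunc support.

```c
#include <stdio.h>   /* also pulls in the libc feature macros checked below */

/* Two interchangeable implementations (hypothetical names). */
static unsigned sum_bytes_generic(const unsigned char *p, unsigned n)
{
    unsigned s = 0;
    while (n--)
        s += *p++;
    return s;
}

static unsigned sum_bytes_optimized(const unsigned char *p, unsigned n)
{
    /* Stand-in for a processor-specific version: same result,
     * loop unrolled by two. */
    unsigned s = 0, i = 0;
    for (; i + 2 <= n; i += 2)
        s += (unsigned)p[i] + p[i + 1];
    if (i < n)
        s += p[i];
    return s;
}

/* Assumption: a real resolver would derive this from CPU feature bits. */
static const int prefer_optimized = 1;

#if defined(__GNUC__) && defined(__ELF__) && defined(__GLIBC__)
/* Resolver: executed by the dynamic loader when the symbol is bound;
 * it returns the address of the implementation to use from then on. */
static unsigned (*resolve_sum_bytes(void))(const unsigned char *, unsigned)
{
    return prefer_optimized ? sum_bytes_optimized : sum_bytes_generic;
}

unsigned sum_bytes(const unsigned char *p, unsigned n)
    __attribute__((ifunc("resolve_sum_bytes")));
#else
/* Fallback without ifunc support: pick the implementation on each call. */
unsigned sum_bytes(const unsigned char *p, unsigned n)
{
    return prefer_optimized ? sum_bytes_optimized(p, n)
                            : sum_bytes_generic(p, n);
}
#endif
```

Callers simply call `sum_bytes(buf, len)`. With ifunc the choice is made once when the symbol is bound, so later calls pay no extra dispatch cost beyond the usual indirect call through the PLT.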

Author: Matheus Castanho <msc@linux.ibm.com>
Author: Rogerio Alves <rcardoso@linux.ibm.com>
* bytes where LSB == 0 is the same as counting the length of the match.
*/
#ifdef __LITTLE_ENDIAN__
asm volatile("vctzlsbb %0, %1\n\t" : "=r" (len) : "v" (vc));
Contributor

The assembly in both versions is identical. Is this intended?

Contributor

Actually I was wrong. One letter off.
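
The counting step discussed in this thread can be emulated in scalar C. The sketch below assumes the comparison vector holds 0x00 in every byte where the two inputs are equal and 0xFF where they differ, which is what the truncated code comment suggests. The helper names are hypothetical, and bytes are scanned in memory order; the mapping between memory order and vector element order on each endianness is not modeled.

```c
#include <stddef.h>

#define VEC_BYTES 16

/* Scalar stand-in for a 16-byte "not equal" vector compare:
 * 0x00 where the inputs match, 0xFF where they differ. */
static void compare_ne16(const unsigned char *a, const unsigned char *b,
                         unsigned char mask[VEC_BYTES])
{
    for (size_t i = 0; i < VEC_BYTES; i++)
        mask[i] = (a[i] != b[i]) ? 0xFF : 0x00;
}

/* Count bytes, in memory order, until the first one whose least
 * significant bit is set. With the mask above this is the number of
 * matching bytes before the first mismatch. */
static unsigned count_lsb_zero_bytes(const unsigned char mask[VEC_BYTES])
{
    unsigned n = 0;
    while (n < VEC_BYTES && (mask[n] & 1u) == 0)
        n++;
    return n;
}
```

A result of 16 means the whole vector matched and the search can continue with the next 16 bytes; any smaller value is the offset of the first mismatch inside the vector.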

This commit introduces an optimized version of the longest_match
function for Power processors. It uses VSX instructions to match
16 bytes at a time on each comparison, instead of one by one.

Author: Matheus Castanho <msc@linux.ibm.com>