Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized CRC32 framework with Power crc32 c optimized function #335

Closed
wants to merge 12 commits into from
Closed

Optimized CRC32 framework with Power crc32 c optimized function #335

wants to merge 12 commits into from

Conversation

grooverdan
Copy link

@madler ,

On this year anniversary of the previous release I present some code changes to enable runtime selection of optimized crc32 code on glibc platforms, and falls back to a function pointer implementation for other platforms and static libraries. These capabilities are detected at configure time.

I have also include an optimized crc32 for IBM Power8 and newer hardware.

Other arches like in #251 should be able to leverage this framework with minimal effort.

I authorize these changes to be distributed under the zlib license.

I'm happy to make changes to meet maintainability requirements.

…hard/crc32-vpmsum/

This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>
Correct argument types and ensure that a buffer pointer
of 0 returns 0ULL consistent with existing crc32 functions.
To support runtime optimization of crc32 GNU IFUNC capabilities
allows zlib to return an optimized crc32_z function pointer that is
resolved at runtime to the existing crc32_z name and is able to be
used by existing applications.

There are two code forms in which this can be defined; a native
attribute, and; an asm type defination which will work with older
gcc compilers.

crc32_ifunc is added as a function that is called by the glibc
loader if the IFUNC capability exists to resolve the crc32_z
function. If the IFUNC capabilies don't exists it will otherwise
returns a function pointer on the first instigation of the crc32_z
function call. For staticly compiled code the function pointer
variant of this function is used.

crc32_ifunc provides a point of expansion for returning other
optimized crc32 implementations for other architectures.

DYNAMIC_CRC_TABLE/make_crc_table now occurs to the crc32_ifunc
and only if an crc32 function (crc32_big/crc32_little/
crc32_table_lookup) that use the generated table.

As a result lazy binding occurs (the default for glibc) on the
calling of make_crc_table occurs on the first use of crc32/crc32_z.
Compile time linker options, environment LD_BIND_NOW, hardened
compilers etc, will solve functions, i.e. the IFUNC, on symbold
initialisation to occur as the program is loaded rather
than on first use of crc32/crc32_z. The generation of this table
will be farely minor compared to the other non-lazy bindings
occuring.

As crc32_big/crc32_little are optimized functions these are used
as a fallback to any optimized implemented (provided NO_BYFOUR
isn't defined) these will now be called directly for a crc32/
crc32_z and as such the 'if (buf == Z_NULL) return 0UL;' needed to
be introduced to these functions.

The table lookup implementation of crc32 is moved to
crc32_table_lookup and used a function of last resort.
Power Architecture is detected in the configure script and adds
optimized code to PIC_OBJC/OBJC.

Power8 crc32 performance
------------------------

Test - decompressing a jdk binary:

Before (no optimized crc32_vpmsum (disabled in crc32_z_ifunc):

$ time ./minigzip -d -c  ../ibm-java-i386-sdk-8.0-5.0.bin.gz > ../ibm-java-i386-sdk-8.0-5.0.bin.restored

real    0m2.972s
user    0m2.292s
sys     0m0.100s

perf report -g --no-children:
   52.26%  minigzip  minigzip          [.] crc32_little
   18.86%  minigzip  minigzip          [.] inflate_fast
    4.87%  minigzip  [unknown]         [k] 0xc000000000063748
    4.87%  minigzip  libc-2.23.so      [.] __memcpy_power7
    1.56%  minigzip  minigzip          [.] inflate
    0.96%  minigzip  minigzip          [.] inflate_table
    0.95%  minigzip  libc-2.23.so      [.] _IO_fwrite
    0.61%  minigzip  minigzip          [.] inflateCodesUsed

Using crc32_vpmsum:

$ time ./minigzip -d -c  ../ibm-java-i386-sdk-8.0-5.0.bin.gz > ../ibm-java-i386-sdk-8.0-5.0.bin.restored

real    0m0.895s
user    0m0.224s
sys     0m0.092s

perf report -g --no-children:
   36.49%  minigzip  minigzip          [.] inflate_fast
   11.60%  minigzip  [unknown]         [k] 0xc000000000063748
    7.93%  minigzip  libc-2.23.so      [.] __memcpy_power7
    3.77%  minigzip  minigzip          [.] crc32_vpmsum
    3.70%  minigzip  minigzip          [.] inflate_table
    2.29%  minigzip  minigzip          [.] inflate
const unsigned char FAR *buf;
z_size_t len;
{
static unsigned long ZEXPORT (*crc32_func)(unsigned long, const unsigned char FAR *, z_size_t) = NULL;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a static plus invoking a function pointer seems potentially more expensive than testing for a bool (at least on ARM it was).

I'm talking about:
https://chromium-review.googlesource.com/c/chromium/src/+/801108/12/third_party/zlib/crc32.c

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you got a test framework you tried on this? I'm keen to make something efficient across all platforms. I'm trusting the glibc folks a lot to get IFUNC right for the shard library case as the crc32_z_ifunc return pointer becomes crc32_z in the trampolines and wouldn't mind to validate efficiencies of function pointer vs bool vs ifunc ways.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our optimization work in zlib targeted to improve PNG decoding performance in Chromium (that has both tracing tools as also an image_decoder benchmarker).

The results using a function pointer were worst in the device tested (ARMv8) for the PNG corpus we used (but remember that Chromium is built for ARM using clang).

Also Chromium has just added a compression/decompression benchmarker, it may be helpful to have a look on:
https://cs.chromium.org/chromium/src/third_party/zlib/contrib/bench/

unsigned long crc;
const unsigned char FAR *buf;
z_size_t len;
{
if (buf == Z_NULL) return 0UL;

crc = crc ^ 0xffffffffUL;
while (len >= 8) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to add some considerable amount of code in zlib's crc32 file.

I wonder if you could integrate your patch in a less intrusive way? (i.e. less is better)

I think a good target would be to try to easy the burden of maintenance of zlib.

Copy link
Author

@grooverdan grooverdan Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit is largely moving the table implementation to its own independent function. It lowers the maintenance burden by uncoupling it from the other optimized implementations.

There is also a bit of boiler plate added here to support two forms of glibc compilation and a static compilation and non-glibc (function pointer).

I could drop out the glibc ifunc however this it a bit that provides an advantage for ARM and PowerPC64 and provides a way to push out the hardware feature detection out of the main line of the crc32 calculation. The function call in a shared library is a trampoline function pointer anyway as I understand it so its keeping it to one is probably beneficial.

I've cleanly separated each commit to make it easier to review though a2f5adc doesn't look as intuitive as the real change is.

@@ -826,6 +826,91 @@ EOF
fi
fi

# test to see if Power8+ implementation is compile time possible
echo >> configure.log
Copy link

@Adenilson Adenilson Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you forgot about the CMake buildsystem?

I personally think that any new feature should be added only in the CMakeLists.txt (and consider the configure script in some kind of minimal maintenance mode for legacy systems).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unsure as to how up to date this was. I can add this configuration easy enough. Thanks for the advice.

static unsigned long ZEXPORT (*crc32_func)(unsigned long, const unsigned char FAR *, z_size_t) = NULL;

if (!crc32_func)
crc32_func = crc32_z_ifunc();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential issue: would this work in a multithreaded application? Seems a bit race-condition-prone.

Maybe use a pthread_lock or pthread_once for the initialization?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend to validate this with a highly parallel application (i.e. Chrome) or pigz (https://github.com/madler/pigz).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question about thread safety. I put it to https://github.com/Dead2/zlib-ng/issues/147 who rely on function pointers extensively. Though they would have the equivalent of a crc32_func pointing at crc32_z_ifunc which modified the function pointer. I think I could do an approach that way which would avoid the high 32bits of crc32_func being modified, checked in another thread to be !0 and then called without the lower 32bits being set. Having said that I was unable to generate a normal ASM code that did it that way.

Does pthread_once work for initialising in static libraries? I'll try to avoid locks in the path as that seems more harmful than function pointers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grooverdan they should work as zlib is built as a static lib in Chromium.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dynamic:

$ LD_LIBRARY_PATH=../zlib ldd    ./pigz
	linux-vdso64.so.1 =>  (0x00003fff8aaa0000)
	libm.so.6 => /lib/powerpc64le-linux-gnu/libm.so.6 (0x00003fff8a9a0000)
	libpthread.so.0 => /lib/powerpc64le-linux-gnu/libpthread.so.0 (0x00003fff8a960000)
	libz.so.1 => ../zlib/libz.so.1 (0x00003fff8a910000)
	libc.so.6 => /lib/powerpc64le-linux-gnu/libc.so.6 (0x00003fff8a740000)
	/lib64/ld64.so.2 (0x00003fff8aac0000)

Tested with:

$ LD_LIBRARY_PATH=../zlib    ./pigz  -J 4096  -c  /var/lib/libvirt/images/ceph.ceph0.img | LD_LIBRARY_PATH=../zlib-1.2.11/ ./unpigz -J 4096  > /dev/null

(where the 1.2.11 was the last release compiled version)

static compile of pigz

danielgb@p87:~/pigz$ gcc -o pigz pigz.o yarn.o try.o deflate.o blocksplitter.o tree.o lz77.o cache.o hash.o util.o squeeze.o katajainen.o -lm -lpthread    -Wl,-Bstatic -L ../zlib -lz -Wl,-Bdynamic
danielgb@p87:~/pigz$ LD_LIBRARY_PATH=../zlib ldd pigz
	linux-vdso64.so.1 =>  (0x00003fffa5c50000)
	libm.so.6 => /lib/powerpc64le-linux-gnu/libm.so.6 (0x00003fffa5b50000)
	libpthread.so.0 => /lib/powerpc64le-linux-gnu/libpthread.so.0 (0x00003fffa5b10000)
	libc.so.6 => /lib/powerpc64le-linux-gnu/libc.so.6 (0x00003fffa5940000)
	/lib64/ld64.so.2 (0x00003fffa5c70000)

tested fine on same test

zlib_bench:

(below where run on a loaded server, relative comparisons may not be sane)

dynamic:

 LD_LIBRARY_PATH=zlib ./zlib_bench_clang  zlib  /var/lib/libvirt/images/guest-be-6.img
/var/lib/libvirt/images/guest-be-6.img :
ZLIB: [b 1M] bytes 1073741824 -> 1073110  0.1% comp 100.0 (100.2) MB/s uncomp 182.4 (183.9) MB/s

static:

danielgb@p87:~$ clang++ -O3 -Wall -std=c++11 -lstdc++ -I zlib zlib_bench.cc zlib/libz.a  -o zlib_bench_clang_static
$ ./zlib_bench_clang_static  zlib  /var/lib/libvirt/images/guest-be-6.img
/var/lib/libvirt/images/guest-be-6.img :
ZLIB: [b 1M] bytes 1073741824 -> 1073110  0.1% comp 100.4 (100.5) MB/s uncomp 183.8 (184.0) MB/s

@julian-klode
Copy link

julian-klode commented Jan 29, 2018

shouldn't this be somewhere in contrib/power or something? At least that's where all the other arch-specific code seems to live.

@grooverdan
Copy link
Author

@julian-klode good point. Can move this easy enough.

Rogerio Alves and others added 3 commits August 26, 2019 13:18
This commit includes a CRC32 test (crc32_test). This tests are important
since some architectures may want include CPU dependent optimizations for
CRC32 algorithm like using vector instructions and we may want to
validate those.
$ make CFLAGS=--coverage SFLAGS=--coverage test
..
$ gcov crc32_power8.c  -m
File 'contrib/power8-crc/vec_crc32.c'
Lines executed:95.92% of 245
Creating 'vec_crc32.c.gcov'

multiple 32k blocks not tested (is in upstream however)
@grooverdan
Copy link
Author

rewritten much better in #478, Thanks @mscastanho

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants