-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized CRC32 framework with Power crc32 c optimized function #335
Conversation
…hard/crc32-vpmsum/ This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>
Correct argument types and ensure that a buffer pointer of 0 returns 0ULL consistent with existing crc32 functions.
To support runtime optimization of crc32 GNU IFUNC capabilities allows zlib to return an optimized crc32_z function pointer that is resolved at runtime to the existing crc32_z name and is able to be used by existing applications. There are two code forms in which this can be defined; a native attribute, and; an asm type defination which will work with older gcc compilers. crc32_ifunc is added as a function that is called by the glibc loader if the IFUNC capability exists to resolve the crc32_z function. If the IFUNC capabilies don't exists it will otherwise returns a function pointer on the first instigation of the crc32_z function call. For staticly compiled code the function pointer variant of this function is used. crc32_ifunc provides a point of expansion for returning other optimized crc32 implementations for other architectures. DYNAMIC_CRC_TABLE/make_crc_table now occurs to the crc32_ifunc and only if an crc32 function (crc32_big/crc32_little/ crc32_table_lookup) that use the generated table. As a result lazy binding occurs (the default for glibc) on the calling of make_crc_table occurs on the first use of crc32/crc32_z. Compile time linker options, environment LD_BIND_NOW, hardened compilers etc, will solve functions, i.e. the IFUNC, on symbold initialisation to occur as the program is loaded rather than on first use of crc32/crc32_z. The generation of this table will be farely minor compared to the other non-lazy bindings occuring. As crc32_big/crc32_little are optimized functions these are used as a fallback to any optimized implemented (provided NO_BYFOUR isn't defined) these will now be called directly for a crc32/ crc32_z and as such the 'if (buf == Z_NULL) return 0UL;' needed to be introduced to these functions. The table lookup implementation of crc32 is moved to crc32_table_lookup and used a function of last resort.
Power Architecture is detected in the configure script and adds optimized code to PIC_OBJC/OBJC. Power8 crc32 performance ------------------------ Test - decompressing a jdk binary: Before (no optimized crc32_vpmsum (disabled in crc32_z_ifunc): $ time ./minigzip -d -c ../ibm-java-i386-sdk-8.0-5.0.bin.gz > ../ibm-java-i386-sdk-8.0-5.0.bin.restored real 0m2.972s user 0m2.292s sys 0m0.100s perf report -g --no-children: 52.26% minigzip minigzip [.] crc32_little 18.86% minigzip minigzip [.] inflate_fast 4.87% minigzip [unknown] [k] 0xc000000000063748 4.87% minigzip libc-2.23.so [.] __memcpy_power7 1.56% minigzip minigzip [.] inflate 0.96% minigzip minigzip [.] inflate_table 0.95% minigzip libc-2.23.so [.] _IO_fwrite 0.61% minigzip minigzip [.] inflateCodesUsed Using crc32_vpmsum: $ time ./minigzip -d -c ../ibm-java-i386-sdk-8.0-5.0.bin.gz > ../ibm-java-i386-sdk-8.0-5.0.bin.restored real 0m0.895s user 0m0.224s sys 0m0.092s perf report -g --no-children: 36.49% minigzip minigzip [.] inflate_fast 11.60% minigzip [unknown] [k] 0xc000000000063748 7.93% minigzip libc-2.23.so [.] __memcpy_power7 3.77% minigzip minigzip [.] crc32_vpmsum 3.70% minigzip minigzip [.] inflate_table 2.29% minigzip minigzip [.] inflate
const unsigned char FAR *buf; | ||
z_size_t len; | ||
{ | ||
static unsigned long ZEXPORT (*crc32_func)(unsigned long, const unsigned char FAR *, z_size_t) = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a static plus invoking a function pointer seems potentially more expensive than testing for a bool (at least on ARM it was).
I'm talking about:
https://chromium-review.googlesource.com/c/chromium/src/+/801108/12/third_party/zlib/crc32.c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you got a test framework you tried on this? I'm keen to make something efficient across all platforms. I'm trusting the glibc folks a lot to get IFUNC right for the shard library case as the crc32_z_ifunc return pointer becomes crc32_z in the trampolines and wouldn't mind to validate efficiencies of function pointer vs bool vs ifunc ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our optimization work in zlib targeted to improve PNG decoding performance in Chromium (that has both tracing tools as also an image_decoder benchmarker).
The results using a function pointer were worst in the device tested (ARMv8) for the PNG corpus we used (but remember that Chromium is built for ARM using clang).
Also Chromium has just added a compression/decompression benchmarker, it may be helpful to have a look on:
https://cs.chromium.org/chromium/src/third_party/zlib/contrib/bench/
unsigned long crc; | ||
const unsigned char FAR *buf; | ||
z_size_t len; | ||
{ | ||
if (buf == Z_NULL) return 0UL; | ||
|
||
crc = crc ^ 0xffffffffUL; | ||
while (len >= 8) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to add some considerable amount of code in zlib's crc32 file.
I wonder if you could integrate your patch in a less intrusive way? (i.e. less is better)
I think a good target would be to try to easy the burden of maintenance of zlib.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit is largely moving the table implementation to its own independent function. It lowers the maintenance burden by uncoupling it from the other optimized implementations.
There is also a bit of boiler plate added here to support two forms of glibc compilation and a static compilation and non-glibc (function pointer).
I could drop out the glibc ifunc however this it a bit that provides an advantage for ARM and PowerPC64 and provides a way to push out the hardware feature detection out of the main line of the crc32 calculation. The function call in a shared library is a trampoline function pointer anyway as I understand it so its keeping it to one is probably beneficial.
I've cleanly separated each commit to make it easier to review though a2f5adc doesn't look as intuitive as the real change is.
@@ -826,6 +826,91 @@ EOF | |||
fi | |||
fi | |||
|
|||
# test to see if Power8+ implementation is compile time possible | |||
echo >> configure.log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you forgot about the CMake buildsystem?
I personally think that any new feature should be added only in the CMakeLists.txt (and consider the configure script in some kind of minimal maintenance mode for legacy systems).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was unsure as to how up to date this was. I can add this configuration easy enough. Thanks for the advice.
static unsigned long ZEXPORT (*crc32_func)(unsigned long, const unsigned char FAR *, z_size_t) = NULL; | ||
|
||
if (!crc32_func) | ||
crc32_func = crc32_z_ifunc(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another potential issue: would this work in a multithreaded application? Seems a bit race-condition-prone.
Maybe use a pthread_lock or pthread_once for the initialization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend to validate this with a highly parallel application (i.e. Chrome) or pigz (https://github.com/madler/pigz).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question about thread safety. I put it to https://github.com/Dead2/zlib-ng/issues/147 who rely on function pointers extensively. Though they would have the equivalent of a crc32_func pointing at crc32_z_ifunc which modified the function pointer. I think I could do an approach that way which would avoid the high 32bits of crc32_func being modified, checked in another thread to be !0 and then called without the lower 32bits being set. Having said that I was unable to generate a normal ASM code that did it that way.
Does pthread_once work for initialising in static libraries? I'll try to avoid locks in the path as that seems more harmful than function pointers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@grooverdan they should work as zlib is built as a static lib in Chromium.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dynamic:
$ LD_LIBRARY_PATH=../zlib ldd ./pigz linux-vdso64.so.1 => (0x00003fff8aaa0000) libm.so.6 => /lib/powerpc64le-linux-gnu/libm.so.6 (0x00003fff8a9a0000) libpthread.so.0 => /lib/powerpc64le-linux-gnu/libpthread.so.0 (0x00003fff8a960000) libz.so.1 => ../zlib/libz.so.1 (0x00003fff8a910000) libc.so.6 => /lib/powerpc64le-linux-gnu/libc.so.6 (0x00003fff8a740000) /lib64/ld64.so.2 (0x00003fff8aac0000)
Tested with:
$ LD_LIBRARY_PATH=../zlib ./pigz -J 4096 -c /var/lib/libvirt/images/ceph.ceph0.img | LD_LIBRARY_PATH=../zlib-1.2.11/ ./unpigz -J 4096 > /dev/null
(where the 1.2.11 was the last release compiled version)
static compile of pigz
danielgb@p87:~/pigz$ gcc -o pigz pigz.o yarn.o try.o deflate.o blocksplitter.o tree.o lz77.o cache.o hash.o util.o squeeze.o katajainen.o -lm -lpthread -Wl,-Bstatic -L ../zlib -lz -Wl,-Bdynamic danielgb@p87:~/pigz$ LD_LIBRARY_PATH=../zlib ldd pigz linux-vdso64.so.1 => (0x00003fffa5c50000) libm.so.6 => /lib/powerpc64le-linux-gnu/libm.so.6 (0x00003fffa5b50000) libpthread.so.0 => /lib/powerpc64le-linux-gnu/libpthread.so.0 (0x00003fffa5b10000) libc.so.6 => /lib/powerpc64le-linux-gnu/libc.so.6 (0x00003fffa5940000) /lib64/ld64.so.2 (0x00003fffa5c70000)
tested fine on same test
zlib_bench:
(below where run on a loaded server, relative comparisons may not be sane)
dynamic:
LD_LIBRARY_PATH=zlib ./zlib_bench_clang zlib /var/lib/libvirt/images/guest-be-6.img /var/lib/libvirt/images/guest-be-6.img : ZLIB: [b 1M] bytes 1073741824 -> 1073110 0.1% comp 100.0 (100.2) MB/s uncomp 182.4 (183.9) MB/s
static:
danielgb@p87:~$ clang++ -O3 -Wall -std=c++11 -lstdc++ -I zlib zlib_bench.cc zlib/libz.a -o zlib_bench_clang_static $ ./zlib_bench_clang_static zlib /var/lib/libvirt/images/guest-be-6.img /var/lib/libvirt/images/guest-be-6.img : ZLIB: [b 1M] bytes 1073741824 -> 1073110 0.1% comp 100.4 (100.5) MB/s uncomp 183.8 (184.0) MB/s
shouldn't this be somewhere in |
@julian-klode good point. Can move this easy enough. |
This commit includes a CRC32 test (crc32_test). This tests are important since some architectures may want include CPU dependent optimizations for CRC32 algorithm like using vector instructions and we may want to validate those.
$ make CFLAGS=--coverage SFLAGS=--coverage test .. $ gcov crc32_power8.c -m File 'contrib/power8-crc/vec_crc32.c' Lines executed:95.92% of 245 Creating 'vec_crc32.c.gcov' multiple 32k blocks not tested (is in upstream however)
rewritten much better in #478, Thanks @mscastanho |
@madler ,
On this year anniversary of the previous release I present some code changes to enable runtime selection of optimized crc32 code on glibc platforms, and falls back to a function pointer implementation for other platforms and static libraries. These capabilities are detected at configure time.
I have also include an optimized crc32 for IBM Power8 and newer hardware.
Other arches like in #251 should be able to leverage this framework with minimal effort.
I authorize these changes to be distributed under the zlib license.
I'm happy to make changes to meet maintainability requirements.