common/trinary: Accelerate trit tryte conversion with SIMD SSE4.2 #1443

marktwtn · 2019-10-05T02:32:12Z

The acceleration is enabled when the compiler option -msse4.2 is used
and the input size of trit/tryte for conversion is larger than 128-bit,
which is the basic operation unit of the most SSE instructions.

The implementation is complex but it does acclerate the conversion speed
as the following experiment result reveals:

trits_to_trytes()
Input size Without SIMD SSE4.2(avg nsec) With SIMD SSE4.2(avg nsec)
81, 406.3, 261.5
243, 444.6, 162.7
6561, 53166.93, 27290.99
trytes_to_trits()
Input size Without SIMD SSE4.2(avg nsec) With SIMD SSE4.2(avg nsec)
81, 355.6, 167.1
2592, 5752.3, 1751.8
2673, 6273.0, 2098.8

For more detailed experiment result, please reference:
DLTcollab/dcurl#92

Test Plan:

$ bazel test --test_output=all --copt=-msse4.2 //common/trinary/tests/...
The command is for verifying the correctness.

jserv · 2019-10-05T04:28:38Z

@semenov-vladyslav, can you help review this pull request?

marktwtn · 2019-10-09T13:51:49Z

The CI tests failed.
It seems like the failed tests is related to the cross compilation since it has the bazel option --crosstool_top and --host_crosstool_top?

oopsmonk · 2019-10-11T09:40:26Z

The CI tests failed.
It seems like the failed tests is related to the cross compilation since it has the bazel option --crosstool_top and --host_crosstool_top?

the test failed on x86_64 toolchain, you can run it by bazel build --build_tests_only --config x86_64 //mam/troika/tests/...

marktwtn · 2019-10-12T06:37:03Z

The command $ bazel build --build_tests_only --config x86_64 //... ran well on my desktop.
However, if I tried the command like the CI runs, it failed.

$ ./tools/ci_test_setup && bazel test --build_tests_only --remote_upload_local_results=false --crosstool_top=@iota_toolchains//tools/x86-64-core-i7--glibc--bleeding-edge-2018.07-1:toolchain --cpu=x86_64 --host_crosstool_top=@bazel_tools//tools/cpp:toolchain -- //... -//mobile/ios/...

Can I ask for more information about this command?
I am not really familiar with bazel.

oopsmonk · 2019-10-14T07:50:07Z

~~these two commands should have the same output. This PR has no compilation errors in my environment. it's related to the container will update later.~~
@marktwtn I was not fully wake up. The --config x86_64 uses the toolchain provided by bazel and another is the toolchain provide by bootlin for x86-64-core-i7.

marktwtn · 2019-10-15T05:36:36Z

@marktwtn I was not fully wake up. The --config x86_64 uses the toolchain provided by bazel and another is the toolchain provide by bootlin for x86-64-core-i7.

I wonder what is the main difference of these toolchains.

common/trinary/trit_tryte_sse42.h

marktwtn · 2019-11-20T08:58:53Z

I fixed a bug and refactored some code in the updated branch.

The bug was not detected since the input data size of trits_to_trytes and trytes_to_trits testing are not large enough.

Here comes with 2 questions:

Should I add another test for larger input data size or just modify the input data size of the original test?

The test invoked by buildkite CI still fails on my desktop.

$ bazel test --build_tests_only --remote_upload_local_results=false
 --crosstool_top=@iota_toolchains//tools/x86-64-core-i7--glibc--bleeding-edge-2018.07-1:toolchain
 --cpu=x86_64 --host_crosstool_top=@bazel_tools//tools/cpp:toolchain -- //... -//mobile/ios/...

What is the main difference between these and the other tests on buildkite CI?
All the error logs show that

external/bazel_tools/tools/test/test-setup.sh: line 310:    
14 Segmentation fault      (core dumped) "${TEST_PATH}" "$@" 2>&1

and I don't know why.

oopsmonk · 2019-12-10T08:00:30Z

Here comes with 2 questions:

1. Should I add another test for larger input data size or just modify the input data size of the original test?

yes, please.

2. The test invoked by buildkite CI still fails on my desktop.
   ```
   $ bazel test --build_tests_only --remote_upload_local_results=false
    --crosstool_top=@iota_toolchains//tools/x86-64-core-i7--glibc--bleeding-edge-2018.07-1:toolchain
    --cpu=x86_64 --host_crosstool_top=@bazel_tools//tools/cpp:toolchain -- //... -//mobile/ios/...
   ```
   
   
   What is the main difference between these and the other tests on buildkite CI?
   All the error logs show that
   ```
   external/bazel_tools/tools/test/test-setup.sh: line 310:    
   14 Segmentation fault      (core dumped) "${TEST_PATH}" "$@" 2>&1  
   ```
   
   
   and I don't know why.

It's caused by mysql/mariadb which may not installed on your system. since you are working on common part, using bazel test -- //common/... -//mobile/ios/... would work on your system.

You can run binaries in the bazel-bin folder, you can see memory mismatch in test cases if using the bootlin toolchain.

$ ./bazel-bin/common/crypto/iss/v1/tests/test_iss
common/crypto/iss/v1/tests/test_iss.c:97:signature_resolves_to_address_test:PASS
common/crypto/iss/v1/tests/test_iss.c:89:address_generation_test:FAIL: Memory Mismatch. Byte 0 Expected 0x44 Was 0x5A

-----------------------
2 Tests 1 Failures 0 Ignored
FAIL

$ ./bazel-bin/common/helpers/tests/test_pow
common/helpers/tests/test_pow.c:62:test_pow:FAIL: Memory Mismatch. Byte 0 Expected 0x57 Was 0x39
common/helpers/tests/test_pow.c:92:test_flex_pow:PASS

-----------------------
2 Tests 1 Failures 0 Ignored 
FAIL

marktwtn · 2019-12-10T12:36:06Z

I found another bug and fixed it in the updated branch.
Sorry for not noticing that there are other error logs showing the memory mismatching problem.

I also increased the input data size of trit tryte conversion for testing SIMD SSE4.2 acceleration.

After the bug is fixed, most of the errors disappear.
However, there are 2 tests that do not pass.

common/model/tests:test_transaction
common/model/tests:test_tryte_transaction

with the testing command

bazel test --build_tests_only --remote_upload_local_results=false
--crosstool_top=@iota_toolchains//tools/x86-64-core-i7--glibc--bleeding-edge-2018.07-1:toolchain
--cpu=x86_64 --host_crosstool_top=@bazel_tools//tools/cpp:toolchain -- //common/... -//mobile/ios/..

and error log

external/bazel_tools/tools/test/test-setup.sh: line 310:    
14 Segmentation fault      (core dumped) "${TEST_PATH}" "$@" 2>&1

oopsmonk · 2019-12-11T03:35:43Z

you can test the test_transaction directly and since it gets Segmentation fault you can enable address sanitizer like: bazel test -c dbg --config asan --config bootlin_x86_64_core_i7 //common/model/tests:test_transaction.
You can just replace //common/model/tests:test_transaction for the other case.

here is output on my system:

Executing tests from //common/model/tests:test_transaction
-----------------------------------------------------------------------------
AddressSanitizer:DEADLYSIGNAL
=================================================================
==14==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x55bda7e9241f bp 0x7ffd8a2eaa90 sp 0x7ffd8a2ea1c0 T0)
==14==The signal is caused by a READ memory access.
==14==Hint: address points to the zero page.
    #0 0x55bda7e9241e in _mm_store_si128 external/bootlin_x86_64_toolchain/lib/gcc/x86_64-buildroot-linux-gnu/8.1.0/include/emmintrin.h:714
    #1 0x55bda7e9241e in trits_to_trytes_sse42 common/trinary/trit_tryte_sse42.h:109
    #2 0x55bda7e9605e in trits_to_trytes common/trinary/trit_tryte.c:52
    #3 0x55bda7e8e922 in flex_trits_from_trits common/trinary/flex_trit.c:228
    #4 0x55bda7e8c331 in transaction_deserialize_trits common/model/transaction.c:112
    #5 0x55bda7e8d0b2 in transaction_deserialize_from_trits common/model/transaction.c:291
    #6 0x55bda7e8cfdf in transaction_deserialize common/model/transaction.c:251
    #7 0x55bda7e89c4b in test_deserialize_and_serialize common/model/tests/test_transaction.c:20
    #8 0x55bda7e999d8 in UnityDefaultTestRun external/unity/src/unity.c:1339
    #9 0x55bda7e8b296 in main common/model/tests/test_transaction.c:169
    #10 0x7fa548905b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
    #11 0x55bda7e882d9 in _start (/home/sam/.cache/bazel/_bazel_sam/c8834895b12547221dbe6a364e6a723e/execroot/org_iota_entangled/bazel-out/x86_64-dbg/bin/common/model/tests/test_transaction+0x62d9)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV external/bootlin_x86_64_toolchain/lib/gcc/x86_64-buildroot-linux-gnu/8.1.0/include/emmintrin.h:714 in _mm_store_si128
==14==ABORTING

oopsmonk · 2019-12-11T04:06:30Z

These two cases are failed on system toolchain as well when I run bazel test -c dbg --copt=-msse4.2 //common/...

//common/model/tests:test_transaction
//common/model/tests:test_tryte_transaction

marktwtn · 2019-12-11T05:56:44Z

If I change the intrinsic function from __mm_store_si128() to __mm_storeu_si128(), the errors will be fixed.
Even with the bootlin toolchain, every test is passed.

The main difference of the intrinsic function is the requirement of address alignment to 16-byte boundary.
However, I still need further investigation to give a better explanation of the importance of the address alignment.
After the investigation is finished, I will update the conclusion here and update the branch, too.

@oopsmonk thank you for your advise and debugging help.

marktwtn · 2019-12-13T07:20:36Z

Conclusion:

The load and store intrinsic function are unified to the unalignment type since we can not control the input pointer value.
The value is based on 8-byte alignment, but the alignment requirement of _mm_store_si128() and _mm_load_si128() is 16-byte alignment.

These two cases are failed on system toolchain as well when I run bazel test -c dbg --copt=-msse4.2 //common/...
* //common/model/tests:test_transaction

* //common/model/tests:test_tryte_transaction

The input pointer value of these two cases are 8-byte alignment but not 16-byte alignment, which cause the _mm_store_si128() to have general-protection exception.

I believe the unalignment intrinsic function is a better choice.

The pull request can be merged without any problem.
Is there anything I need to revise?
Such as change the variable naming convention to snake case?

oopsmonk

please use snake_case in variables for consistency.

oopsmonk · 2019-12-19T08:59:01Z

@marktwtn Could you write benchmarks for trits_to_trytes, trytes_to_trits and put it in //common/trinary/benchmark? thanks.

marktwtn · 2019-12-19T09:09:10Z

@marktwtn Could you write benchmarks for trits_to_trytes, trytes_to_trits and put it in //common/trinary/benchmark? thanks.

It seems that there are no other benchmark in the other directories for me to reference.
I guess I will do it in my own way.
Or is there any guidance or rule to follow?

oopsmonk · 2019-12-19T09:36:09Z

@marktwtn Could you write benchmarks for trits_to_trytes, trytes_to_trits and put it in //common/trinary/benchmark? thanks.

It seems that there are no other benchmark in the other directories for me to reference.
I guess I will do it in my own way.
Or is there any guidance or rule to follow?

yep, printing out the time consumed by functions looks good, like DLTcollab/dcurl#92 (comment)

oopsmonk · 2020-01-06T03:47:46Z

Hey, @marktwtn how are you? do you need any help on Bazel or else?

marktwtn · 2020-01-06T09:00:04Z

Hey, @marktwtn how are you? do you need any help on Bazel or else?

@oopsmonk I found out that performance might not behave as I expected.

The input size and testing scenario would effect the performance result, hence I am still working on it.
If the size do effect the performance, I will modify the code to use the acceleration if the input is greater than a threshold size.

marktwtn · 2020-02-13T12:27:07Z

@oopsmonk I will forced push later.

The modification includes:

The benchmark.
The threshold for using SSE4.2 acceleration.
The commit message is changed. The experiment result from the Optimize Trits/Trytes manipulation with SIMD techniques DLTcollab/dcurl#92 (comment) is removed since we have benchmark for different input size now.

The acceleration is enabled when the compiler option -msse4.2 is used and the input size of trit/tryte for conversion is larger than 384/128-bit. 128-bit is the basic operation unit of the most SSE instructions. The implementation is complex but it does accelerate the conversion speed with large input size.

In the original testing, the input trit/tryte data size are too small to test the SIMD SSE4.2 trit tryte conversion acceleration. trytes_to_trits minimum requirement: 128-bit = 16-tryte trits_to_trytes minimum requirement: 384-bit = 48-trit

The benchmark displays the minimum, maximum and average value of trit tryte conversion function of different input size. The range of input size can be modified in bench_trit_tryte.c. The default input/output tryte size range is 16 ~ 2048.

The threshold value is determined by the execution time difference. The time difference should be at least 500 nano second. The threshold experiment is run on the CPU: AMD Ryzen 5 2400G with Radeon Vega Graphics. TODO: trytes_to_trits() rarely slower when using SSE4.2 acceleration with large input.

The trit tryte SSE4.2 accleration testing is added since the threshold is added. Otherwise, the acceleration would not be tested or the testing input data need to be larger than the threshold value.

marktwtn requested review from oopsmonk, thibault-martinez and tsvisabo as code owners October 5, 2019 02:32

oopsmonk added A-common Area - Common C-perf Category -Performance S-ready-for-review Status - Ready for review labels Oct 8, 2019

semenov-vladyslav reviewed Oct 28, 2019

View reviewed changes

common/trinary/trit_tryte_sse42.h Outdated Show resolved Hide resolved

semenov-vladyslav reviewed Oct 28, 2019

View reviewed changes

common/trinary/trit_tryte_sse42.h Outdated Show resolved Hide resolved

semenov-vladyslav reviewed Oct 28, 2019

View reviewed changes

common/trinary/trit_tryte_sse42.h Outdated Show resolved Hide resolved

oopsmonk suggested changes Dec 16, 2019

View reviewed changes

marktwtn requested a review from oopsmonk December 16, 2019 10:46

marktwtn and others added 5 commits February 13, 2020 21:09

common/trinary: Add trit tryte SSE4.2 acceleration testing

5dc9d89

The trit tryte SSE4.2 accleration testing is added since the threshold is added. Otherwise, the acceleration would not be tested or the testing input data need to be larger than the threshold value.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

common/trinary: Accelerate trit tryte conversion with SIMD SSE4.2 #1443

common/trinary: Accelerate trit tryte conversion with SIMD SSE4.2 #1443

marktwtn commented Oct 5, 2019

jserv commented Oct 5, 2019

marktwtn commented Oct 9, 2019

oopsmonk commented Oct 11, 2019

marktwtn commented Oct 12, 2019

oopsmonk commented Oct 14, 2019 •

edited

Loading

marktwtn commented Oct 15, 2019

marktwtn commented Nov 20, 2019

oopsmonk commented Dec 10, 2019 •

edited

Loading

marktwtn commented Dec 10, 2019

oopsmonk commented Dec 11, 2019

oopsmonk commented Dec 11, 2019

marktwtn commented Dec 11, 2019

marktwtn commented Dec 13, 2019

oopsmonk left a comment

oopsmonk commented Dec 19, 2019

marktwtn commented Dec 19, 2019

oopsmonk commented Dec 19, 2019

oopsmonk commented Jan 6, 2020

marktwtn commented Jan 6, 2020

marktwtn commented Feb 13, 2020

common/trinary: Accelerate trit tryte conversion with SIMD SSE4.2 #1443

Are you sure you want to change the base?

common/trinary: Accelerate trit tryte conversion with SIMD SSE4.2 #1443

Conversation

marktwtn commented Oct 5, 2019

Test Plan:

jserv commented Oct 5, 2019

marktwtn commented Oct 9, 2019

oopsmonk commented Oct 11, 2019

marktwtn commented Oct 12, 2019

oopsmonk commented Oct 14, 2019 • edited Loading

marktwtn commented Oct 15, 2019

marktwtn commented Nov 20, 2019

oopsmonk commented Dec 10, 2019 • edited Loading

marktwtn commented Dec 10, 2019

oopsmonk commented Dec 11, 2019

oopsmonk commented Dec 11, 2019

marktwtn commented Dec 11, 2019

marktwtn commented Dec 13, 2019

oopsmonk left a comment

Choose a reason for hiding this comment

oopsmonk commented Dec 19, 2019

marktwtn commented Dec 19, 2019

oopsmonk commented Dec 19, 2019

oopsmonk commented Jan 6, 2020

marktwtn commented Jan 6, 2020

marktwtn commented Feb 13, 2020

oopsmonk commented Oct 14, 2019 •

edited

Loading

oopsmonk commented Dec 10, 2019 •

edited

Loading