Skip to content
This repository has been archived by the owner on Mar 12, 2020. It is now read-only.

common/trinary: Accelerate trit tryte conversion with SIMD SSE4.2 #1443

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from
Open

common/trinary: Accelerate trit tryte conversion with SIMD SSE4.2 #1443

wants to merge 5 commits into from

Conversation

marktwtn
Copy link

@marktwtn marktwtn commented Oct 5, 2019

The acceleration is enabled when the compiler option -msse4.2 is used
and the input size of trit/tryte for conversion is larger than 128-bit,
which is the basic operation unit of the most SSE instructions.

The implementation is complex but it does acclerate the conversion speed
as the following experiment result reveals:

  • trits_to_trytes()
    Input size Without SIMD SSE4.2(avg nsec) With SIMD SSE4.2(avg nsec)
    81, 406.3, 261.5
    243, 444.6, 162.7
    6561, 53166.93, 27290.99

  • trytes_to_trits()
    Input size Without SIMD SSE4.2(avg nsec) With SIMD SSE4.2(avg nsec)
    81, 355.6, 167.1
    2592, 5752.3, 1751.8
    2673, 6273.0, 2098.8

For more detailed experiment result, please reference:
DLTcollab/dcurl#92

Test Plan:

$ bazel test --test_output=all --copt=-msse4.2 //common/trinary/tests/...
The command is for verifying the correctness.

@jserv
Copy link
Contributor

jserv commented Oct 5, 2019

@semenov-vladyslav, can you help review this pull request?

@oopsmonk oopsmonk added A-common Area - Common C-perf Category -Performance S-ready-for-review Status - Ready for review labels Oct 8, 2019
@marktwtn
Copy link
Author

marktwtn commented Oct 9, 2019

The CI tests failed.
It seems like the failed tests is related to the cross compilation since it has the bazel option --crosstool_top and --host_crosstool_top?

@oopsmonk
Copy link

The CI tests failed.
It seems like the failed tests is related to the cross compilation since it has the bazel option --crosstool_top and --host_crosstool_top?

the test failed on x86_64 toolchain, you can run it by bazel build --build_tests_only --config x86_64 //mam/troika/tests/...

@marktwtn
Copy link
Author

The command $ bazel build --build_tests_only --config x86_64 //... ran well on my desktop.
However, if I tried the command like the CI runs, it failed.

$ ./tools/ci_test_setup && bazel test --build_tests_only --remote_upload_local_results=false --crosstool_top=@iota_toolchains//tools/x86-64-core-i7--glibc--bleeding-edge-2018.07-1:toolchain --cpu=x86_64 --host_crosstool_top=@bazel_tools//tools/cpp:toolchain -- //... -//mobile/ios/...

Can I ask for more information about this command?
I am not really familiar with bazel.

@oopsmonk
Copy link

oopsmonk commented Oct 14, 2019

these two commands should have the same output. This PR has no compilation errors in my environment. it's related to the container will update later.
@marktwtn I was not fully wake up. The --config x86_64 uses the toolchain provided by bazel and another is the toolchain provide by bootlin for x86-64-core-i7.

@marktwtn
Copy link
Author

@marktwtn I was not fully wake up. The --config x86_64 uses the toolchain provided by bazel and another is the toolchain provide by bootlin for x86-64-core-i7.

I wonder what is the main difference of these toolchains.

@marktwtn
Copy link
Author

I fixed a bug and refactored some code in the updated branch.

The bug was not detected since the input data size of trits_to_trytes and trytes_to_trits testing are not large enough.

Here comes with 2 questions:

  1. Should I add another test for larger input data size or just modify the input data size of the original test?

  2. The test invoked by buildkite CI still fails on my desktop.

    $ bazel test --build_tests_only --remote_upload_local_results=false
     --crosstool_top=@iota_toolchains//tools/x86-64-core-i7--glibc--bleeding-edge-2018.07-1:toolchain
     --cpu=x86_64 --host_crosstool_top=@bazel_tools//tools/cpp:toolchain -- //... -//mobile/ios/...
    

    What is the main difference between these and the other tests on buildkite CI?
    All the error logs show that

    external/bazel_tools/tools/test/test-setup.sh: line 310:    
    14 Segmentation fault      (core dumped) "${TEST_PATH}" "$@" 2>&1  
    

    and I don't know why.

@oopsmonk
Copy link

oopsmonk commented Dec 10, 2019

Here comes with 2 questions:

1. Should I add another test for larger input data size or just modify the input data size of the original test?

yes, please.

2. The test invoked by buildkite CI still fails on my desktop.
   ```
   $ bazel test --build_tests_only --remote_upload_local_results=false
    --crosstool_top=@iota_toolchains//tools/x86-64-core-i7--glibc--bleeding-edge-2018.07-1:toolchain
    --cpu=x86_64 --host_crosstool_top=@bazel_tools//tools/cpp:toolchain -- //... -//mobile/ios/...
   ```
   
   
   What is the main difference between these and the other tests on buildkite CI?
   All the error logs show that
   ```
   external/bazel_tools/tools/test/test-setup.sh: line 310:    
   14 Segmentation fault      (core dumped) "${TEST_PATH}" "$@" 2>&1  
   ```
   
   
   and I don't know why.

It's caused by mysql/mariadb which may not installed on your system. since you are working on common part, using bazel test -- //common/... -//mobile/ios/... would work on your system.

You can run binaries in the bazel-bin folder, you can see memory mismatch in test cases if using the bootlin toolchain.

$ ./bazel-bin/common/crypto/iss/v1/tests/test_iss
common/crypto/iss/v1/tests/test_iss.c:97:signature_resolves_to_address_test:PASS
common/crypto/iss/v1/tests/test_iss.c:89:address_generation_test:FAIL: Memory Mismatch. Byte 0 Expected 0x44 Was 0x5A

-----------------------
2 Tests 1 Failures 0 Ignored
FAIL

$ ./bazel-bin/common/helpers/tests/test_pow
common/helpers/tests/test_pow.c:62:test_pow:FAIL: Memory Mismatch. Byte 0 Expected 0x57 Was 0x39
common/helpers/tests/test_pow.c:92:test_flex_pow:PASS

-----------------------
2 Tests 1 Failures 0 Ignored 
FAIL

@marktwtn
Copy link
Author

I found another bug and fixed it in the updated branch.
Sorry for not noticing that there are other error logs showing the memory mismatching problem.

I also increased the input data size of trit tryte conversion for testing SIMD SSE4.2 acceleration.


After the bug is fixed, most of the errors disappear.
However, there are 2 tests that do not pass.

  • common/model/tests:test_transaction
  • common/model/tests:test_tryte_transaction

with the testing command

bazel test --build_tests_only --remote_upload_local_results=false
--crosstool_top=@iota_toolchains//tools/x86-64-core-i7--glibc--bleeding-edge-2018.07-1:toolchain
--cpu=x86_64 --host_crosstool_top=@bazel_tools//tools/cpp:toolchain -- //common/... -//mobile/ios/..

and error log

external/bazel_tools/tools/test/test-setup.sh: line 310:    
14 Segmentation fault      (core dumped) "${TEST_PATH}" "$@" 2>&1  

@oopsmonk
Copy link

you can test the test_transaction directly and since it gets Segmentation fault you can enable address sanitizer like: bazel test -c dbg --config asan --config bootlin_x86_64_core_i7 //common/model/tests:test_transaction.
You can just replace //common/model/tests:test_transaction for the other case.

here is output on my system:

Executing tests from //common/model/tests:test_transaction
-----------------------------------------------------------------------------
AddressSanitizer:DEADLYSIGNAL
=================================================================
==14==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x55bda7e9241f bp 0x7ffd8a2eaa90 sp 0x7ffd8a2ea1c0 T0)
==14==The signal is caused by a READ memory access.
==14==Hint: address points to the zero page.
    #0 0x55bda7e9241e in _mm_store_si128 external/bootlin_x86_64_toolchain/lib/gcc/x86_64-buildroot-linux-gnu/8.1.0/include/emmintrin.h:714
    #1 0x55bda7e9241e in trits_to_trytes_sse42 common/trinary/trit_tryte_sse42.h:109
    #2 0x55bda7e9605e in trits_to_trytes common/trinary/trit_tryte.c:52
    #3 0x55bda7e8e922 in flex_trits_from_trits common/trinary/flex_trit.c:228
    #4 0x55bda7e8c331 in transaction_deserialize_trits common/model/transaction.c:112
    #5 0x55bda7e8d0b2 in transaction_deserialize_from_trits common/model/transaction.c:291
    #6 0x55bda7e8cfdf in transaction_deserialize common/model/transaction.c:251
    #7 0x55bda7e89c4b in test_deserialize_and_serialize common/model/tests/test_transaction.c:20
    #8 0x55bda7e999d8 in UnityDefaultTestRun external/unity/src/unity.c:1339
    #9 0x55bda7e8b296 in main common/model/tests/test_transaction.c:169
    #10 0x7fa548905b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
    #11 0x55bda7e882d9 in _start (/home/sam/.cache/bazel/_bazel_sam/c8834895b12547221dbe6a364e6a723e/execroot/org_iota_entangled/bazel-out/x86_64-dbg/bin/common/model/tests/test_transaction+0x62d9)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV external/bootlin_x86_64_toolchain/lib/gcc/x86_64-buildroot-linux-gnu/8.1.0/include/emmintrin.h:714 in _mm_store_si128
==14==ABORTING

@oopsmonk
Copy link

These two cases are failed on system toolchain as well when I run bazel test -c dbg --copt=-msse4.2 //common/...

  • //common/model/tests:test_transaction
  • //common/model/tests:test_tryte_transaction

@marktwtn
Copy link
Author

If I change the intrinsic function from __mm_store_si128() to __mm_storeu_si128(), the errors will be fixed.
Even with the bootlin toolchain, every test is passed.

The main difference of the intrinsic function is the requirement of address alignment to 16-byte boundary.
However, I still need further investigation to give a better explanation of the importance of the address alignment.
After the investigation is finished, I will update the conclusion here and update the branch, too.

@oopsmonk thank you for your advise and debugging help.

@marktwtn
Copy link
Author

Conclusion:

The load and store intrinsic function are unified to the unalignment type since we can not control the input pointer value.
The value is based on 8-byte alignment, but the alignment requirement of _mm_store_si128() and _mm_load_si128() is 16-byte alignment.

These two cases are failed on system toolchain as well when I run bazel test -c dbg --copt=-msse4.2 //common/...

* //common/model/tests:test_transaction

* //common/model/tests:test_tryte_transaction

The input pointer value of these two cases are 8-byte alignment but not 16-byte alignment, which cause the _mm_store_si128() to have general-protection exception.

I believe the unalignment intrinsic function is a better choice.


The pull request can be merged without any problem.
Is there anything I need to revise?
Such as change the variable naming convention to snake case?

Copy link

@oopsmonk oopsmonk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use snake_case in variables for consistency.

@oopsmonk
Copy link

@marktwtn Could you write benchmarks for trits_to_trytes, trytes_to_trits and put it in //common/trinary/benchmark? thanks.

@marktwtn
Copy link
Author

@marktwtn Could you write benchmarks for trits_to_trytes, trytes_to_trits and put it in //common/trinary/benchmark? thanks.

It seems that there are no other benchmark in the other directories for me to reference.
I guess I will do it in my own way.
Or is there any guidance or rule to follow?

@oopsmonk
Copy link

@marktwtn Could you write benchmarks for trits_to_trytes, trytes_to_trits and put it in //common/trinary/benchmark? thanks.

It seems that there are no other benchmark in the other directories for me to reference.
I guess I will do it in my own way.
Or is there any guidance or rule to follow?

yep, printing out the time consumed by functions looks good, like DLTcollab/dcurl#92 (comment)

@oopsmonk
Copy link

oopsmonk commented Jan 6, 2020

Hey, @marktwtn how are you? do you need any help on Bazel or else?

@marktwtn
Copy link
Author

marktwtn commented Jan 6, 2020

Hey, @marktwtn how are you? do you need any help on Bazel or else?

@oopsmonk I found out that performance might not behave as I expected.

The input size and testing scenario would effect the performance result, hence I am still working on it.
If the size do effect the performance, I will modify the code to use the acceleration if the input is greater than a threshold size.

@marktwtn
Copy link
Author

@oopsmonk I will forced push later.

The modification includes:

marktwtn and others added 5 commits February 13, 2020 21:09
The acceleration is enabled when the compiler option -msse4.2 is used
and the input size of trit/tryte for conversion is larger than 384/128-bit.
128-bit is the basic operation unit of the most SSE instructions.

The implementation is complex but it does accelerate the conversion speed
with large input size.
In the original testing, the input trit/tryte data size are too small
to test the SIMD SSE4.2 trit tryte conversion acceleration.
trytes_to_trits minimum requirement:
128-bit = 16-tryte
trits_to_trytes minimum requirement:
384-bit = 48-trit
The benchmark displays the minimum, maximum and average value of trit tryte
conversion function of different input size.
The range of input size can be modified in bench_trit_tryte.c.
The default input/output tryte size range is 16 ~ 2048.
The threshold value is determined by the execution time difference.
The time difference should be at least 500 nano second.

The threshold experiment is run on the CPU:
AMD Ryzen 5 2400G with Radeon Vega Graphics.

TODO: trytes_to_trits() rarely slower when using SSE4.2 acceleration
with large input.
The trit tryte SSE4.2 accleration testing is added since the threshold
is added.
Otherwise, the acceleration would not be tested or the testing input
data need to be larger than the threshold value.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-common Area - Common C-perf Category -Performance S-ready-for-review Status - Ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants