chacha20: Improve AVX2 performance #261

Merged 3 commits on Aug 9, 2021

str4d commented Aug 9, 2021

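  • A number of instructions for accessing the 128-bit lanes have been replaced by a union (backend::avx2::StateWord).
  • I've implemented a widely used ChaCha optimisation (which I spotted in c2-chacha): pivoting the diagonalization so that the shuffles stay off the `b` state word.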
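
To illustrate the first point, here is a minimal portable sketch of the union idea. It is not the code from this PR: the field names, the helper methods, and the use of plain `u32` arrays in place of SIMD vector types are all assumptions made so the example runs on any target. The point it shows is that a union lets the same 256 bits be read either as one wide value or as two 128-bit lanes without a dedicated extract or insert step.

```rust
/// Portable stand-in for a 256-bit state word (hypothetical layout).
/// `wide` views all eight 32-bit words at once; `lanes` views the same
/// bytes as two 128-bit halves.
#[derive(Clone, Copy)]
#[repr(C)]
union StateWord {
    wide: [u32; 8],
    lanes: [[u32; 4]; 2],
}

impl StateWord {
    /// Build a state word from its low and high 128-bit lanes.
    fn from_lanes(lo: [u32; 4], hi: [u32; 4]) -> Self {
        StateWord { lanes: [lo, hi] }
    }

    /// Read the full 256-bit view.
    fn as_wide(self) -> [u32; 8] {
        // SAFETY: both fields are padding-free arrays of `u32` with the
        // same size, so every bit pattern is valid for either view.
        unsafe { self.wide }
    }

    /// Read the two 128-bit lanes.
    fn as_lanes(self) -> [[u32; 4]; 2] {
        // SAFETY: same reasoning as `as_wide`.
        unsafe { self.lanes }
    }
}
```

Reading a union field is `unsafe` in Rust, which is why the accessors are wrapped in small methods with a safety comment.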

This removes a bunch of instructions for accessing the 128-bit lanes.
The `b` state word is on the hot path, so we pivot the diagonalization
to move the shuffles onto the other state words. See the code comment,
or sneves/blake2-avx2#4 for additional details.
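
The pivot can be sketched in scalar Rust. This is an illustration under stated assumptions, not the AVX2 code from this PR: rows are modelled as `[u32; 4]` arrays, and `rot_lanes`, `quarter_rounds`, `double_round_classic`, and `double_round_pivoted` are hypothetical names. In the quarter round, `b` is the last word written and the first word read by the next quarter round, so leaving `b` in place and rotating `a`, `c`, and `d` instead groups the same diagonals while keeping shuffles away from that dependency.

```rust
type Row = [u32; 4];

/// Rotate the lanes of a row left by `n`: out[i] = row[(i + n) % 4].
fn rot_lanes(row: Row, n: usize) -> Row {
    [row[n % 4], row[(1 + n) % 4], row[(2 + n) % 4], row[(3 + n) % 4]]
}

/// Apply the ChaCha quarter round to each of the four lanes.
fn quarter_rounds(s: &mut [Row; 4]) {
    for i in 0..4 {
        let (mut a, mut b, mut c, mut d) = (s[0][i], s[1][i], s[2][i], s[3][i]);
        a = a.wrapping_add(b); d = (d ^ a).rotate_left(16);
        c = c.wrapping_add(d); b = (b ^ c).rotate_left(12);
        a = a.wrapping_add(b); d = (d ^ a).rotate_left(8);
        c = c.wrapping_add(d); b = (b ^ c).rotate_left(7);
        s[0][i] = a; s[1][i] = b; s[2][i] = c; s[3][i] = d;
    }
}

/// Textbook double round: `a` stays put; `b`, `c`, `d` rotate by 1, 2, 3.
fn double_round_classic(mut s: [Row; 4]) -> [Row; 4] {
    quarter_rounds(&mut s); // column round
    s[1] = rot_lanes(s[1], 1);
    s[2] = rot_lanes(s[2], 2);
    s[3] = rot_lanes(s[3], 3);
    quarter_rounds(&mut s); // diagonal round
    s[1] = rot_lanes(s[1], 3); // undo the diagonalization
    s[2] = rot_lanes(s[2], 2);
    s[3] = rot_lanes(s[3], 1);
    s
}

/// Pivoted double round: `b` stays put; `a`, `c`, `d` rotate by 3, 1, 2.
fn double_round_pivoted(mut s: [Row; 4]) -> [Row; 4] {
    quarter_rounds(&mut s); // column round
    s[0] = rot_lanes(s[0], 3);
    s[2] = rot_lanes(s[2], 1);
    s[3] = rot_lanes(s[3], 2);
    quarter_rounds(&mut s); // diagonal round, same groupings as above
    s[0] = rot_lanes(s[0], 1); // undo the diagonalization
    s[2] = rot_lanes(s[2], 3);
    s[3] = rot_lanes(s[3], 2);
    s
}
```

Both variants feed the same four (a, b, c, d) groups into the diagonal round, only in a different lane order, so they return identical states.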

str4d commented Aug 9, 2021

Ran `cargo +nightly bench -p chacha20` in WSL2 on my desktop (i7-8700K overclocked to 4.8GHz).

$ cargo +nightly --version
cargo 1.55.0-nightly (d21c22870 2021-07-26)

Current master:

     Running unittests (target/release/deps/chacha12-f20b37e08443492e)
test bench1_10     ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test bench2_100    ... bench:          52 ns/iter (+/- 1) = 1923 MB/s
test bench3_1000   ... bench:         739 ns/iter (+/- 53) = 1353 MB/s
test bench4_10000  ... bench:       7,576 ns/iter (+/- 277) = 1319 MB/s
test bench5_100000 ... bench:      76,180 ns/iter (+/- 3,424) = 1312 MB/s

     Running unittests (target/release/deps/chacha20-5391fc4cd79914d1)
test bench1_10     ... bench:          12 ns/iter (+/- 0) = 833 MB/s
test bench2_100    ... bench:          80 ns/iter (+/- 7) = 1250 MB/s
test bench3_1000   ... bench:         952 ns/iter (+/- 26) = 1050 MB/s
test bench4_10000  ... bench:       9,765 ns/iter (+/- 407) = 1024 MB/s
test bench5_100000 ... bench:      97,608 ns/iter (+/- 3,413) = 1024 MB/s

     Running unittests (target/release/deps/chacha8-58d30d30e94644a2)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          41 ns/iter (+/- 2) = 2439 MB/s
test bench3_1000   ... bench:         625 ns/iter (+/- 21) = 1600 MB/s
test bench4_10000  ... bench:       6,495 ns/iter (+/- 217) = 1539 MB/s
test bench5_100000 ... bench:      65,487 ns/iter (+/- 3,186) = 1527 MB/s

After introducing backend::avx2::StateWord union:

     Running unittests (target/release/deps/chacha12-f20b37e08443492e)
test bench1_10     ... bench:          10 ns/iter (+/- 0) = 1000 MB/s
test bench2_100    ... bench:          50 ns/iter (+/- 1) = 2000 MB/s
test bench3_1000   ... bench:         411 ns/iter (+/- 22) = 2433 MB/s
test bench4_10000  ... bench:       4,053 ns/iter (+/- 207) = 2467 MB/s
test bench5_100000 ... bench:      39,887 ns/iter (+/- 1,633) = 2507 MB/s

     Running unittests (target/release/deps/chacha20-5391fc4cd79914d1)
test bench1_10     ... bench:          12 ns/iter (+/- 0) = 833 MB/s
test bench2_100    ... bench:          74 ns/iter (+/- 1) = 1351 MB/s
test bench3_1000   ... bench:         636 ns/iter (+/- 38) = 1572 MB/s
test bench4_10000  ... bench:       6,189 ns/iter (+/- 690) = 1615 MB/s
test bench5_100000 ... bench:      61,537 ns/iter (+/- 2,681) = 1625 MB/s

     Running unittests (target/release/deps/chacha8-58d30d30e94644a2)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          38 ns/iter (+/- 2) = 2631 MB/s
test bench3_1000   ... bench:         303 ns/iter (+/- 12) = 3300 MB/s
test bench4_10000  ... bench:       2,910 ns/iter (+/- 98) = 3436 MB/s
test bench5_100000 ... bench:      28,850 ns/iter (+/- 1,079) = 3466 MB/s

After diagonalization optimization:

     Running unittests (target/release/deps/chacha12-a4a957c0b61fe24c)
test bench1_10     ... bench:          10 ns/iter (+/- 0) = 1000 MB/s
test bench2_100    ... bench:          49 ns/iter (+/- 3) = 2040 MB/s
test bench3_1000   ... bench:         391 ns/iter (+/- 14) = 2557 MB/s
test bench4_10000  ... bench:       3,799 ns/iter (+/- 95) = 2632 MB/s
test bench5_100000 ... bench:      37,872 ns/iter (+/- 1,195) = 2640 MB/s

     Running unittests (target/release/deps/chacha20-cb419216b50db2c1)
test bench1_10     ... bench:          12 ns/iter (+/- 0) = 833 MB/s
test bench2_100    ... bench:          72 ns/iter (+/- 3) = 1388 MB/s
test bench3_1000   ... bench:         599 ns/iter (+/- 21) = 1669 MB/s
test bench4_10000  ... bench:       5,858 ns/iter (+/- 308) = 1707 MB/s
test bench5_100000 ... bench:      58,155 ns/iter (+/- 2,432) = 1719 MB/s

     Running unittests (target/release/deps/chacha8-276d7f24b799657c)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          37 ns/iter (+/- 1) = 2702 MB/s
test bench3_1000   ... bench:         291 ns/iter (+/- 12) = 3436 MB/s
test bench4_10000  ... bench:       2,792 ns/iter (+/- 76) = 3581 MB/s
test bench5_100000 ... bench:      27,780 ns/iter (+/- 1,477) = 3599 MB/s
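
For comparing the runs above, a small sketch of the arithmetic may help. The helper names are hypothetical, and the bytes-per-iteration figure is assumed to match the number in each benchmark's name (for example 100,000 bytes for `bench5_100000`). Applied to the `bench5_100000` rows, the ratio of "Current master" to "After diagonalization optimization" comes out at roughly 2.01x for chacha12, 1.68x for chacha20, and 2.36x for chacha8.

```rust
/// Throughput in MB/s from a byte count and a time in ns per iteration,
/// truncated to an integer as in the benchmark output above.
fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter
}

/// Speedup factor between two timings of the same workload.
fn speedup(before_ns: u64, after_ns: u64) -> f64 {
    before_ns as f64 / after_ns as f64
}
```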

tarcieri merged commit 99577d6 into RustCrypto:master Aug 9, 2021
str4d deleted the chacha20-backend-perf branch August 9, 2021 00:44