crypto.sha3: rewrite and optimize kaccak_p_1600_24() engine, update tests #26524

Open
tankf33der wants to merge 1 commit into vlang:master from tankf33der:sha3

@tankf33der
Contributor

I finally want to show the patch for accelerating sha3 performance.
This is roughly the fourth generation of a patch that came out of several weeks of development and fun.
It started as a patch with a 10% speedup and ended up as a multi-fold speedup for both tcc and gcc.

If you profile my standard file for sha3 performance testing, you can see multiple function calls inside each round. Once I had conquered that, the rest was just a matter of technique.

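```v
import crypto.sha3
import time

fn main() {
	// 10 MB of zero bytes as the benchmark input
	a := []u8{len: 10_000_000}
	t1 := time.now()
	_ := sha3.sum512(a)
	// elapsed time for a single SHA3-512 pass over the input
	println(time.since(t1))
}
```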

Profiler output before the patch (columns: calls, total time, self time, average time per call, function):

```
        138889         93.624ms         46.706ms            674ns crypto__sha3__State_xor_bytes
       1250001         46.917ms         46.917ms             38ns encoding__binary__little_endian_u64_at
       3333336         83.607ms         83.607ms             25ns crypto__sha3__State_iota
        138889       8219.910ms        101.634ms          59183ns crypto__sha3__State_kaccak_p_1600_24
       3333336        522.927ms        522.927ms            157ns crypto__sha3__State_pi
       3333336       8118.276ms        556.868ms           2435ns crypto__sha3__State_rnd
       3333336        684.678ms        684.678ms            205ns crypto__sha3__State_chi
       3333336       1454.097ms       1026.246ms            436ns crypto__sha3__State_theta
     100000080       2475.980ms       2475.980ms             25ns math__bits__rotate_left_64
       3333336       4816.100ms       2767.971ms           1445ns crypto__sha3__State_rho
```

Even where the compiler does inline these calls, they still turn out to be costly.
Besides, the official site suggests merging several of the step functions into one, after which the separate functions are not needed at all.
The latest generation of the patch simply unrolls the loops and makes them cheaper.
It took some tinkering.
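
To illustrate the kind of unrolling described above, here is a minimal sketch of the theta step written twice: once with nested loops and modulo indexing, once with the column parities and index arithmetic written out by hand. This is not the code from the patch. The flat `[25]u64` layout, the `State` struct and the names `theta_looped`, `theta_unrolled`, `demo_state` and `same_lanes` are all assumptions made for the example.

```v
import math.bits

// Hypothetical flat state layout: lane (x, y) lives at index x + 5*y.
struct State {
mut:
	a [25]u64
}

// theta_looped is the textbook form: nested loops and modulo indexing.
fn (mut s State) theta_looped() {
	mut c := [5]u64{}
	for x in 0 .. 5 {
		c[x] = s.a[x] ^ s.a[x + 5] ^ s.a[x + 10] ^ s.a[x + 15] ^ s.a[x + 20]
	}
	for x in 0 .. 5 {
		d := c[(x + 4) % 5] ^ bits.rotate_left_64(c[(x + 1) % 5], 1)
		for y in 0 .. 5 {
			s.a[x + 5 * y] ^= d
		}
	}
}

// theta_unrolled computes the same result with the column loops written out,
// so no temporary array and no modulo arithmetic is needed.
fn (mut s State) theta_unrolled() {
	c0 := s.a[0] ^ s.a[5] ^ s.a[10] ^ s.a[15] ^ s.a[20]
	c1 := s.a[1] ^ s.a[6] ^ s.a[11] ^ s.a[16] ^ s.a[21]
	c2 := s.a[2] ^ s.a[7] ^ s.a[12] ^ s.a[17] ^ s.a[22]
	c3 := s.a[3] ^ s.a[8] ^ s.a[13] ^ s.a[18] ^ s.a[23]
	c4 := s.a[4] ^ s.a[9] ^ s.a[14] ^ s.a[19] ^ s.a[24]
	d0 := c4 ^ bits.rotate_left_64(c1, 1)
	d1 := c0 ^ bits.rotate_left_64(c2, 1)
	d2 := c1 ^ bits.rotate_left_64(c3, 1)
	d3 := c2 ^ bits.rotate_left_64(c4, 1)
	d4 := c3 ^ bits.rotate_left_64(c0, 1)
	// One pass per row; the five columns are handled explicitly.
	for i := 0; i < 25; i += 5 {
		s.a[i] ^= d0
		s.a[i + 1] ^= d1
		s.a[i + 2] ^= d2
		s.a[i + 3] ^= d3
		s.a[i + 4] ^= d4
	}
}

// demo_state fills the lanes with arbitrary but distinct test values.
fn demo_state() State {
	mut s := State{}
	for i in 0 .. 25 {
		s.a[i] = bits.rotate_left_64(u64(0x0123456789abcdef), i) ^ u64(i)
	}
	return s
}

// same_lanes compares two states lane by lane.
fn same_lanes(p State, q State) bool {
	for i in 0 .. 25 {
		if p.a[i] != q.a[i] {
			return false
		}
	}
	return true
}

mut looped := demo_state()
mut unrolled := demo_state()
looped.theta_looped()
unrolled.theta_unrolled()
// Both variants must produce identical states.
assert same_lanes(looped, unrolled)
```

The same approach could in principle be applied to the other steps, and adjacent steps could be merged so that intermediate values stay in local variables, but how far that pays off depends on the compiler.
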
I have my own tests with full coverage, using files of test vectors and comparisons against openssl, so I'm not worried.
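
As a rough idea of what a test-vector check can look like (this is a sketch, not the author's test suite), the following compares `sha3.sum256` against two widely published SHA3-256 digests, for the empty message and for "abc". The helper `sha3_256_hex` and the constant names are hypothetical.

```v
import crypto.sha3

// Published SHA3-256 digests for the empty message and for "abc".
const empty_sha3_256 = 'a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a'
const abc_sha3_256 = '3a985da74fe225b2045c172d6bd390bd855f086e3e9d525b46bfe24511431532'

// sha3_256_hex is a hypothetical helper: hash the message, return lowercase hex.
fn sha3_256_hex(msg []u8) string {
	return sha3.sum256(msg).hex()
}

assert sha3_256_hex([]u8{}) == empty_sha3_256
assert sha3_256_hex('abc'.bytes()) == abc_sha3_256
```

Because checks like these only touch the public API, they keep working when the internal step functions are merged or removed.
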

Now the profiler shows normal metrics:

```
             2          0.010ms          0.010ms           5018ns builtin___write_buf_to_fd
             2          0.010ms          0.010ms           5174ns builtin___v_malloc
             2          0.019ms          0.017ms           9376ns time__linux_now
             6          1.239ms          1.239ms         206538ns builtin__vcalloc_noscan
        277779         10.739ms         10.739ms             39ns builtin__array_slice
             1       5363.798ms         18.982ms     5363798118ns crypto__sha3__Digest_write
        138889         91.799ms         45.508ms            661ns crypto__sha3__State_xor_bytes
       1250001         46.292ms         46.292ms             37ns encoding__binary__little_endian_u64_at
      96666744       2336.159ms       2336.159ms             24ns math__bits__rotate_left_64
        138889       5242.316ms       2906.158ms          37745ns crypto__sha3__State_kaccak_p_1600_24
```

I had to drop some tests because they became impossible: the code they relied on no longer exists.

Speedup: about 4.5x or more with tcc, about 3x or more with gcc.

@tankf33der
Contributor Author

@blackshirt, please take a look. Of course, I've tested it with your pslhdsa implementation.

@tankf33der
Contributor Author

@kimshrier, please take a look. What do you think?