
Conversation

@jedisct1 (Contributor) commented Oct 15, 2025

KT128 and KT256 are fast, secure cryptographic hash functions based on Keccak (SHA-3).

They can be seen as a modern version of SHA-3 and an evolution of SHAKE, with better performance.

After the SHA-3 competition, the Keccak team proposed these variants in 2016, and the constructions underwent 8 years of public scrutiny before being standardized in October 2025 as RFC 9861.

They use a tree-hashing mode on top of TurboSHAKE, providing both high security and excellent performance, especially on large inputs.
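
For context, the tree mode works roughly like this (a sketch of the RFC 9861 construction from memory, not the Zig implementation; B = 8192 bytes, and the chaining values are 32 bytes for KT128, 64 for KT256):

    S = message || customization || length_encode(|customization|)
    if |S| <= B: return TurboSHAKE(S, D=0x07, out_len)     // single node
    split S into B-byte chunks S_0 .. S_{n-1}
    CV_i = TurboSHAKE(S_i, D=0x0B, cv_len)                 // independent leaves
    node = S_0 || 0x03 || 0x00^7 || CV_1 || ... || CV_{n-1}
           || length_encode(n-1) || 0xFF || 0xFF
    return TurboSHAKE(node, D=0x06, out_len)

Because the leaf chaining values are independent, they can be computed across SIMD lanes and threads, which is where the large-input speedups come from.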

They support arbitrary-length output and optional customization strings.

Very large inputs can be hashed using multiple threads for high throughput.

KT128 provides 128-bit security strength, equivalent to AES-128 and SHAKE128, which is sufficient for virtually all applications.

KT256 provides 256-bit security strength.

For small inputs, TurboSHAKE128 and TurboSHAKE256 (on which KT128 and KT256 are built) can be used instead, as they have lower overhead.
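
A minimal one-shot usage sketch (the namespace and options below are assumptions on my part, following the conventions of the other std.crypto hash functions; see lib/std/crypto/kangarootwelve.zig for the actual API):

const std = @import("std");

// Assumed declaration path for the type added by this PR.
const KT128 = std.crypto.hash.kangarootwelve.KT128;

test "KT128 one-shot (sketch)" {
    // Arbitrary-length output: size the buffer as needed.
    var out: [32]u8 = undefined;
    KT128.hash("some input", &out, .{});
    // A customization string would go in the options struct;
    // the field name is omitted here because it is API-specific.
}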

@jedisct1 (Contributor, Author) commented:

Lazy copy/paste of the benchmarks from the parallel BLAKE3 PR:

Apple M1

==========================================================================================================
SUMMARY TABLE - All throughput values in MB/s
==========================================================================================================
Size         Chunks |      SHA256      BLAKE3  BLAKE3-Par  TurboSH128   KT128-Seq   KT128-Par
---------- -------- + ----------- ----------- ----------- ----------- ----------- -----------
64 B              1 |     1172.13      574.66       60.64      506.88      356.59      356.00
1 KB              1 |     2179.96      747.71      440.60     1262.64      598.69      598.90
8 KB              1 |     2294.30     1465.83     1241.32     1452.02     1337.35     1150.79
64 KB             8 |     2311.54     1508.80     1471.13     1464.32     1713.64     1665.33
1 MB            128 |     2309.67     1511.87     1504.40     1453.29     2572.95     2561.34
10 MB          1280 |     2307.63     1509.54     5229.25     1442.97     2626.67     9356.03
100 MB        12800 |     2310.07     1508.31     7632.71     1443.38     2643.04    12152.51
200 MB        25600 |     2311.98     1509.60     8419.46     1443.25     2601.17    13479.36
==========================================================================================================

AMD Zen4

==========================================================================================================
SUMMARY TABLE - All throughput values in MB/s
==========================================================================================================
Size         Chunks |      SHA256      BLAKE3  BLAKE3-Par  TurboSH128   KT128-Seq   KT128-Par
---------- -------- + ----------- ----------- ----------- ----------- ----------- -----------
64 B              1 |      878.24      523.53       97.42      395.50      293.91      295.34
1 KB              1 |     1486.90      720.74      521.66      931.96      477.53      478.87
8 KB              1 |     1553.39     3691.62     2924.73     1070.49      993.06      919.36
64 KB             8 |     1566.99     5020.52     4800.96     1075.00     1681.94     1656.39
1 MB            128 |     1565.47     5133.86     5113.38     1073.80     4219.76     4204.44
10 MB          1280 |     1561.68     5120.92     9344.03     1074.22     4627.68    11656.27
100 MB        12800 |     1563.46     3481.63    14390.99     1074.64     4560.84    24914.64
200 MB        25600 |     1563.00     3380.68    16670.07     1075.43     4557.86    26870.09
==========================================================================================================

On these machines, KT128 and KT256 are the fastest cryptographic hash functions for large inputs.

However, combining tree hashing with threads can degrade the performance of other applications running concurrently.

@jedisct1 (Contributor, Author) commented:

Let's wait for #25592 to land.

jedisct1 enabled auto-merge (squash) on Nov 1, 2025 at 07:00
jedisct1 merged commit 95c76b1 into ziglang:master on Nov 1, 2025
9 checks passed
@jacobly0 (Member) commented Nov 1, 2025

These tests are failing randomly on aarch64.

@jedisct1 (Contributor, Author) commented Nov 2, 2025

Maybe bf90825 fixes this?

Do you have a CI job where this failure occurred? I ran it locally for 8 hours straight and didn’t see any failing tests.

@jacobly0 (Member) commented Nov 2, 2025

While the crashes appear not to be your fault (the current working theory is a qemu bug triggered by the less common instructions required to use a very large stack frame), I believe the fix is just an all-around improvement.

Suggested fix:

--- a/lib/std/crypto/kangarootwelve.zig
+++ b/lib/std/crypto/kangarootwelve.zig
@@ -230,7 +230,7 @@ fn keccakP1600timesN(comptime N: usize, states: *[5][5]@Vector(N, u64)) void {
         break :blk offsets;
     };
 
-    inline for (RC) |rc| {
+    for (&RC) |rc| {
         // θ (theta)
         var C: [5]@Vector(N, u64) = undefined;
         inline for (0..5) |x| {

Effect on stack frame size:
x86_64 Debug: 105-156KB → 8-13KB
aarch64 Debug: 92KB → 9KB
x86_64 ReleaseFast: 128-256B → 64-128B
aarch64 ReleaseFast: 480B → 96B

Effect on runtime performance (in each pair below, the first block is before the change and the second is after):
x86_64 Debug:

  blake3-parallel:       1658 MiB/s
   kt128-parallel:       1097 MiB/s
   kt256-parallel:        765 MiB/s

  blake3-parallel:       1633 MiB/s
   kt128-parallel:       1296 MiB/s
   kt256-parallel:        869 MiB/s

aarch64 Debug:

  blake3-parallel:       1759 MiB/s
   kt128-parallel:        431 MiB/s
   kt256-parallel:        257 MiB/s

  blake3-parallel:       1790 MiB/s
   kt128-parallel:        533 MiB/s
   kt256-parallel:        296 MiB/s

x86_64 ReleaseFast (within run-to-run variance):

  blake3-parallel:      24766 MiB/s
   kt128-parallel:      29385 MiB/s
   kt256-parallel:      25808 MiB/s

  blake3-parallel:      25104 MiB/s
   kt128-parallel:      30526 MiB/s
   kt256-parallel:      25975 MiB/s

aarch64 ReleaseFast (within run-to-run variance):

  blake3-parallel:      27180 MiB/s
   kt128-parallel:      19073 MiB/s
   kt256-parallel:      13376 MiB/s

  blake3-parallel:      28696 MiB/s
   kt128-parallel:      20162 MiB/s
   kt256-parallel:      13841 MiB/s

My analysis of why removing inline improves runtime speed is that the reduced code size and stack usage improve cache utilization, and the extra instructions required for loop bookkeeping are more than hidden by the long-latency instructions within the loop, which run on a different execution unit. I'll let you make the change yourself, so that you can verify that my benchmark results are reproducible and check whether you made other similar inline pessimizations.
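
For readers less familiar with Zig: `inline for` unrolls the loop at compile time, instantiating the body once per round constant (TurboSHAKE uses 12 rounds), while `for (&RC)` emits a single copy of the body plus loop bookkeeping. A standalone illustration of the two forms (toy constants, not the actual kangarootwelve.zig code):

const std = @import("std");

// Toy stand-ins for the real round constants.
const RC = [_]u64{ 0x01, 0x8082, 0x800a, 0x8081 };

fn unrolled(state: *u64) void {
    // `inline for` duplicates the body per element at compile time;
    // with a large vectorized body this inflates code size and, in
    // Debug builds, the stack frame.
    inline for (RC) |rc| state.* ^= rc;
}

fn rolled(state: *u64) void {
    // A runtime `for` over a pointer-to-array keeps one copy of the
    // body; `rc` is loaded on each iteration instead.
    for (&RC) |rc| state.* ^= rc;
}

test "both forms compute the same result" {
    var a: u64 = 0;
    var b: u64 = 0;
    unrolled(&a);
    rolled(&b);
    try std.testing.expectEqual(a, b);
}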


So far I have not seen this happen on CI, but I was able to reproduce it ~10 times in keccakP1600timesN yesterday. After applying the above change, I have so far only managed to reproduce it in keccakP instead, which appears to follow the same pattern. The latest repro was with the following command on an overloaded non-aarch64 system (three other non-filtered test-std commands running concurrently).

$ zig build test-std -fqemu --libc-runtimes ../libc -Dtest-target-filter=aarch64 -Dtest-filter=kangarootwelve

@jacobly0 (Member) commented Nov 3, 2025

Well, I just got this; it doesn't seem related to the other bug I've been debugging for two days.

└─ run test std-mipsel-linux-musleabi-mips32r2-Debug-libc 2922 pass, 71 skip, 1 fail (2994 total)
error: 'crypto.kangarootwelve.test.KT256 sequential and parallel produce same output for large inputs' failed:
       slices differ. first difference occurs at index 0 (0x0)
       
       ============ expected this output: =============  len: 64 (0x40)
       
       3D 59 C5 52 70 78 DB 85  67 5B 16 56 46 D8 AB 81  =Y.Rpx..g[.VF...
       5F DC 78 8F EE 18 88 EA  06 F0 42 81 02 F1 48 E4  _.x.......B...H.
       74 C1 17 A4 B1 38 90 B1  A1 84 33 10 89 9E 05 3D  t....8....3....=
       8C 86 31 15 10 B4 05 C4  73 94 93 78 59 65 A8 0B  ..1.....s..xYe..
       
       ============= instead found this: ==============  len: 64 (0x40)
       
       7E 25 23 C7 FA 00 06 83  38 8B 71 EF 0E 7B 98 27  ~%#.....8.q..{.'
       CD 5A 39 42 37 E0 28 9F  D4 54 81 0D 35 FA C5 F1  .Z9B7.(..T.␍5...
       9E E9 68 1D 0F 8C 68 B7  BD F1 85 26 10 1A 4D 64  ..h...h....&..Md
       36 21 9C 8F 43 47 68 AD  0E 75 F9 FE AE 3C 3A 56  6!..CGh..u...<:V
       
       ================================================

@jedisct1 (Contributor, Author) commented Nov 3, 2025

Thanks Jacob!

Unrolling twice seems to provide the best performance on both x86_64 and aarch64, and it matches what we already do for regular SHA-3.
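
For reference, a sketch of the 2x-unrolled pattern (illustrative stand-in round function and constants, not the actual kangarootwelve.zig code):

const std = @import("std");

// Placeholder constants; the real permutation iterates the 12
// TurboSHAKE round constants.
const RC = [_]u64{ 0x01, 0x8082, 0x800a, 0x8081 };

fn round(state: *u64, rc: u64) void {
    // Stand-in for one Keccak-p round.
    state.* = std.math.rotl(u64, state.* ^ rc, 1);
}

fn permute(state: *u64) void {
    var i: usize = 0;
    // Two rounds per iteration: halves the loop bookkeeping without
    // the code-size blowup of fully unrolling all rounds.
    while (i < RC.len) : (i += 2) {
        round(state, RC[i]);
        round(state, RC[i + 1]);
    }
}

test "permute runs" {
    var s: u64 = 0;
    permute(&s);
    try std.testing.expect(s != 0);
}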
