
Conversation

@jedisct1 (Contributor) commented Oct 15, 2025

KT128 and KT256 are fast, secure cryptographic hash functions based on Keccak (SHA-3).

They can be seen as a modern version of SHA-3 and an evolution of SHAKE, with better performance.

After the SHA-3 competition, the Keccak team proposed these variants in 2016, and the constructions underwent 8 years of public scrutiny before being standardized in October 2025 as RFC 9861.

They use a tree-hashing mode on top of TurboSHAKE, providing both high security and excellent performance, especially on large inputs.
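
For context, the tree mode works roughly like this (a sketch of the RFC 9861 construction from memory, not the Zig implementation; B = 8192 bytes, and the chaining values are 32 bytes for KT128, 64 for KT256):

    S = message || customization || length_encode(|customization|)
    if |S| <= B: return TurboSHAKE(S, D=0x07, out_len)     // single node
    split S into B-byte chunks S_0 .. S_{n-1}
    CV_i = TurboSHAKE(S_i, D=0x0B, cv_len)                 // independent leaves
    node = S_0 || 0x03 || 0x00^7 || CV_1 || ... || CV_{n-1}
           || length_encode(n-1) || 0xFF || 0xFF
    return TurboSHAKE(node, D=0x06, out_len)

Because the leaf chaining values are independent, they can be computed across SIMD lanes and threads, which is where the large-input speedups come from.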

They support arbitrary-length output and optional customization strings.

Very large inputs can be hashed using multiple threads for high throughput.

KT128 provides 128-bit security strength, equivalent to AES-128 and SHAKE128, which is sufficient for virtually all applications.

KT256 provides 256-bit security strength.

For small inputs, TurboSHAKE128 and TurboSHAKE256 (on which KT128 and KT256 are built) can be used instead, as they have lower overhead.
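
A minimal one-shot usage sketch (the namespace and options below are assumptions on my part, following the conventions of the other std.crypto hash functions; see lib/std/crypto/kangarootwelve.zig for the actual API):

const std = @import("std");

// Assumed declaration path for the type added by this PR.
const KT128 = std.crypto.hash.kangarootwelve.KT128;

test "KT128 one-shot (sketch)" {
    // Arbitrary-length output: size the buffer as needed.
    var out: [32]u8 = undefined;
    KT128.hash("some input", &out, .{});
    // A customization string would go in the options struct;
    // the field name is omitted here because it is API-specific.
}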

@jedisct1 (Contributor, Author) commented:

Lazy copy/paste of the benchmarks from the parallel BLAKE3 PR:

Apple M1

==========================================================================================================
SUMMARY TABLE - All throughput values in MB/s
==========================================================================================================
Size         Chunks |      SHA256      BLAKE3  BLAKE3-Par  TurboSH128   KT128-Seq   KT128-Par
---------- -------- + ----------- ----------- ----------- ----------- ----------- -----------
64 B              1 |     1172.13      574.66       60.64      506.88      356.59      356.00
1 KB              1 |     2179.96      747.71      440.60     1262.64      598.69      598.90
8 KB              1 |     2294.30     1465.83     1241.32     1452.02     1337.35     1150.79
64 KB             8 |     2311.54     1508.80     1471.13     1464.32     1713.64     1665.33
1 MB            128 |     2309.67     1511.87     1504.40     1453.29     2572.95     2561.34
10 MB          1280 |     2307.63     1509.54     5229.25     1442.97     2626.67     9356.03
100 MB        12800 |     2310.07     1508.31     7632.71     1443.38     2643.04    12152.51
200 MB        25600 |     2311.98     1509.60     8419.46     1443.25     2601.17    13479.36
==========================================================================================================

AMD Zen4

==========================================================================================================
SUMMARY TABLE - All throughput values in MB/s
==========================================================================================================
Size         Chunks |      SHA256      BLAKE3  BLAKE3-Par  TurboSH128   KT128-Seq   KT128-Par
---------- -------- + ----------- ----------- ----------- ----------- ----------- -----------
64 B              1 |      878.24      523.53       97.42      395.50      293.91      295.34
1 KB              1 |     1486.90      720.74      521.66      931.96      477.53      478.87
8 KB              1 |     1553.39     3691.62     2924.73     1070.49      993.06      919.36
64 KB             8 |     1566.99     5020.52     4800.96     1075.00     1681.94     1656.39
1 MB            128 |     1565.47     5133.86     5113.38     1073.80     4219.76     4204.44
10 MB          1280 |     1561.68     5120.92     9344.03     1074.22     4627.68    11656.27
100 MB        12800 |     1563.46     3481.63    14390.99     1074.64     4560.84    24914.64
200 MB        25600 |     1563.00     3380.68    16670.07     1075.43     4557.86    26870.09
==========================================================================================================

On these machines, KT128 and KT256 are the fastest cryptographic hash functions for large inputs.

However, combining tree hashing with threads can degrade the performance of other applications running concurrently.

@jedisct1 (Contributor, Author) commented:

Let's wait for #25592 to land.

jedisct1 enabled auto-merge (squash) on Nov 1, 2025 at 07:00
jedisct1 merged commit 95c76b1 into ziglang:master on Nov 1, 2025
9 checks passed
@jacobly0 (Member) commented Nov 1, 2025

These tests are failing randomly on aarch64.

@jedisct1 (Contributor, Author) commented Nov 2, 2025

Maybe bf90825 fixes this?

Do you have a CI job where this failure occurred? I ran it locally for 8 hours straight and didn’t see any failing tests.

@jacobly0 (Member) commented Nov 2, 2025

While the crashes appear not to be your fault (the current working theory is a qemu bug triggered by the less common instructions required to use a very large stack frame), I believe the fix is just an all-around improvement.

Suggested fix:

--- a/lib/std/crypto/kangarootwelve.zig
+++ b/lib/std/crypto/kangarootwelve.zig
@@ -230,7 +230,7 @@ fn keccakP1600timesN(comptime N: usize, states: *[5][5]@Vector(N, u64)) void {
         break :blk offsets;
     };
 
-    inline for (RC) |rc| {
+    for (&RC) |rc| {
         // θ (theta)
         var C: [5]@Vector(N, u64) = undefined;
         inline for (0..5) |x| {

Effect on stack frame size:
x86_64 Debug: 105-156KB → 8-13KB
aarch64 Debug: 92KB → 9KB
x86_64 ReleaseFast: 128-256B → 64-128B
aarch64 ReleaseFast: 480B → 96B

Effect on runtime performance (in each pair below, the first block is before the change and the second is after):
x86_64 Debug:

  blake3-parallel:       1658 MiB/s
   kt128-parallel:       1097 MiB/s
   kt256-parallel:        765 MiB/s

  blake3-parallel:       1633 MiB/s
   kt128-parallel:       1296 MiB/s
   kt256-parallel:        869 MiB/s

aarch64 Debug:

  blake3-parallel:       1759 MiB/s
   kt128-parallel:        431 MiB/s
   kt256-parallel:        257 MiB/s

  blake3-parallel:       1790 MiB/s
   kt128-parallel:        533 MiB/s
   kt256-parallel:        296 MiB/s

x86_64 ReleaseFast (within run-to-run variance):

  blake3-parallel:      24766 MiB/s
   kt128-parallel:      29385 MiB/s
   kt256-parallel:      25808 MiB/s

  blake3-parallel:      25104 MiB/s
   kt128-parallel:      30526 MiB/s
   kt256-parallel:      25975 MiB/s

aarch64 ReleaseFast (within run-to-run variance):

  blake3-parallel:      27180 MiB/s
   kt128-parallel:      19073 MiB/s
   kt256-parallel:      13376 MiB/s

  blake3-parallel:      28696 MiB/s
   kt128-parallel:      20162 MiB/s
   kt256-parallel:      13841 MiB/s

My analysis of why removing inline improves runtime speed is that the reduced code size and stack usage improve cache utilization, and the extra instructions required for loop bookkeeping are more than hidden by the long-latency instructions within the loop, which run on a different execution unit. I'll let you make the change yourself, so that you can verify that my benchmark results are reproducible and check whether you made other similar inline pessimizations.
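
For readers less familiar with Zig: `inline for` unrolls the loop at compile time, instantiating the body once per round constant (TurboSHAKE uses 12 rounds), while `for (&RC)` emits a single copy of the body plus loop bookkeeping. A standalone illustration of the two forms (toy constants, not the actual kangarootwelve.zig code):

const std = @import("std");

// Toy stand-ins for the real round constants.
const RC = [_]u64{ 0x01, 0x8082, 0x800a, 0x8081 };

fn unrolled(state: *u64) void {
    // `inline for` duplicates the body per element at compile time;
    // with a large vectorized body this inflates code size and, in
    // Debug builds, the stack frame.
    inline for (RC) |rc| state.* ^= rc;
}

fn rolled(state: *u64) void {
    // A runtime `for` over a pointer-to-array keeps one copy of the
    // body; `rc` is loaded on each iteration instead.
    for (&RC) |rc| state.* ^= rc;
}

test "both forms compute the same result" {
    var a: u64 = 0;
    var b: u64 = 0;
    unrolled(&a);
    rolled(&b);
    try std.testing.expectEqual(a, b);
}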


So far I have not seen this happen on CI, but I was able to reproduce it ~10 times in keccakP1600timesN yesterday. After applying the above change, I have so far only managed to reproduce it in keccakP instead, which appears to follow the same pattern. The latest repro was with the following command on an overloaded non-aarch64 system (three other non-filtered test-std commands running concurrently).

$ zig build test-std -fqemu --libc-runtimes ../libc -Dtest-target-filter=aarch64 -Dtest-filter=kangarootwelve

@jacobly0 (Member) commented Nov 3, 2025

Well, I just got this; it doesn't seem related to the other bug I've been debugging for two days.

└─ run test std-mipsel-linux-musleabi-mips32r2-Debug-libc 2922 pass, 71 skip, 1 fail (2994 total)
error: 'crypto.kangarootwelve.test.KT256 sequential and parallel produce same output for large inputs' failed:
       slices differ. first difference occurs at index 0 (0x0)
       
       ============ expected this output: =============  len: 64 (0x40)
       
       3D 59 C5 52 70 78 DB 85  67 5B 16 56 46 D8 AB 81  =Y.Rpx..g[.VF...
       5F DC 78 8F EE 18 88 EA  06 F0 42 81 02 F1 48 E4  _.x.......B...H.
       74 C1 17 A4 B1 38 90 B1  A1 84 33 10 89 9E 05 3D  t....8....3....=
       8C 86 31 15 10 B4 05 C4  73 94 93 78 59 65 A8 0B  ..1.....s..xYe..
       
       ============= instead found this: ==============  len: 64 (0x40)
       
       7E 25 23 C7 FA 00 06 83  38 8B 71 EF 0E 7B 98 27  ~%#.....8.q..{.'
       CD 5A 39 42 37 E0 28 9F  D4 54 81 0D 35 FA C5 F1  .Z9B7.(..T.␍5...
       9E E9 68 1D 0F 8C 68 B7  BD F1 85 26 10 1A 4D 64  ..h...h....&..Md
       36 21 9C 8F 43 47 68 AD  0E 75 F9 FE AE 3C 3A 56  6!..CGh..u...<:V
       
       ================================================

@jedisct1 (Contributor, Author) commented Nov 3, 2025

Thanks Jacob!

Unrolling twice seems to provide the best performance on both x86_64 and aarch64, and it matches what we already do for regular SHA-3.
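
For reference, a sketch of the 2x-unrolled pattern (illustrative stand-in round function and constants, not the actual kangarootwelve.zig code):

const std = @import("std");

// Placeholder constants; the real permutation iterates the 12
// TurboSHAKE round constants.
const RC = [_]u64{ 0x01, 0x8082, 0x800a, 0x8081 };

fn round(state: *u64, rc: u64) void {
    // Stand-in for one Keccak-p round.
    state.* = std.math.rotl(u64, state.* ^ rc, 1);
}

fn permute(state: *u64) void {
    var i: usize = 0;
    // Two rounds per iteration: halves the loop bookkeeping without
    // the code-size blowup of fully unrolling all rounds.
    while (i < RC.len) : (i += 2) {
        round(state, RC[i]);
        round(state, RC[i + 1]);
    }
}

test "permute runs" {
    var s: u64 = 0;
    permute(&s);
    try std.testing.expect(s != 0);
}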
