SHA-2 Maj, SHA-1 H, MD4 G trick

I just found this old message:

https://groups.google.com/g/crypto-optimization/c/ZjtgKWg8AnA

```
Wei Dai
Mar 28, 2009, 3:15:43 AM
to crypto-op...@googlegroups.com
[This is a repost that was originally written for the NIST SHA-3 forum.]

Dan Bernstein wrote:
> There was some discussion of SHA-256 performance on sci.crypt in 2003.
> Eric Young said he had code using 2891 instructions/block and running at
> 15.53 cycles/byte on an Opteron. The consensus was that 2824 arithmetic
> instructions were necessary; that means at least 14.7 cycles/byte on any
> current CPU.

Thanks for pointing to that discussion. After reading it, I thought of a
possible optimization that improves the 2824 instructions bound by 2*63 to
2698, and the theoretical speed limit to 14.1 cpb. From the newsgroup post
(http://groups.google.com/group/sci.crypt/browse_thread/thread/84bfde56e8ac6047/):

> Maj(x,y,z)
>
> We can write Maj(x,y,z) = x ^ ((x ^ y) & (x ^ z))
>
> Maj() then requires 3 XOR and 1 AND. We also need to make
> a copy of either x or y and a copy of either x or z.

We can instead write Maj(x,y,z) = y ^ ((x ^ y) & (y ^ z)). Then after
computing x^y, cache it until the next round, where it can be reused as y^z.
This reduces the number of XOR by 1, and the number of copies by 1, for
every round except the first
```

So we can write `Maj` in 4 basic operations (with a total circuit depth or latency of 3, though), and moreover with Wei Dai's trick have a common subexpression that would amortize this to 3 basic operations.

When we have at least `bitselect()` or the like, we don't need this, but when we don't we could give this a try.

Right now, when we don't use `bitselect()`, etc., we use this in C:

```
sha2.c: maj = (b & c) ^ (b & d) ^ (c & d);
simd-intrinsics.c:      maj = (b & c) ^ (b & d) ^ (c & d);
simd-intrinsics.c:#define Maj(x,y,z) vor(vand(x, y), vand(vor(x, y), z))
```

This is either 5 operations (but with depth of only 2, and there might be common subexpressions between steps - I didn't check) or 4 operations (with a depth of 3, and again I didn't check for common subexpressions). Maybe they can be reduced to amortized 3 operations (unless this is already achieved).

We use this in OpenCL:

```
opencl_sha2.h:#define Maj(x, y, z) ((x & y) | (z & (x | y)))
opencl_sha2_common.h:#define Maj(x, y, z) ((x & y) | (z & (x | y)))
pwsafe_kernel.cl:#define Maj(x, y, z) ((x & y) | (z & (x | y)))
```

This is the same as the 4 operations version in `simd-intrinsics.c`.

Why do we choose to write this differently in scalar vs. SIMD/OpenCL? Do we prioritize instruction-level parallelism over number of operations in scalar code (since have more execution units yet only one data stream)? This could actually make sense, but is worth testing both ways and with alternatives from this posting I referenced, and worth writing source code comments on (with the alternatives mentioned).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

SHA-2 Maj, SHA-1 H, MD4 G trick #4727

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

SHA-2 Maj, SHA-1 H, MD4 G trick #4727

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions