
SHA-2 Maj, SHA-1 H, MD4 G trick #4727

Open
@solardiz

Description

I just found this old message:

https://groups.google.com/g/crypto-optimization/c/ZjtgKWg8AnA

Wei Dai
Mar 28, 2009, 3:15:43 AM
to crypto-op...@googlegroups.com
[This is a repost that was originally written for the NIST SHA-3 forum.]

Dan Bernstein wrote:
> There was some discussion of SHA-256 performance on sci.crypt in 2003.
> Eric Young said he had code using 2891 instructions/block and running at
> 15.53 cycles/byte on an Opteron. The consensus was that 2824 arithmetic
> instructions were necessary; that means at least 14.7 cycles/byte on any
> current CPU.

Thanks for pointing to that discussion. After reading it, I thought of a
possible optimization that improves the 2824 instructions bound by 2*63 to
2698, and the theoretical speed limit to 14.1 cpb. From the newsgroup post
(http://groups.google.com/group/sci.crypt/browse_thread/thread/84bfde56e8ac6047/):

> Maj(x,y,z)
>
> We can write Maj(x,y,z) = x ^ ((x ^ y) & (x ^ z))
>
> Maj() then requires 3 XOR and 1 AND. We also need to make
> a copy of either x or y and a copy of either x or z.

We can instead write Maj(x,y,z) = y ^ ((x ^ y) & (y ^ z)). Then after
computing x^y, cache it until the next round, where it can be reused as y^z.
This reduces the number of XOR by 1, and the number of copies by 1, for
every round except the first.

So we can write Maj in 4 basic operations (though with a total circuit depth, or latency, of 3), and moreover with Wei Dai's trick we get a common subexpression across rounds that amortizes this to 3 basic operations per round.
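
For illustration, here's a minimal standalone sketch of the caching idea (not John the Ripper code; the variable names and the toy state-update formula are made up), assuming the usual SHA-2-style rotation where this round's (x, y, z) becomes next round's (new value, x, y):

```c
#include <stdint.h>
#include <stdio.h>

#define MAJ_REF(x, y, z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))

int main(void)
{
    uint32_t x = 0x6a09e667, y = 0xbb67ae85, z = 0x3c6ef372;
    uint32_t x_xor_y = x ^ y;   /* computed from scratch only once */
    uint32_t y_xor_z = y ^ z;   /* needed from scratch only for round 0 */

    for (int round = 0; round < 16; round++) {
        /* Wei Dai's form: y ^ ((x^y) & (y^z)), with (y^z) carried over
         * from the previous round */
        uint32_t maj = y ^ (x_xor_y & y_xor_z);
        if (maj != MAJ_REF(x, y, z)) {
            printf("mismatch in round %d\n", round);
            return 1;
        }
        /* Toy state update: a new word enters, x and y shift down by one */
        uint32_t new_x = maj + ((x << 1) | (x >> 31)) + 0x9e3779b9u * round;
        z = y;
        y = x;
        x = new_x;
        y_xor_z = x_xor_y;      /* the old x^y is exactly the new y^z */
        x_xor_y = x ^ y;        /* the one fresh XOR needed per round */
    }
    printf("cached-XOR Maj matched the reference in all 16 rounds\n");
    return 0;
}
```

Per round, Maj() then costs one fresh XOR, one AND, and one more XOR, which is the amortized 3 operations.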

When we have bitselect() or the like available, we don't need this, but when we don't, we could give this a try.
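
For completeness, here's a sketch of the bitselect() case (not a quote from our kernels; the macro name and demo kernel are made up). OpenCL's bitselect(a, b, c) takes each bit from b where the corresponding bit of c is 1 and from a elsewhere, so Maj() collapses to a single builtin call:

```c
#define Maj_bsl(x, y, z) bitselect((x), (y), (z) ^ (x))

__kernel void maj_demo(__global uint *out, uint x, uint y, uint z)
{
    /* Where z ^ x is 1, x != z and the majority is y; where it is 0,
     * x == z and the majority is x, which is what bitselect() picks. */
    out[0] = Maj_bsl(x, y, z);
}
```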

Right now, when we don't use bitselect(), etc., we use this in C:

sha2.c: maj = (b & c) ^ (b & d) ^ (c & d);
simd-intrinsics.c:      maj = (b & c) ^ (b & d) ^ (c & d);
simd-intrinsics.c:#define Maj(x,y,z) vor(vand(x, y), vand(vor(x, y), z))

This is either 5 operations (where the three ANDs are mutually independent, giving more instruction-level parallelism, although the critical-path depth is still 3; there might also be common subexpressions between steps - I didn't check) or 4 operations (also with a depth of 3, and again I didn't check for common subexpressions). Maybe they can be reduced to amortized 3 operations (unless this is already achieved).
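
As a sanity check on the forms being compared here and in the quoted message, a quick standalone test (not project code; macro names are made up) that they all compute the same function, with operation counts in the comments:

```c
#include <stdio.h>

/* 5 ops (3 AND + 2 XOR); the three ANDs are independent */
#define MAJ_XOR(x, y, z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
/* 4 ops; the current SIMD/OpenCL form */
#define MAJ_OR(x, y, z)  (((x) & (y)) | ((z) & ((x) | (y))))
/* 4 ops; the form from the quoted sci.crypt post */
#define MAJ_SC(x, y, z)  ((x) ^ (((x) ^ (y)) & ((x) ^ (z))))
/* 4 ops; Wei Dai's variant, where (x) ^ (y) can be cached across rounds */
#define MAJ_WD(x, y, z)  ((y) ^ (((x) ^ (y)) & ((y) ^ (z))))

int main(void)
{
    /* These expressions apply the same boolean function to each bit
     * position independently, so trying all 8 combinations of one-bit
     * inputs is an exhaustive check. */
    for (unsigned v = 0; v < 8; v++) {
        unsigned x = v & 1, y = (v >> 1) & 1, z = (v >> 2) & 1;
        unsigned ref = MAJ_XOR(x, y, z);
        if (MAJ_OR(x, y, z) != ref || MAJ_SC(x, y, z) != ref ||
            MAJ_WD(x, y, z) != ref) {
            printf("mismatch at x=%u y=%u z=%u\n", x, y, z);
            return 1;
        }
    }
    printf("all Maj() forms agree\n");
    return 0;
}
```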

We use this in OpenCL:

opencl_sha2.h:#define Maj(x, y, z) ((x & y) | (z & (x | y)))
opencl_sha2_common.h:#define Maj(x, y, z) ((x & y) | (z & (x | y)))
pwsafe_kernel.cl:#define Maj(x, y, z) ((x & y) | (z & (x | y)))

This is the same as the 4-operation version in simd-intrinsics.c.

Why do we choose to write this differently in scalar vs. SIMD/OpenCL code? Do we prioritize instruction-level parallelism over the number of operations in scalar code (since we have more execution units yet only one data stream)? This could actually make sense, but it is worth testing both ways, as well as with the alternatives from the posting I referenced, and worth documenting in source code comments (with the alternatives mentioned).
