SHA-2 Maj, SHA-1 H, MD4 G trick #4727
Oh, I found that we actually mostly use the same 4-operation version in scalar code as well. The 5-operation one is in … So my main point is we may want to try these two alternative 4-operation versions from the posting I quoted.
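For a quick sanity check of the algebra, here is a minimal standalone C sketch (illustrative, not JtR code) verifying that two xor-form 4-operation candidates - the y-anchored form that appears in the diff below, plus an analogous x-anchored one - match the textbook majority function for every bit pattern:

```c
#include <stdio.h>
#include <stdint.h>

/* Reference majority function, written the classic way. */
static uint32_t maj_ref(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) | (x & z) | (y & z);
}

/* Two 4-operation xor-form candidates (depth 3, but with a common
 * subexpression that can be shared between consecutive rounds). */
static uint32_t maj_y(uint32_t x, uint32_t y, uint32_t z)
{
	return y ^ ((x ^ y) & (y ^ z));
}

static uint32_t maj_x(uint32_t x, uint32_t y, uint32_t z)
{
	return x ^ ((x ^ y) & (x ^ z));
}

int main(void)
{
	/* These are pure bitwise functions, so checking the 8 possible
	 * per-bit input combinations (as full-width masks) is exhaustive. */
	for (unsigned i = 0; i < 8; i++) {
		uint32_t x = -(uint32_t)(i >> 2 & 1);
		uint32_t y = -(uint32_t)(i >> 1 & 1);
		uint32_t z = -(uint32_t)(i & 1);
		if (maj_y(x, y, z) != maj_ref(x, y, z) ||
		    maj_x(x, y, z) != maj_ref(x, y, z))
			return puts("mismatch"), 1;
	}
	puts("all Maj variants agree");
	return 0;
}
```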
So this gives me slightly smaller code and about a 1% speedup for sha256crypt, sha512crypt, and Bitcoin on AVX2:

```diff
+++ b/src/simd-intrinsics.c
@@ -1783,7 +1783,7 @@ void SIMDSHA1body(vtype* _data, uint32_t *out, uint32_t *reload_state,
#elif !VCMOV_EMULATED
#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
#else
-#define Maj(x,y,z) vor(vand(x, y), vand(vor(x, y), z))
+#define Maj(x,y,z) vxor(y, vand(vxor(x, y), vxor(y, z)))
#endif
#define Ch(x,y,z) vcmov(y, z, x)
@@ -2418,6 +2418,7 @@ void SIMDSHA256body(vtype *data, uint32_t *out, uint32_t *reload_state, unsigned
)
#endif
+#if 0
#ifdef vternarylogic
#define Maj(x,y,z) vternarylogic(x, y, z, 0xE8)
#elif !VCMOV_EMULATED
@@ -2427,6 +2428,7 @@ void SIMDSHA256body(vtype *data, uint32_t *out, uint32_t *reload_state, unsigned
#endif
#define Ch(x,y,z) vcmov(y, z, x)
+#endif
#define SHA512_PARA_DO(x) for (x = 0; x < SIMD_PARA_SHA512; ++x)
```

As you can see, I simply `#if 0`'d out the second, now-duplicate set of definitions. I need to also update the OpenCL and scalar code, and test this more.
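An aside on the `vternarylogic(x, y, z, 0xE8)` form in the diff above: the immediate is simply Maj's 8-entry truth table. Here is a small C sketch (illustrative, not JtR code) deriving it, assuming the usual vpternlog convention that the first operand supplies bit 2 of the table index; since Maj is symmetric, any operand ordering produces the same table anyway:

```c
#include <stdio.h>

/* Derive the immediate vpternlog/vternarylogic wants for Maj: bit
 * ((x << 2) | (y << 1) | z) of the immediate holds Maj's output for
 * that combination of input bits. */
int main(void)
{
	unsigned imm = 0;
	for (unsigned i = 0; i < 8; i++) {
		unsigned x = i >> 2 & 1, y = i >> 1 & 1, z = i & 1;
		if (x + y + z >= 2) /* at least two of the three bits set */
			imm |= 1u << i;
	}
	printf("Maj immediate = 0x%02X\n", imm); /* prints 0xE8 */
	return 0;
}
```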
Looks like there were two files merged into one. Perhaps the optimal SHA-512 macros could end up different from the SHA-256 ones on some arch, but then they'd need to be undefined first.
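A minimal sketch of that "undefined first" arrangement, reusing the macro names from the diff (hypothetical layout, not actual JtR code):

```c
/* The SHA-256 section defined these earlier in the same file, so a
 * SHA-512 section wanting different variants must drop them first: */
#undef Maj
#undef Ch
#define Maj(x, y, z) vcmov(x, y, vxor(z, y)) /* hypothetical SHA-512 choice */
#define Ch(x, y, z) vcmov(y, z, x)
```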
Maybe there's a reason why Wei Dai called this the "SHA Maj trick", not the "SHA-2 Maj trick": SHA-1's H() is the same as SHA-2's Maj(). However, trying to apply this trick to our SHA-1 for AVX2 actually results in a slight code size increase without an obvious effect on speed on Haswell. This is without explicit caching/reuse - letting the compiler possibly do it. We may want to retest this across several compiler versions.

BTW, the last time we seriously discussed related issues appears to be this thread: https://www.openwall.com/lists/john-dev/2015/09/02/5

What I don't understand is why we choose to explicitly save a temporary value (that we only use once), e.g. in this macro (and in many others like it):

```c
#define SHA1_H(x,y,z) \
	tmp[i] = vand((x[i]),(y[i])); \
	tmp[i] = vor((tmp[i]),vand(vor((x[i]),(y[i])),(z[i])));
```

BTW, my attempted alternatives to it were:

```c
#define SHA1_H(x,y,z) \
	tmp[i] = vxor((y[i]), vand(vxor((x[i]), (y[i])), vxor((y[i]), z[i])));
```

and:

```c
#define SHA1_H(x,y,z) \
	tmp[i] = vxor((x[i]), (y[i])); \
	tmp[i] = vxor((y[i]), vand((tmp[i]), vxor((y[i]), z[i])));
```

I also don't understand why we put the extra parentheses around the indexed arguments like `(x[i])`. As mentioned in that john-dev thread I referenced, we can/should also try re-ordering the arguments to Maj() and H() in all 6 possible ways to pick the fastest ordering.
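Since H() (like Maj) is a fully symmetric function, all 6 argument orderings compute the same result, so reordering is purely a code generation experiment. A tiny standalone C check (illustrative, not JtR code) of a scalar model of the first macro above:

```c
#include <stdio.h>
#include <stdint.h>

/* Scalar model of the SHA1_H macro quoted above, with the vector
 * tmp[i] indexing dropped for clarity. */
static uint32_t h(uint32_t x, uint32_t y, uint32_t z)
{
	uint32_t tmp = x & y;
	return tmp | ((x | y) & z);
}

int main(void)
{
	/* Exhaustive per-bit check that every argument ordering agrees. */
	for (unsigned i = 0; i < 8; i++) {
		uint32_t x = -(uint32_t)(i >> 2 & 1);
		uint32_t y = -(uint32_t)(i >> 1 & 1);
		uint32_t z = -(uint32_t)(i & 1);
		uint32_t r = h(x, y, z);
		if (h(x, z, y) != r || h(y, x, z) != r || h(y, z, x) != r ||
		    h(z, x, y) != r || h(z, y, x) != r)
			return puts("mismatch"), 1;
	}
	puts("all 6 argument orderings agree");
	return 0;
}
```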
The code size increase is avoided by also reordering the arguments (e.g. to …).
Looks like explicit caching is sometimes needed. Experimenting with …
That thread led to #1727.
...and revisiting #1727, I was reminded …
Thanks, @magnumripper! I've just tried patching my generic code. Committed SHA-256 changes into yescrypt (not yet updated in JtR) and yespower:

```diff
+++ sha256.c
@@ -101,7 +101,11 @@ static const uint32_t Krnd[64] = {
/* Elementary functions used by SHA256 */
#define Ch(x, y, z) ((x & (y ^ z)) ^ z)
-#define Maj(x, y, z) ((x & (y | z)) | (y & z))
+#if 1 /* Explicit caching/reuse of common subexpression between rounds */
+#define Maj(x, y, z) (y ^ ((x_xor_y = x ^ y) & y_xor_z))
+#else /* Let the compiler cache/reuse or not */
+#define Maj(x, y, z) (y ^ ((x ^ y) & (y ^ z)))
+#endif
#define SHR(x, n) (x >> n)
#define ROTR(x, n) ((x >> n) | (x << (32 - n)))
#define S0(x) (ROTR(x, 2) ^ ROTR(x, 13) ^ ROTR(x, 22))
@@ -113,7 +117,8 @@ static const uint32_t Krnd[64] = {
#define RND(a, b, c, d, e, f, g, h, k) \
h += S1(e) + Ch(e, f, g) + k; \
d += h; \
- h += S0(a) + Maj(a, b, c);
+ h += S0(a) + Maj(a, b, c); \
+ y_xor_z = x_xor_y;
/* Adjusted round function for rotating state */
#define RNDr(S, W, i, ii) \
@@ -146,6 +151,7 @@ SHA256_Transform(uint32_t state[static restrict 8],
/* 3. Mix. */
for (i = 0; i < 64; i += 16) {
+ uint32_t x_xor_y, y_xor_z = S[(65 - i) % 8] ^ S[(66 - i) % 8];
RNDr(S, W, 0, i);
RNDr(S, W, 1, i);
 	RNDr(S, W, 2, i);
```

Experimental MD4 changes:

```diff
+++ md4.c	2021-05-18 21:14:11 +0200
@@ -48,7 +48,11 @@
* optimization for F borrowed from Colin Plumb's MD5 implementation.
*/
#define F(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
-#define G(x, y, z) (((x) & ((y) | (z))) | ((y) & (z)))
+#if 1 /* Explicit caching/reuse of common subexpression between rounds */
+#define G(x, y, z) (r = ((y) ^ ((n = (x) ^ (y)) & p)), p = n, r)
+#else /* Let the compiler cache/reuse or not */
+#define G(x, y, z) ((y) ^ (((x) ^ (y)) & ((y) ^ (z))))
+#endif
#define H(x, y, z) ((x) ^ (y) ^ (z))
/*
@@ -96,7 +100,7 @@
static const void *body(MD4_CTX *ctx, const void *data, unsigned long size)
{
const unsigned char *ptr;
- MD4_u32plus a, b, c, d;
+ MD4_u32plus a, b, c, d, n, p, r;
MD4_u32plus saved_a, saved_b, saved_c, saved_d;
const MD4_u32plus ac1 = 0x5a827999, ac2 = 0x6ed9eba1;
@@ -132,6 +136,7 @@
STEP(F, b, c, d, a, SET(15), 19)
/* Round 2 */
+ p = c ^ d;
STEP(G, a, b, c, d, GET(0) + ac1, 3)
STEP(G, d, a, b, c, GET(4) + ac1, 5)
 	STEP(G, c, d, a, b, GET(8) + ac1, 9)
```

For MD4, I think we'll need to keep the old …
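Why the `y_xor_z = x_xor_y;` hand-off in the sha256.c patch above is valid: each round rotates the state, so the next round's Maj sees this round's (x, y) as its (y, z), and the x ^ y just computed is exactly the y ^ z the next round needs. MD4's comma-expression G() achieves the same carry within a single expression. A minimal standalone C sketch (the round mixing below is fake, just enough to rotate state; not real SHA-256):

```c
#include <stdio.h>
#include <stdint.h>

/* Maj with the common subexpression carried across rounds: the caller
 * keeps y_xor_z alive, and each call leaves next round's value in it. */
static uint32_t maj_cached(uint32_t x, uint32_t y, uint32_t *y_xor_z)
{
	uint32_t x_xor_y = x ^ y;
	uint32_t r = y ^ (x_xor_y & *y_xor_z);
	*y_xor_z = x_xor_y; /* becomes y ^ z after the state rotates */
	return r;
}

static uint32_t maj_plain(uint32_t x, uint32_t y, uint32_t z)
{
	return y ^ ((x ^ y) & (y ^ z));
}

int main(void)
{
	uint32_t a = 0x6a09e667, b = 0xbb67ae85, c = 0x3c6ef372;
	uint32_t y_xor_z = b ^ c; /* seeded once, as in the patch */

	for (int round = 0; round < 8; round++) {
		if (maj_cached(a, b, &y_xor_z) != maj_plain(a, b, c))
			return puts("mismatch"), 1;
		/* Fake new working value, then rotate (a, b, c) the way
		 * SHA-256 does: next round's (y, z) is this round's (x, y). */
		uint32_t a_new = a + b + c + (uint32_t)round * 0x9e3779b9;
		c = b;
		b = a;
		a = a_new;
	}
	puts("cached Maj matches plain Maj across rotating rounds");
	return 0;
}
```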
This patch has been cherry-picked from: openwall/yescrypt@9edf51061b45
This commit just adds the code; does not enable it. But it does add LUT3 for pwsafe-opencl on nvidias - that was missing! See openwall#4727
This commit just adds the code; does not enable it. But it does add LUT3 for pwsafe-opencl on nvidias - that was missing! See #4727
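For reference, LUT3 on NVIDIA comes down to the `lop3.b32` PTX instruction (Maxwell and later). A hedged sketch of such an OpenCL helper - the macro name and formulation here are illustrative, not necessarily the exact code added in that commit:

```c
/* Sketch: 3-input LUT via inline PTX on NVIDIA. The 8-bit immediate is
 * the truth table, e.g. 0xE8 for Maj (derived earlier in this thread)
 * and 0xCA for Ch. */
#define lut3(a, b, c, imm) ({ \
		uint r; \
		asm("lop3.b32 %0, %1, %2, %3, " #imm ";" \
		    : "=r" (r) \
		    : "r" (a), "r" (b), "r" (c)); \
		r; \
	})

#define Maj(x, y, z) lut3(x, y, z, 0xE8)
#define Ch(x, y, z) lut3(x, y, z, 0xCA)
```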
It is now. We should also make similar changes to …
I just found this old message:
https://groups.google.com/g/crypto-optimization/c/ZjtgKWg8AnA

So we can write `Maj` in 4 basic operations (with a total circuit depth or latency of 3, though), and moreover, with Wei Dai's trick, have a common subexpression that would amortize this to 3 basic operations. When we have at least `bitselect()` or the like, we don't need this, but when we don't, we could give this a try.

Right now, when we don't use `bitselect()`, etc., we use this in C:

…

This is either 5 operations (but with a depth of only 2, and there might be common subexpressions between steps - I didn't check) or 4 operations (with a depth of 3, and again I didn't check for common subexpressions). Maybe they can be reduced to amortized 3 operations (unless this is already achieved).

We use this in OpenCL:

…

This is the same as the 4 operations version in `simd-intrinsics.c`.

Why do we choose to write this differently in scalar vs. SIMD/OpenCL? Do we prioritize instruction-level parallelism over number of operations in scalar code (since we have more execution units yet only one data stream)? This could actually make sense, but it is worth testing both ways and with the alternatives from the posting I referenced, and worth writing source code comments on (with the alternatives mentioned).
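One way a bitselect()-based Maj can be written - a standalone C sketch that models OpenCL's bitselect() semantics and checks one candidate against the 4-operation xor form (an illustration; not necessarily the exact macro our OpenCL kernels use):

```c
#include <stdio.h>
#include <stdint.h>

/* Scalar model of OpenCL's bitselect(a, b, c): each result bit comes
 * from b where c has a 1, and from a where c has a 0. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
	return (a & ~c) | (b & c);
}

int main(void)
{
	/* If x == z the majority is x; otherwise x != z and the majority
	 * is y. That is precisely a bit-select of y by the mask x ^ z. */
	for (unsigned i = 0; i < 8; i++) {
		uint32_t x = -(uint32_t)(i >> 2 & 1);
		uint32_t y = -(uint32_t)(i >> 1 & 1);
		uint32_t z = -(uint32_t)(i & 1);
		uint32_t maj = y ^ ((x ^ y) & (y ^ z));
		if (bitselect32(x, y, x ^ z) != maj)
			return puts("mismatch"), 1;
	}
	puts("bitselect(x, y, x ^ z) == Maj(x, y, z)");
	return 0;
}
```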