LM-opencl benchmark reports unrealistic speeds on small devices #4871

Closed
solardiz opened this issue Nov 11, 2021 · 18 comments · Fixed by #4878

@solardiz
Member

Matus UHLAR reported this to me via private e-mail after I questioned the benchmark results he posted on the wiki (for several small devices, not only NVIDIA):

uhlar@fhome% ~/src/debian/john/bleeding-jumbo/run/john -test -format=descrypt-opencl
Device 1: GeForce GT 1030
Benchmarking: descrypt-opencl, traditional crypt(3) [DES OpenCL/mask
accel]... LWS=64 GWS=16384
Note: Building per-salt kernels. This takes e.g. 2 hours for 4096 salts.
DONE
Warning: "Many salts" test limited: 19/256
Many salts:     40267K c/s real, 40067K c/s virtual
Only one salt:  36687K c/s real, 36687K c/s virtual
 
uhlar@fhome% ~/src/debian/john/bleeding-jumbo/run/john -test -format=LM-opencl
Device 1: GeForce GT 1030
Benchmarking: LM-opencl [DES BS OpenCL/mask accel]... LWS=96 GWS=65536
DONE
Raw:    26336M c/s real, 25323M c/s virtual
uhlar@fhome% ~/src/debian/john/bleeding-jumbo/run/john -test=15 -format=descrypt-opencl
Device 1: GeForce GT 1030
Benchmarking: descrypt-opencl, traditional crypt(3) [DES OpenCL/mask
accel]... LWS=64 GWS=16384
Note: Building per-salt kernels. This takes e.g. 2 hours for 4096 salts.
DONE
Warning: "Many salts" test limited: 136/256
Many salts:     38417K c/s real, 38443K c/s virtual
Only one salt:  36157K c/s real, 36109K c/s virtual
 
uhlar@fhome% ~/src/debian/john/bleeding-jumbo/run/john -test=15 -format=LM-opencl
Device 1: GeForce GT 1030
Benchmarking: LM-opencl [DES BS OpenCL/mask accel]... LWS=128 GWS=65536
DONE
Raw:    25363M c/s real, 24427M c/s virtual
 
uhlar@fhome% ~/src/debian/john/bleeding-jumbo/run/john -test=60 -format=LM-opencl
Device 1: GeForce GT 1030
Benchmarking: LM-opencl [DES BS OpenCL/mask accel]... LWS=96 GWS=65536
DONE
Raw:    25432M c/s real, 24475M c/s virtual

Actual cracking:

uhlar@fhome% ~/src/debian/john/bleeding-jumbo/run/john -session:lm -format=LM-opencl -mask='?a?a?a?a?a?a' lm
Device 1: GeForce GT 1030
Using default input encoding: UTF-8
Using default target encoding: CP850
Loaded 2 password hashes with no different salts (LM-opencl [DES BS OpenCL])
Remaining 1 password hash
LWS=96 GWS=524288
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:43 6.34% (ETA: 18:25:30) 0g/s 159005Kp/s 159005Kc/s 159005KC/s
A9>#7O..A9>#7O
V#-MNU           (1)
1g 0:00:03:37 DONE (2021-10-30 18:17) 0.004594g/s 159207Kp/s 159207Kc/s
159207KC/s V#-MNU..AZ<_OU
Use the "--show --format=LM-opencl" options to display all of the cracked
passwords reliably
Session completed.

So the descrypt-opencl benchmark speed is sane, and lm-opencl's actual cracking speed is even somewhat low for this device (could be ~1G, achieved ~160M), but the lm-opencl benchmark speed is insane for such a small device (~25G). I am puzzled by this.

This is sort of the opposite of #4381, but I guess the underlying causes of these two issues are independent.

@solardiz solardiz added the bug label Nov 11, 2021
@solardiz solardiz added this to the Potentially 1.9.0-jumbo-2 milestone Nov 11, 2021
@fanto666

These all happened on a Debian 11 amd64 installation with Debian-provided packages:
ocl-icd-opencl-dev 2.2.14-2
ocl-icd-libopencl1 2.2.14-2
opencl-c-headers 3.0~2020.12.18
opencl-clhpp-headers 3.0~2.0.13-1
and either
nvidia-opencl-common 460.91.03-1
nvidia-opencl-icd 460.91.03-1
or
intel-opencl-icd 20.44.18297-1

john was compiled with gcc-10.

No foreign development packages were installed.

@solardiz
Member Author

I am able to reproduce this with Intel's OpenCL using a CPU or Intel HD Graphics. With this debugging printf:

+++ b/src/opencl_lm_b_plug.c
@@ -1295,6 +1295,7 @@ static int lm_crypt(int *pcount, struct db_salt *salt)
                BENCH_CLERROR(clEnqueueWriteBuffer(queue[gpu_id], buffer_hash_ids, CL_TRUE, 0, sizeof(cl_uint), zero_buffer, 0, NULL
        }
 
+       printf("%d %d %d\n", mask_mode, *pcount, mask_int_cand.num_int_cand);
        *pcount *= mask_int_cand.num_int_cand;

When running the benchmark (with mask), I am getting (on CPU):

LWS=256 GWS=8192
0 262144 676

During actual cracking (with the same ?a?a?l?u?d?d?s mask that the benchmark uses), I am getting:

LWS=128 GWS=32768
1 32768 676

The most relevant difference appears to be that during the benchmark the original *pcount is multiplied by 32 (the bitslice depth) relative to GWS, whereas during actual cracking it is not. Such multiplication actually makes sense: GWS is the number of work items, and each of those is supposed to process 32 candidate passwords in parallel, in the different bit positions of 32-bit integers. Yet we're getting sane behavior and speeds during actual cracking without such multiplication. That's puzzling.
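
To spell out the arithmetic behind those printfs (just an illustration, not the actual code; the variable names are made up):

        /* Benchmark (mask_mode == 0): *pcount arrives as GWS * 32 (bitslice depth),
           i.e. 262144 = 8192 * 32, and is then multiplied by 676.
           Actual cracking (mask_mode == 1): *pcount arrives as just GWS,
           i.e. 32768, and is then multiplied by 676. */
        int bench_candidates = 8192 * 32 * 676;  /* what the benchmark speed gets computed from */
        int crack_candidates = 32768 * 676;      /* what the actual cracking speed gets computed from */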

Without mask, benchmark (-test -no-mask):

LWS=256 GWS=131072
0 4194304 1

Actual cracking (-inc -length=7):

LWS=128 GWS=524288
0 16777216 1

Here the multiplication by 32 happens in both cases.

While I am still puzzled by what happens with mask, I do have some observations:

opencl_lm_b_plug.c: lm_crypt() doesn't actually use *pcount - it only updates it. I guess it uses the global GWS instead, always computing for the maximum number of candidate passwords that it's been tuned for. This feels wrong to me - likely occasional use of uninitialized values? This is different from what we have in descrypt-opencl, which does use *pcount in at least some way.

Also, this scaling:

        *pcount *= mask_int_cand.num_int_cand;

is unconditional in lm-opencl, but is conditional on mask_mode in descrypt-opencl. Making it conditional on mask_mode in lm-opencl would result in us reporting lower-than-actual benchmark speeds, because that flag is never set in the lm-opencl benchmark. I wonder why these two formats differ here.
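
For reference, the descrypt-opencl-style conditional scaling would look roughly like this (a sketch, not a tested patch):

        if (mask_mode)
                *pcount *= mask_int_cand.num_int_cand;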

We have many instances of these in mask-aware formats:

#if SIZEOF_SIZE_T > 4
        /* We can't process more than 4G keys per crypt() */
        while (gws_limit * mask_int_cand.num_int_cand > 0xffffffffUL)
                gws_limit >>= 1;
#endif  

However, our crypt_all still uses int for *pcount and for its return value. So by allowing for up to 4G rather than 2G there, we accept triggering int overflows, which we then hopefully correct with implicit typecasts/promotions, e.g. in the call to status_update_crypts:

        count = crk_key_index;
        match = crk_methods.crypt_all(&count, salt);
        crk_last_key = count;

        status_update_crypts((uint64_t)salt->count * count, count);

However, formally this is undefined behavior, and we probably shouldn't allow values this high, or else we should update that function's prototype.
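
One way to stay on the safe side while keeping the existing pattern would be to cap these loops at 2G (INT_MAX) instead of 4G (a sketch only, not tested):

#if SIZEOF_SIZE_T > 4
        /* Sketch: stay within int range so that *pcount and crypt_all()'s
           return value cannot overflow */
        while (gws_limit * mask_int_cand.num_int_cand > 0x7fffffffUL)
                gws_limit >>= 1;
#endif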

@solardiz
Member Author

In the descrypt-opencl benchmark with mask, its mask_mode flag is actually set. There are several differences between the two formats' implementations of reset(), but the relevant one appears to be that in lm-opencl we only set the flag when db->real && db == db->real:

        if (db->real && db == db->real) {
[...]
                if (options.flags & FLG_MASK_CHK) {
                        mask_mode = 1;

Perhaps that's wrong.

@solardiz
Member Author

This fixes the issue for me:

+++ b/src/opencl_lm_b_plug.c
@@ -1158,7 +1158,7 @@ static char *get_key_mm(int index)
 
 static void reset(struct db_main *db)
 {
-       if (db->real && db == db->real) {
+       if (!self_test_running) {
                struct db_salt *salt;
                unsigned int *bitmaps = NULL;
                OFFSET_TABLE_WORD *offset_table = NULL;

@fanto666 Can you test this patch, please?

I am still puzzled about the benchmark results looking realistic on larger devices. We'll need to also see the effect of the above change on those.

@solardiz
Member Author

I am still puzzled about the benchmark results looking realistic on larger devices. We'll need to also see the effect of the above change on those.

On Titan X Maxwell, this changes the benchmark speed from ~6000M to ~4400M. The latter is similar to what I'm getting in actual cracking with mask.

@magnumripper
Member

magnumripper commented Nov 16, 2021

This fixes the issue for me

I see no immediate problem with that change, but it doesn't address some of the other weirdness. This is particularly strange:

	const int count = mask_mode ?
		*pcount : (*pcount + LM_DEPTH - 1) >> LM_LOG_DEPTH;

Why would mask_mode ever be a parameter in that? I can't believe mask_mode == running bitsliced (or not), or is it really?

@solardiz
Member Author

Of course, the code is weird, and lacks comments.

	const int count = mask_mode ?
		*pcount : (*pcount + LM_DEPTH - 1) >> LM_LOG_DEPTH;

That line is dead code - count is never used. (However, descrypt-opencl has similar, and actually uses the computed count.)

Why would mask_mode ever be a parameter in that? I can't believe mask_mode == running bitsliced (or not), or is it really?

There's no code to run non-bitsliced. However, apparently when running with a mask, the bit depths are filled using the mask multiplier. This probably means we have an efficiency loss when the mask multiplier isn't a multiple of 32 (which it usually isn't). For example, for the 676 seen with our default benchmark mask, the actual number of hashes computed is probably 704, and if so, 28 hash computations, or almost 4% of the total, are wasted. I didn't verify this and don't recall past discussions of it - but it's the only plausible explanation I have of what I saw in @sayan1an's host code. Despite this wastage, it might be the most efficient way to implement mask there (considering locality of reference).
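
Spelling out that padding arithmetic (illustrative only, and assuming candidates are indeed padded up to the bitslice depth):

        /* Assumption: the candidate count gets rounded up to a multiple of LM_DEPTH (32) */
        int num_int_cand = 676;                 /* benchmark mask multiplier */
        int padded = (676 + 32 - 1) / 32 * 32;  /* = 704 */
        int wasted = padded - num_int_cand;     /* = 28, or ~4% of 704 */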

@fanto666

Retried with the one-line patch from
#4871 (comment)

% ~/src/debian/john/bleeding-jumbo/run/john -test -format=LM-opencl
Device 1: GeForce GT 1030
Benchmarking: LM-opencl [DES BS OpenCL/mask accel]... LWS=192 GWS=65536
DONE
Raw: 883166K c/s real, 843562K c/s virtual

Breaking a random hash gives different results:

% ~/src/debian/john/bleeding-jumbo/run/john -format=LM-opencl -session:lm1 -mask='?a?a?a?a?a?a?a?a?a?a?a?a?a?a' lm1
Device 1: GeForce GT 1030
Using default input encoding: UTF-8
Using default target encoding: CP850
Loaded 2 password hashes with no different salts (LM-opencl [DES BS OpenCL])
LWS=96 GWS=524288
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:19 0.04% (ETA: 06:18:26) 0g/s 160619Kp/s 160619Kc/s 335102KC/s AT\V11A..AT\V11A
0g 0:00:01:02 0.14% (ETA: 06:31:59) 0g/s 164541Kp/s 164541Kc/s 329083KC/s AY8O.RA..AY8O.RA
Session aborted

I can try on other GPUs if needed

@solardiz
Member Author

Thanks @fanto666. This remaining discrepancy is weird, yet the -test speed you report now is sane - it is consistent with what you're getting for descrypt-opencl (by design, LM is supposed to be about 25 times faster).

BTW, there's no point in using a mask longer than 7 with LM hashes since we're cracking their halves. Can you try with a length 7 mask? In particular, can you try with ?a?a?l?u?d?d?s?s, which is the mask used by -test.

Also, can you try actual cracking with descrypt-opencl?

Overall, I think my fix is right and it has worked, and there's something else causing the remaining discrepancy.

@solardiz
Member Author

@magnumripper BTW, shouldn't we be rejecting attempted use of a mask longer than the format's maximum?

@fanto666

% ~/src/debian/john/bleeding-jumbo/run/john -format=LM-opencl -session:lm1 -mask='?a?a?a?a?a?a?a' lm1
Device 1: GeForce GT 1030
Using default input encoding: UTF-8
Using default target encoding: CP850
Loaded 2 password hashes with no different salts (LM-opencl [DES BS OpenCL])
LWS=96 GWS=524288
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:01:30 0.20% (ETA: 09:13:32) 0g/s 164399Kp/s 164399Kc/s 328798KC/s AS$?F0A..AS$?F0A
0g 0:00:01:40 0.22% (ETA: 09:20:59) 0g/s 162791Kp/s 162791Kc/s 325582KC/s A-#MP2A..A-#MP2A
Session aborted

% ~/src/debian/john/bleeding-jumbo/run/john -format=LM-opencl -session:lm1 -mask='?a?a?l?u?d?d?s?s' lm1
Device 1: GeForce GT 1030
Using default input encoding: UTF-8
Using default target encoding: CP850
Loaded 2 password hashes with no different salts (LM-opencl [DES BS OpenCL])
LWS=192 GWS=65536
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:12 DONE (2021-11-17 20:41) 0g/s 885069Kp/s 885069Kc/s 1770MC/s A?AJ46|..A?AJ46|
Session completed.

% ~/src/debian/john/bleeding-jumbo/run/john -format=LM-opencl -session:lm1 -mask='?a?a?a?a?a?a?a?a' lm1
Device 1: GeForce GT 1030
Using default input encoding: UTF-8
Using default target encoding: CP850
Loaded 2 password hashes with no different salts (LM-opencl [DES BS OpenCL])
LWS=96 GWS=524288
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:49 0.11% (ETA: 09:11:53) 0g/s 165375Kp/s 165375Kc/s 330750KC/s AP3XMNA..AP3XMNA
0g 0:00:00:53 0.12% (ETA: 09:25:24) 0g/s 162450Kp/s 162450Kc/s 324900KC/s A:%XNA..A:%XNA
Session aborted

@fanto666

For DES I get about the same numbers:

% ~/src/debian/john/bleeding-jumbo/run/john -format=descrypt-opencl -session:des -mask='?a?a?a?a?a?a?a?a' des
Device 1: GeForce GT 1030
Using default input encoding: UTF-8
Loaded 1 password hash (descrypt-opencl, traditional crypt(3) [DES OpenCL])
LWS=64 GWS=65536
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:33 0.00% (ETA: 2027-08-27 16:07) 0g/s 36192Kp/s 36192Kc/s 36192KC/s a!C,caaa..a!C,caaa

uhlar@fhome% ~/src/debian/john/bleeding-jumbo/run/john -format=descrypt-opencl -session:des -mask='?a?a?a?a?a?a?a' des
Device 1: GeForce GT 1030
Using default input encoding: UTF-8
Loaded 1 password hash (descrypt-opencl, traditional crypt(3) [DES OpenCL])
LWS=64 GWS=65536
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:54 0.00% (ETA: 2021-12-10 02:54) 0g/s 36237Kp/s 36237Kc/s 36237KC/s aPOlbaa..aPOlbaa

% ~/src/debian/john/bleeding-jumbo/run/john -format=descrypt-opencl -session:des -mask='?a?a?l?u?d?d?s?s' des
Device 1: GeForce GT 1030
Using default input encoding: UTF-8
Loaded 1 password hash (descrypt-opencl, traditional crypt(3) [DES OpenCL])
LWS=64 GWS=16384
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:30 0.16% (ETA: 01:55:06) 0g/s 35521Kp/s 35521Kc/s 35521KC/s 2aJ14_... 2aJ14_.

@solardiz
Member Author

Thanks. So the fix here appears to be working right, and there's a separate issue with lm-opencl having much worse performance for ?a?a?a?a?a?a?a?a than it does for our benchmark mask. Per my testing yesterday, the all-?a mask (which I also tried) with lm-opencl results in a multiplier of 95, vs. 676 for the benchmark mask. Maybe 95 is too small, or maybe the tuned GWS is too high. Speaking of the latter, can you try forcing LWS=192 GWS=65536 like this:

LWS=192 GWS=65536 ~/src/debian/john/bleeding-jumbo/run/john -format=LM-opencl -session:lm1 -mask='?a?a?a?a?a?a?a?a' lm1

FWIW, in my testing on Titan X Maxwell, these two masks had similar speeds - IIRC, about 4400M for the benchmark mask, and about 4500M for the all-?a mask - with this difference actually making sense if we consider 676/704 = ~96% vs. 95/96 = ~99%.

@magnumripper
Member

@magnumripper BTW, shouldn't we be rejecting attempted use of a mask longer than the format's maximum?

Current behavior is to silently truncate any too-long mask (or silently stretch a too-short one, where applicable). To warn about or even refuse such masks, we'd need to know that the mask was explicitly given (as opposed to a default one). Maybe that's trivial, maybe not. If you think we really should, please open an issue.

@magnumripper
Member

Per my testing yesterday, the all-?a mask (which I also tried) with lm-opencl results in a multiplier of 95, vs. 676 for the benchmark mask.

Please note that the "mask multiplier target" depends on GPU horsepower. For LM it's

	mask_int_cand_target = opencl_speed_index(gpu_id) / 300;

...where the "speed index" is basically the device's clock multiplied by the number of cores and SIMD width. You can see it with --list=opencl-devices. For a 2080 Ti (4352 CUDA cores, 1545 MHz) the figure is 6723840, so mask_int_cand_target ends up being 22412, while weaker devices might end up with a much lower figure such as 1680 (AMD Radeon HD 6700).

The speed index is really a best-effort figure: sometimes we only know the number of "Parallel compute cores" but not the number of "stream processors" per such core (I think we got it right for most AMD and NVIDIA GPUs, though), and this makes the figure too low. So it's far from perfect, but it's the best we can do (I'm open to ideas).

Anyway, whatever the "mask multiplier target" ends up being is then used with some kind of fuzz, so for a complete mask of "?a?a?a?a?a?a?a" we might pick "?a?a" even for a target much lower than 9025, but at some point we do pick a single "?a". That's Sayantan's code and I don't know the details of it.
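
For illustration, the arithmetic behind those figures (assuming the 2080 Ti speed index is simply CUDA cores times clock in MHz, and that ?a covers 95 characters):

        /* Illustrative only, not the actual opencl_speed_index() code */
        int speed_index = 4352 * 1545;   /* 2080 Ti: = 6723840 */
        int target = speed_index / 300;  /* = 22412 */
        /* An internal mask of "?a?a" gives a multiplier of 95 * 95 = 9025;
           a single "?a" gives 95 */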

@solardiz
Member Author

	const int count = mask_mode ?
		*pcount : (*pcount + LM_DEPTH - 1) >> LM_LOG_DEPTH;

That line is dead code - count is never used.

I was wrong - count is used. Anyway, this isn't the only place in that file where the multiplication/division by bitslice depth is conditional upon mask mode.

solardiz added a commit that referenced this issue Nov 18, 2021
@fanto666

Not sure if you are still interested, but here you are.
BTW, with different runs at different times I often get different guesses for LWS/GWS (even with the same parameters), but similar results.

% env LWS=192 GWS=65536 ~/src/debian/john/bleeding-jumbo/run/john -format=LM-opencl -session:lm1 -mask='?a?a?a?a?a?a?a?a' lm1
Device 1: GeForce GT 1030
Using default input encoding: UTF-8
Using default target encoding: CP850
Loaded 2 password hashes with no different salts (LM-opencl [DES BS OpenCL])
LWS=192 GWS=65536
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:22 0.05% (ETA: 06:21:40) 0g/s 161960Kp/s 161960Kc/s 330104KC/s ANV3U1A..ANV3U1A

@solardiz
Member Author

FWIW, in my testing on Titan X Maxwell, these two masks had similar speeds - IIRC, about 4400M for the benchmark mask, and about 4500M for the all-?a mask - with this difference actually making sense if we consider 676/704 = ~96% vs. 95/96 = ~99%.

I might not have recalled correctly. I thought those figures were what I saw with the debugging printf added like I mentioned in
#4871 (comment) - however, now I see I was getting much larger "mask factor" values in the log file - or maybe there's a reason for the discrepancy between what I was printing there and what we're logging...
