LM-opencl benchmark reports unrealistic speeds on small devices #4871
These all happened on a Debian 11 amd64 installation with Debian-provided packages: john was compiled with gcc-10, and no foreign development packages were installed.
---

I am able to reproduce this with Intel's OpenCL using a CPU or Intel HD Graphics. With this debugging patch:

```diff
+++ b/src/opencl_lm_b_plug.c
@@ -1295,6 +1295,7 @@ static int lm_crypt(int *pcount, struct db_salt *salt)
 		BENCH_CLERROR(clEnqueueWriteBuffer(queue[gpu_id], buffer_hash_ids, CL_TRUE, 0, sizeof(cl_uint), zero_buffer, 0, NULL
 	}
+	printf("%d %d %d\n", mask_mode, *pcount, mask_int_cand.num_int_cand);
 	*pcount *= mask_int_cand.num_int_cand;
```

When running benchmark (with mask), I am getting (on CPU):

[...]
During actual cracking (with the same mask of [...]):

[...]

The most relevant difference appears to be that during benchmark the original [...]

Without mask, benchmark ([...]):

[...]

Actual cracking ([...]):

[...]

Here the multiplication by 32 happens in both cases. While I am still puzzled by what happens with mask, I do have some observations:
Also, this scaling:

```c
	*pcount *= mask_int_cand.num_int_cand;
```

is unconditional in [...]. We have many instances of these in mask-aware formats:

```c
#if SIZEOF_SIZE_T > 4
	/* We can't process more than 4G keys per crypt() */
	while (gws_limit * mask_int_cand.num_int_cand > 0xffffffffUL)
		gws_limit >>= 1;
#endif
```

However, our [...]:

```c
	count = crk_key_index;
	match = crk_methods.crypt_all(&count, salt);
	crk_last_key = count;
	status_update_crypts((uint64_t)salt->count * count, count);
```

However, formally this is UB, and we probably shouldn't allow values this high, or we should update that function's prototype.
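To make the UB concern concrete - this is a hypothetical sketch, not code from the tree: the count is a signed `int`, so letting the mask multiplier push it past `INT_MAX` is signed-integer overflow, which C leaves undefined. A guarded version of the scaling could look like:

```c
#include <limits.h>

/* Hypothetical guard (not John's actual code): refuse to scale when
 * *pcount * num_int_cand would exceed INT_MAX, since signed overflow
 * is undefined behavior in C. Assumes both values are positive. */
static int scale_pcount_checked(int *pcount, int num_int_cand)
{
	if (num_int_cand > 0 && *pcount > INT_MAX / num_int_cand)
		return -1;	/* would overflow; reduce GWS or the multiplier */
	*pcount *= num_int_cand;
	return 0;
}
```

The alternative mentioned above - updating the function's prototype - would mean widening the count to a 64-bit type so that values this high become representable.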
---

In [...]:

```c
	if (db->real && db == db->real) {
		[...]
		if (options.flags & FLG_MASK_CHK) {
			mask_mode = 1;
```

Perhaps that's wrong.
---

This fixes the issue for me:

```diff
+++ b/src/opencl_lm_b_plug.c
@@ -1158,7 +1158,7 @@ static char *get_key_mm(int index)
 static void reset(struct db_main *db)
 {
-	if (db->real && db == db->real) {
+	if (!self_test_running) {
 		struct db_salt *salt;
 		unsigned int *bitmaps = NULL;
 		OFFSET_TABLE_WORD *offset_table = NULL;
```

@fanto666 Can you test this patch, please? I am still puzzled about the benchmark results looking realistic on larger devices. We'll need to also see the effect of the above change on those.
---

On Titan X Maxwell, this changes the benchmark speed from ~6000M to ~4400M. The latter is similar to what I'm getting in actual cracking with mask.
---

I see no immediate problem with that change, but it doesn't address some of the other weirdnesses. This is particularly strange:

```c
	const int count = mask_mode ?
		*pcount : (*pcount + LM_DEPTH - 1) >> LM_LOG_DEPTH;
```

Why would [...]?
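For reference on the non-mask branch: assuming `LM_DEPTH` is the bitslice width of 32 (as the multiples-of-32 discussion in this thread suggests) and `LM_LOG_DEPTH` is 5, `(*pcount + LM_DEPTH - 1) >> LM_LOG_DEPTH` is the usual ceiling-division idiom, mapping a key count to the number of 32-wide bitslice groups. A minimal illustration with an example key count:

```c
#include <stdio.h>

/* Values assumed from the discussion above: bitslice depth of 32 */
#define LM_DEPTH	32
#define LM_LOG_DEPTH	5

int main(void)
{
	int pcount = 1000;
	/* Ceiling division: ceil(1000 / 32) = 32 groups of 32 keys */
	int count = (pcount + LM_DEPTH - 1) >> LM_LOG_DEPTH;

	printf("%d keys -> %d bitslice groups\n", pcount, count);
	return 0;
}
```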
---

Of course, the code is weird and lacks comments.

That line is dead code - there's no code to run non-bitsliced. However, apparently when running with a mask, the bit depths are filled using the mask multiplier. This probably means we have an efficiency loss when the mask multiplier isn't a multiple of 32 (which it usually isn't). For example, for the 676 seen in our default benchmark mask, the actual number of hashes computed is probably 704, and if so, 28 hash computations, or almost 4% of the total, are wasted. I didn't verify this, and I don't recall past discussions of it - but it's the only plausible explanation I have of what I saw in @sayan1an's host code. Despite this wastage, it might be the most efficient way to implement mask in there (considering locality of reference).
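If the multiplier is indeed padded to the bitslice depth, the ~4% figure checks out arithmetically; here is a quick sketch of that calculation (the padding itself is the unverified assumption stated above):

```c
#include <stdio.h>

#define LM_DEPTH 32

int main(void)
{
	int mult = 676;	/* mask multiplier from the default benchmark mask */
	/* Assumed rounding up to the next multiple of the bitslice depth */
	int padded = (mult + LM_DEPTH - 1) / LM_DEPTH * LM_DEPTH;

	printf("padded = %d, wasted = %d (%.2f%% of total)\n",
	       padded, padded - mult, 100.0 * (padded - mult) / padded);
	/* Prints: padded = 704, wasted = 28 (3.98% of total) */
	return 0;
}
```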
---

Retried with the one-line patch from [...]:

```
% ~/src/debian/john/bleeding-jumbo/run/john -test -format=LM-opencl
[...]
```

Breaking a random hash gives different results:

```
% ~/src/debian/john/bleeding-jumbo/run/john -format=LM-opencl -session:lm1 -mask='?a?a?a?a?a?a?a?a?a?a?a?a?a?a' lm1
[...]
```

I can try on other GPUs if needed.
---

Thanks @fanto666. This remaining discrepancy is weird, yet the [...]

BTW, there's no point in using a mask longer than 7 with LM hashes, since we're cracking their halves. Can you try with a length 7 mask? In particular, can you try with [...]?

Also, can you try actual cracking with [...]?

Overall, I think my fix is right and it has worked, and there's something else causing the remaining discrepancy.
---

@magnumripper BTW, shouldn't we be rejecting attempted use of a mask longer than the format's maximum?
---

```
% ~/src/debian/john/bleeding-jumbo/run/john -format=LM-opencl -session:lm1 -mask='?a?a?a?a?a?a?a' lm1
[...]
% ~/src/debian/john/bleeding-jumbo/run/john -format=LM-opencl -session:lm1 -mask='?a?a?l?u?d?d?s?s' lm1
[...]
% ~/src/debian/john/bleeding-jumbo/run/john -format=LM-opencl -session:lm1 -mask='?a?a?a?a?a?a?a?a' lm1
[...]
```
---

For DES I get about the same numbers:

```
% ~/src/debian/john/bleeding-jumbo/run/john -format=descrypt-opencl -session:des -mask='?a?a?a?a?a?a?a?a' des
[...]
uhlar@fhome% ~/src/debian/john/bleeding-jumbo/run/john -format=descrypt-opencl -session:des -mask='?a?a?a?a?a?a?a' des
[...]
% ~/src/debian/john/bleeding-jumbo/run/john -format=descrypt-opencl -session:des -mask='?a?a?l?u?d?d?s?s' des
[...]
```
---

Thanks. So the fix here appears to be working right, and there's a separate issue with [...]

FWIW, in my testing on Titan X Maxwell, these two masks had similar speeds - IIRC, about 4400M for the benchmark mask and about 4500M for the all-?a mask - with this difference actually making sense if we consider 676/704 = ~96% vs. 95/96 = ~99%.
---

Current behavior is to silently truncate any too-long mask (or silently stretch a too-short one, where applicable). To warn about or even refuse them, we need to know that it's an explicitly given one (as opposed to a default one). Maybe that's trivial, maybe not. If you think we really should, please open an issue.

Please note that the "mask multiplier target" depends on GPU horsepower. For LM it's:

```c
	mask_int_cand_target = opencl_speed_index(gpu_id) / 300;
```

...where the "speed index" is basically the device's clock multiplied by the number of cores and the SIMD width. You can see it with [...]

The speed index is really a best-effort figure: sometimes we only know the number of "parallel compute cores" but not the number of "stream processors" per such core (I think we got it right for most AMD and NVIDIA GPUs, though), and this will make the figure too low. So it's not perfect by far, but it's the best we can do (I'm open to ideas).

Anyway, whatever the "mask multiplier target" ends up being is then used with some kind of fuzz, so for a complete mask of "?a?a?a?a?a?a?a" we might pick "?a?a" even for a much lower target than 9025, but at some point we do pick a single "?a". That's Sayantan's code and I don't know the details of it.
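As a rough sketch of what that description implies (an assumed shape only - the real `opencl_speed_index()` in John's source handles many device-specific quirks):

```c
#include <stdio.h>

/* Sketch only, not the actual opencl_speed_index():
 * clock * cores * SIMD width, per the description above. */
static unsigned int speed_index_sketch(unsigned int clock_mhz,
                                       unsigned int compute_units,
                                       unsigned int simd_width)
{
	return clock_mhz * compute_units * simd_width;
}

int main(void)
{
	/* Hypothetical device numbers, for illustration only */
	unsigned int idx = speed_index_sketch(1000, 24, 128);

	/* The LM format would derive its mask multiplier target from this */
	printf("speed index = %u, mask_int_cand_target = %u\n", idx, idx / 300);
	return 0;
}
```

When the per-core stream processor count isn't known, the effective `simd_width` is underestimated, which is the "figure too low" case described above.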
---

I was wrong - [...]
---

Not sure if you are still interested, but here you are:

```
% env LWS=192 GWS=65536 ~/src/debian/john/bleeding-jumbo/run/john -format=LM-opencl -session:lm1 -mask='?a?a?a?a?a?a?a?a' lm1
[...]
```
---

I might not have recalled correctly. I thought those figures were what I saw with debugging [...]
---

Matus UHLAR reported this to me via private e-mail after I questioned the benchmark results he posted on the wiki (for several small devices, not only NVIDIA):

[...]

Actual cracking:

[...]

So the descrypt-opencl benchmark speed is sane, and lm-opencl's actual cracking speed is even somewhat low for it (could be ~1G, achieved ~160M), but the lm-opencl benchmark speed is insane for this small device (~25G). I am puzzled by this.
This is sort of the opposite of #4381, but I guess the underlying causes of these two issues are independent.