LM-opencl benchmark much slower than actual cracking #4381

Open
solardiz opened this issue Sep 29, 2020 · 1 comment
Labels
RFC / discussion Help or comments wanted

Comments

@solardiz

Default benchmark:

$ john -te -form=lm-opencl
Device 1: Tesla V100-SXM2-16GB
Benchmarking: LM-opencl [DES BS OpenCL/mask accel]... LWS=128 GWS=131072 DONE
Raw:    6224M c/s real, 6025M c/s virtual

Different mask:

$ john -te -form=lm-opencl -mask='?a?a?a?a?a?a?a'
Device 1: Tesla V100-SXM2-16GB
Benchmarking: LM-opencl (length 7) [DES BS OpenCL/mask accel]... LWS=128 GWS=524288 DONE
Raw:    8057M c/s real, 7455M c/s virtual

Also longer benchmark (didn't make a difference):

$ john -te=60 -form=lm-opencl -mask='?a?a?a?a?a?a?a'
Device 1: Tesla V100-SXM2-16GB
Benchmarking: LM-opencl (length 7) [DES BS OpenCL/mask accel]... LWS=128 GWS=524288 DONE
Raw:    8041M c/s real, 7480M c/s virtual

Actual cracking:

$ john sample-hashes-windows -form=lm-opencl -mask='?a' -min-len=7 -max-len=7
Device 1: Tesla V100-SXM2-16GB
Using default input encoding: UTF-8
Using default target encoding: CP850
Loaded 2996 password hashes with no different salts (LM-opencl [DES BS OpenCL])
Remaining 254 password hashes with no different salts
LWS=128 GWS=524288
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:10 0.20% (ETA: 16:47:27) 0g/s 1497Mp/s 1497Mc/s 11068046TC/s AAY=-0A
0g 0:00:00:20 2.11% (ETA: 15:40:23) 0g/s 7862Mp/s 7862Mc/s 2767010TC/s AA?/_FE
0g 0:00:00:31 4.22% (ETA: 15:36:49) 0g/s 10145Mp/s 10145Mc/s 1190110TC/s AAZ4R^1
0g 0:00:00:43 6.50% (ETA: 15:35:37) 0g/s 11261Mp/s 11261Mc/s 9437866TC/s AAL!@ZO
0g 0:00:00:51 8.05% (ETA: 15:35:09) 0g/s 11746Mp/s 11746Mc/s 13021228TC/s AAV01!N
0g 0:00:01:00 9.75% (ETA: 15:34:51) 0g/s 12106Mp/s 12106Mc/s 15679730TC/s AA$UJ=R
0g 0:00:01:12 12.03% (ETA: 15:34:34) 0g/s 12446Mp/s 12446Mc/s 18190537TC/s AA5*>4S
Session aborted

So even with 254 hashes loaded, actual cracking was much faster than the benchmark with the same mask and the same run time: about 12.1G c/s after 60 seconds, versus about 8.0G c/s in the 60-second benchmark. (Somehow the speed was poor early on, and the reported average kept growing. In fact, the average speed would be even higher for a longer run.)
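To see how much of the gap comes from the slow start, the status lines can be turned into per-interval speeds. This is only a rough sketch: it assumes the Mp/s figure on each status line is a cumulative average since session start (which the steadily growing numbers suggest), and it treats the elapsed times as exact even though they are rounded to whole seconds. The helper names `parse_status` and `interval_speeds` are made up for illustration.

```python
import re

# Status lines copied from the cracking run above.
STATUS_LINES = """\
0g 0:00:00:10 0.20% (ETA: 16:47:27) 0g/s 1497Mp/s 1497Mc/s 11068046TC/s AAY=-0A
0g 0:00:00:20 2.11% (ETA: 15:40:23) 0g/s 7862Mp/s 7862Mc/s 2767010TC/s AA?/_FE
0g 0:00:00:31 4.22% (ETA: 15:36:49) 0g/s 10145Mp/s 10145Mc/s 1190110TC/s AAZ4R^1
0g 0:00:00:43 6.50% (ETA: 15:35:37) 0g/s 11261Mp/s 11261Mc/s 9437866TC/s AAL!@ZO
0g 0:00:00:51 8.05% (ETA: 15:35:09) 0g/s 11746Mp/s 11746Mc/s 13021228TC/s AAV01!N
0g 0:00:01:00 9.75% (ETA: 15:34:51) 0g/s 12106Mp/s 12106Mc/s 15679730TC/s AA$UJ=R
0g 0:00:01:12 12.03% (ETA: 15:34:34) 0g/s 12446Mp/s 12446Mc/s 18190537TC/s AA5*>4S
"""

# Elapsed time is D:HH:MM:SS; the speed of interest is the Mp/s field.
STATUS_RE = re.compile(r"^\d+g (\d+):(\d\d):(\d\d):(\d\d) .*? (\d+)Mp/s")


def parse_status(line):
    """Return (elapsed_seconds, average_Mp/s) for one status line."""
    m = STATUS_RE.match(line)
    if m is None:
        raise ValueError("not a status line: %r" % line)
    days, hours, minutes, seconds, mps = map(int, m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds, mps


def interval_speeds(points):
    """Convert cumulative averages into (start, end, Mp/s) per interval.

    Assumption: each reported speed is the average since session start.
    """
    out = []
    prev_t, prev_total = 0, 0
    for t, avg in points:
        total = t * avg  # candidates tried so far, in millions
        out.append((prev_t, t, (total - prev_total) / (t - prev_t)))
        prev_t, prev_total = t, total
    return out


POINTS = [parse_status(line) for line in STATUS_LINES.splitlines()]

if __name__ == "__main__":
    for start, end, mps in interval_speeds(POINTS):
        print("%3d..%3ds: %6.0f Mp/s" % (start, end, mps))
```

Under those assumptions, every interval after the first 10 seconds works out to roughly 14.1G to 14.4G p/s, so the low early averages would come almost entirely from the first 10 seconds of the run.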

Checking nvidia-smi, I see that GPU utilization is somewhat low during actual cracking (around 75%) and even lower during the benchmark (even after auto-tuning completes, it fluctuates between 0% and 80%, with the average perhaps around 40%).

The lower GPU utilization during benchmark explains the speed difference, but I am puzzled why the utilization is lower. We could also look into and improve GPU utilization during actual cracking, and switch to a more suitable default mask for benchmarks.
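To put numbers on the utilization instead of eyeballing it, something like the following could be run alongside john. It is a minimal sketch, not part of john: it polls `nvidia-smi --query-gpu=utilization.gpu` and reports min, mean, and max. The function names, the 30-second duration, and the 0.5-second polling interval are arbitrary choices for illustration.

```python
import subprocess
import time


def parse_utilization(output):
    """Parse nvidia-smi CSV output: one integer percentage per line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def sample_utilization(seconds=30, interval=0.5, gpu_index=0):
    """Poll GPU utilization for `seconds`; returns a list of percentages."""
    samples = []
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["nvidia-smi", "-i", str(gpu_index),
             "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        samples.extend(parse_utilization(out))
        time.sleep(interval)
    return samples


def summarize(samples):
    """Return (min, mean, max) of the sampled percentages."""
    return min(samples), sum(samples) / len(samples), max(samples)


if __name__ == "__main__":
    # Start john in another terminal first, then run this script.
    low, mean, high = summarize(sample_utilization())
    print("GPU utilization: min %d%%, mean %.1f%%, max %d%%" % (low, mean, high))
```

Running it once during the benchmark (after auto-tuning) and once during actual cracking would give comparable averages. Note that nvidia-smi itself averages over a short sample period, so polling faster than that period is unlikely to reveal more detail.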

@solardiz solardiz added the RFC / discussion Help or comments wanted label Sep 29, 2020
@solardiz

solardiz commented Sep 29, 2020

> switch to a more suitable default mask for benchmarks.

OTOH, that different mask usually (but not always) makes NT-opencl auto-tune to an unreasonably high GWS, which hurts a lot:

$ john -te -form=nt-opencl
Device 1: Tesla V100-SXM2-16GB
Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... LWS=128 GWS=40960 (320 blocks) x24700 DONE
Raw:    34558M c/s real, 34558M c/s virtual

$ john -te -form=nt-opencl -mask='?a?a?a?a?a?a?a'
Device 1: Tesla V100-SXM2-16GB
Benchmarking: NT-opencl (length 7) [MD4 OpenCL/mask accel]... LWS=128 GWS=655360 (5120 blocks) x9025 DONE
Raw:    9434M c/s real, 9389M c/s virtual

$ GWS=40960 john -te -form=nt-opencl -mask='?a?a?a?a?a?a?a'
Device 1: Tesla V100-SXM2-16GB
Benchmarking: NT-opencl (length 7) [MD4 OpenCL/mask accel]... LWS=128 GWS=40960 (320 blocks) x9025 DONE
Raw:    33839M c/s real, 33839M c/s virtual

Maybe there's an auto-tuning shortcoming for us to fix there.
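To check where auto-tuning goes wrong, the `GWS=` override used above can be swept over several values and the benchmark speeds compared. The following is a rough sketch, not an existing john tool: the function names and the list of candidate GWS values are hypothetical, and the parser handles only the `Raw: ...M c/s real` form seen in the output above.

```python
import os
import re
import subprocess

# Matches e.g. "Raw:    33839M c/s real, 33839M c/s virtual".
RAW_RE = re.compile(r"^Raw:\s+(\d+)M c/s real", re.MULTILINE)


def parse_raw_mcs(output):
    """Extract the 'real' benchmark speed in M c/s from john's output."""
    m = RAW_RE.search(output)
    if m is None:
        raise ValueError("no 'Raw: ...M c/s real' line found")
    return int(m.group(1))


def benchmark(gws, fmt="nt-opencl", mask="?a?a?a?a?a?a?a"):
    """Run the same benchmark command as above with GWS forced via env."""
    env = dict(os.environ, GWS=str(gws))
    out = subprocess.run(
        ["john", "-te", "-form=" + fmt, "-mask=" + mask],
        env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        text=True, check=True,
    ).stdout
    return parse_raw_mcs(out)


def best_gws(candidates):
    """Benchmark each candidate GWS; return (best_gws, {gws: M c/s})."""
    results = {gws: benchmark(gws) for gws in candidates}
    return max(results, key=results.get), results


if __name__ == "__main__":
    # Hypothetical candidates spanning the two auto-tuned values seen above.
    best, results = best_gws([40960, 81920, 163840, 327680, 655360])
    for gws, mcs in sorted(results.items()):
        print("GWS=%-7d %6dM c/s" % (gws, mcs))
    print("best GWS:", best)
```

If the speed drops off between 40960 and 655360, that would show at which point the larger GWS starts to hurt and help narrow down what the auto-tuner is measuring incorrectly.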
