Add FMT_UTF16 format flag #1631
Comments
See ML discussion: http://www.openwall.com/lists/john-dev/2015/08/09/4 |
I thought everything in jtr was UTF-16 (actually UCS2 in many places). Why the UTF-32 ? |
UTF-32 is the only sane choice. We definitely don't want to handle UTF-16 surrogates within incremental or rules, that's almost as bad as actually processing UTF-8. The UTF-32 -> UTF-16 conversion (eg. in NT format) will be very easy and very fast. If we (optionally) just support UCS-2 for speed, it's merely an int->short downcast similar to the original char->short upcast in speed. |
I thought everything in jtr was UTF-16 (actually UCS2 in many places).
Why the UTF-32 ?
Word lists which mostly contain ascii would also get 4 times as large
instead of just 2 times.
|
Only internally, within the memory footprint of john. The external file would be the same. I am still not 100% sold on everything running as UTF-32. I find that to be wasteful overkill if the file really is just 7-bit ASCII, or 99% 7-bit ASCII. |
We could try to fit both 32-bit and 8-bit code in there simultaneously but it will be significantly harder and a LOT messier. I think the current code stretches the "don't go 32-bit" as far as sensible but the next step is to bite the bullet and go all in. |
We will still need some of the code page support, but hopefully it will be far less encompassing, and will be localized in the code for wordlist.c or the loader. We will only need character mapping, NOT any of the classification or casing logic. But adding mapping may be VERY easy, at least for code pages supported by Perl. Here is some code I wrote to grab the UTF-8 bytes from code pages supported by Perl:
#!/usr/bin/perl
use Encode;
my $s;
my $i = 0;
for (; $i < 0x100; ++$i) {
$s = chr($i);
my $cp = decode($ARGV[0], $s);
my $final = encode('UTF-8', $cp);
# do not print the character if it is a control char (ord <= 31)
if (defined($final) && length($final)>0 && ord($final) > 31) {
# print final with a separate print statement. This is due to
# EBCDIC, since \n is NOT in slot 0x0a. If we concatenate
# \n onto the final string and it is an EBCDIC string, then
# we will NOT get a split file with one character on each line.
print $final;
print "\n";
}
}
print STDERR "code page $ARGV[0] handled\n";

And here are 'most' of the code pages handled by Perl (including the EBCDIC ones, which are a bit ugly since \n is not where you think it should be, lol):
#!/bin/sh
rm -f cp
rm -f cp-chars-all
./cpgen.pl cp37 >> cp
./cpgen.pl cp424 >> cp
./cpgen.pl cp437 >> cp
./cpgen.pl cp500 >> cp
./cpgen.pl cp737 >> cp
./cpgen.pl cp775 >> cp
./cpgen.pl cp850 >> cp
./cpgen.pl cp852 >> cp
./cpgen.pl cp855 >> cp
./cpgen.pl cp856 >> cp
./cpgen.pl cp857 >> cp
./cpgen.pl cp858 >> cp
./cpgen.pl cp860 >> cp
./cpgen.pl cp861 >> cp
./cpgen.pl cp862 >> cp
./cpgen.pl cp863 >> cp
./cpgen.pl cp864 >> cp
./cpgen.pl cp865 >> cp
./cpgen.pl cp866 >> cp
./cpgen.pl cp869 >> cp
./cpgen.pl cp874 >> cp
./cpgen.pl cp875 >> cp
./cpgen.pl cp932 >> cp
./cpgen.pl cp936 >> cp
./cpgen.pl cp949 >> cp
./cpgen.pl cp950 >> cp
./cpgen.pl cp1006 >> cp
./cpgen.pl cp1026 >> cp
./cpgen.pl cp1047 >> cp
./cpgen.pl cp1250 >> cp
./cpgen.pl cp1251 >> cp
./cpgen.pl cp1252 >> cp
./cpgen.pl cp1253 >> cp
./cpgen.pl cp1254 >> cp
./cpgen.pl cp1255 >> cp
./cpgen.pl cp1256 >> cp
./cpgen.pl cp1257 >> cp
./cpgen.pl cp1258 >> cp
./cpgen.pl iso-8859-1 >> cp
./cpgen.pl iso-8859-2 >> cp
./cpgen.pl iso-8859-3 >> cp
./cpgen.pl iso-8859-4 >> cp
./cpgen.pl iso-8859-5 >> cp
./cpgen.pl iso-8859-6 >> cp
./cpgen.pl iso-8859-7 >> cp
./cpgen.pl iso-8859-8 >> cp
./cpgen.pl iso-8859-9 >> cp
./cpgen.pl iso-8859-10 >> cp
./cpgen.pl iso-8859-11 >> cp
./cpgen.pl iso-8859-13 >> cp
./cpgen.pl iso-8859-14 >> cp
./cpgen.pl iso-8859-15 >> cp
./cpgen.pl iso-8859-16 >> cp
./cpgen.pl ascii >> cp
./cpgen.pl US-ascii >> cp
./cpgen.pl ISO-646-US >> cp
./cpgen.pl ISO-646 >> cp
./cpgen.pl ascii-ctrl >> cp
./cpgen.pl latin1 >> cp
./cpgen.pl AdobeStandardEncoding >> cp
./cpgen.pl MacRoman >> cp
./cpgen.pl nextstep >> cp
./cpgen.pl hp-roman8 >> cp
./cpgen.pl MacCentralEurRoman >> cp
./cpgen.pl MacCroatian >> cp
./cpgen.pl MacRomanian >> cp
./cpgen.pl MacRumanian >> cp
./cpgen.pl Latin3 >> cp
./cpgen.pl Latin4 >> cp
./cpgen.pl MacCyrillic >> cp
./cpgen.pl MacUkrainian >> cp
./cpgen.pl Arabic >> cp
./cpgen.pl MacArabic >> cp
./cpgen.pl MacFarsi >> cp
./cpgen.pl Greek >> cp
./cpgen.pl MacGreek >> cp
./cpgen.pl Hebrew >> cp
./cpgen.pl MacHebrew >> cp
./cpgen.pl MacTurkish >> cp
./cpgen.pl MacIcelandic >> cp
./cpgen.pl MacSami >> cp
./cpgen.pl Thai >> cp
./cpgen.pl MacThai >> cp
./cpgen.pl Latin9 >> cp
./cpgen.pl Latin10 >> cp
./cpgen.pl viscii >> cp
./cpgen.pl koi8-f >> cp
./cpgen.pl koi8-r >> cp
./cpgen.pl koi8-u >> cp
./cpgen.pl gsm0338 >> cp
./cpgen.pl euc-cn >> cp
./cpgen.pl gbk >> cp
./cpgen.pl gb12345-raw >> cp
./cpgen.pl gb2312-raw >> cp
./cpgen.pl hz >> cp
./cpgen.pl iso-ir-165 >> cp
./cpgen.pl euc-jp >> cp
./cpgen.pl shiftjis >> cp
./cpgen.pl macJapanese >> cp
./cpgen.pl 7bit-jis >> cp
./cpgen.pl iso-2022-jp >> cp
./cpgen.pl iso-2022-jp-1 >> cp
./cpgen.pl jis0201-raw >> cp
./cpgen.pl jis0208-raw >> cp
./cpgen.pl jis0212-raw >> cp
./cpgen.pl euc-kr >> cp
./cpgen.pl iso-2022-kr >> cp
./cpgen.pl johab >> cp
./cpgen.pl ksc5601-raw >> cp
./cpgen.pl big5-eten >> cp
./cpgen.pl MacChineseTrad >> cp
./cpgen.pl big5 >> cp
./cpgen.pl big5-hkscs >> cp
./cpgen.pl posix-bc >> cp
./cpgen.pl symbol >> cp
./cpgen.pl dingbats >> cp
./cpgen.pl MacDingbats >> cp
./cpgen.pl AdobeZdingbat >> cp
./cpgen.pl AdobeSymbol >> cp
./cpgen.pl GB2312 >> cp
./cpgen.pl macarabic >> cp
./cpgen.pl macgreek >> cp
./cpgen.pl machebrew >> cp
./cpgen.pl macthai >> cp
./cpgen.pl macturkish >> cp
./cpgen.pl macjapanese >> cp
./cpgen.pl mackorean >> cp
./cpgen.pl Cyrillic >> cp
./cpgen.pl macCyrillic >> cp
./cpgen.pl ISO-8859-8 >> cp
./cpgen.pl macThai >> cp
./cpgen.pl US-ASCII >> cp
./cpgen.pl Shift_JIS >> cp
./cpgen.pl EUC-JP >> cp
./cpgen.pl ISO-2022-JP >> cp
./cpgen.pl ISO-2022-JP-1 >> cp
./cpgen.pl EUC-KR >> cp
./cpgen.pl Big5 >> cp
./cpgen.pl GB_2312-80 >> cp
./cpgen.pl EUC-CN >> cp
./cpgen.pl KOI8-U >> cp
./cpgen.pl KOI8-r >> cp
./cpgen.pl KS_C_5601-1987 >> cp
./cpgen.pl ISO-IR-165 >> cp
./cpgen.pl VISCII >> cp
./cpgen.pl UHC >> cp
./cpgen.pl x-windows-949 >> cp
./cpgen.pl GBK >> cp
./cpgen.pl SJIS >> cp
./cpgen.pl CP932 >> cp
./cpgen.pl Windows-31J >> cp
./cpgen.pl Symbol >> cp
run/unique -inp=cp cp-chars-all |
I'm not quite following what you did with that perl script. The code page support will be needed for reading files, and for any target encoding used. Example: UTF-8 wordlist -> rules -> filters -> crk_set_key() -> LM format set_key() (using a codepage). The UTF-8 will be converted to UTF-32 in wordlist.c, then stay UTF-32 all the way until cracker.c is about to call the format's set_key(). Just before that, it needs to convert to e.g. CP850. Note that in this very case, the current code is probably much more efficient. |
We have to handle the case that the target code page doesn't include characters that might be read from a wordlist. OTOH, we do have that problem right now as well. How is that handled? |
That is an issue now and will be no matter how we re-write this. It does, and always will, result in garbage candidates. If we at all support reading non-UTF-8 wordlists, the same applies there: what if it's written in an unsupported codepage like ArmSCII-8? Like today, we'd need to run transparent. A sensible way of handling it would be "convert assuming ISO-8859-1", "process as ASCII" [i.e. ignore non-ASCII when case-toggling and so on] and "convert back assuming ISO-8859-1". This will work pretty much the same as the current code. |
?? I would think forcing users to use iconv to put their wordlists into UTF-8 would be one option, but I bet users would NOT like that much. The main problem is 'shit' wordlists that are a hodgepodge of mixed character sets. Those dirty wordlists still abound around the net, and people like to use them. Yes, if every word in there was properly converted to UTF-8, then wow, it would be SO much better. But how do we help get from point dirty to point utf8-clean? Or do we simply not care, and tell users that the wordlists need to be in UTF-8 so there is no ambiguity within JtR? |
I think we should support it, but we could opt not to support it. For mixed-encodings wordlists, that "transparent mode" concept is mandatory. However, it won't work well for Unicode hashes like NT. It never has and never will, in any cracker - it simply can't. BTW, a problem with "transparent mode" is your pot file entries will not be UTF-8. Ideally we should have a field in the .pot file stating this is the case, and for such entries -show would print eg.
That's "Müller" in CP850. |
I'm digressing now, but ideally the pot file format would always be
That would work with tabs, colons and whatever, and with any encoding including transparent [== raw == unknown] encoding. It would always be totally reproducible. |
We may want to have the .pot file enhanced if we allow anything OTHER than utf8 to be stored in it. The enhancement would be to somehow deliver across information about the encoding. Lol, you beat me to the punch |
I have considered various ideas over time:
|
I actually REALLY like that .pot (.pv2 ?) file layout. It is absolutely concise. If we changed it to this: hash : \t then we would not have to change the .pot file at all. We could still read the legacy stuff, but if the line contains |
For transparent mode, the plain hex is the only thing that will work. |
I would really poo-poo XML. Yes, you can do anything with it, BUT it is slow, and it is a nightmare for stuff like this, where quoting would have to be done all over the place. CSV is no better (we already are CSV, but with 2 fields, and the separator being the 'first' ':' seen on the line). |
Also, consider the case where a user fed a CP123 wordlist but stated CP234, and by coincidence some LM hashes were cracked. That plaintext-as-hex will show what the password ACTUALLY was, no matter that the encoding as recorded was incorrect (and, as a result, a Unicode print from -show will be incorrect too). Yes, I hate XML. And the current format is just a variation of CSV. |
Should these last few comments be moved to their own topic, possibly as an RFC type? |
We should probably have discussed all this on john-dev instead... |
I wish john-dev was a web forum, with GitHub markup 😄 |
It is not bad to get things pie in the sky talked about offline, then bring to john-dev, but usually only crickets chirp there. |
I hate the email lists. So difficult to find anything. Yes, GitHub is also a cluster for trying to find old stuff, BUT for doing 'hot in the trenches' stuff, it makes following along very easy. |
This new flag should be named FMT_UTF32 and not only |
I think this also obsoletes both FMT_UNICODE and FMT_UTF8 flags. |
So, if set_key can be either of static void set_key(char *key, int index); and static void set_key(UTF32 *key, int index); what is the canonical way to declare it? I see at least two solutions. One is to declare it as a void pointer in formats.h: static void set_key(void *key, int index); and the other is to actually add a second function to the struct: static void set_key(char *key, int index);
static void set_key32(UTF32 *key, int index); If we go with the latter, any format will probably have one of them NULL and the other one defined. Actually that would mean we don't strictly need the FMT_UTF32 flag; we could just test whether set_key32 is non-NULL. |
On second thought: With the proposed changes, we'd detect in cracker.c that the target encoding can't hold the needed characters - so we could reject the candidate and never send it to the format. This is a good thing. |
I think we need to step back a bit. We might want to totally rethink the interface. It may be time to provide an interface where the conversion code dumps data right into the key space (on set_key). get_key() IMHO is not hot code: finding a crack is NOT expected, it is an exception. But setting keys IS hot code.
Switch to a real language 😈 where issues like this are simply handled by the compiler. |
For example: within a converter (that marshals data from the crack type into the format), the format during init() would simply make 'setter' type calls into the converter, providing it with all the information that is required to properly load the data (number of buffers, layout of buffers, size of buffers, byte ordering of buffers, whether to MD the buffers, provided alignment, buffer cleaning required, etc). We would likely still have to provide the existing interface, where one word at a time is sent to the format to load into its memory layout, for formats which have complex (or extremely simple) data layouts. Formats that simply work on a single password at a time could likely keep the existing naive set_key() layout. |
No, but it's logical to return a key in the same format you got it. And it simplifies format code; the conversion back to UTF-8 would be in cracker.c instead of duplicated in all Unicode formats. |
BTW, for putting code where it's most effective, we should actually have it as FMT_UTF16. This means set_key() and get_key() will set/get UTF-16 with surrogates - BUT all of core should still have it as UTF-32. This will mean really simple code in formats. On another note, while doing all this we'll ensure the key delivered from cracker.c is aligned properly for any use (i.e. aligned to 8) no matter whether it's char* or UTF16*. |
Real code doesn't need a compiler, only an assembler 😈 |
Why an assembler? http://www.catb.org/jargon/html/story-of-mel.html |
This flag indicates that set_key() is to take a UTF16* (originally proposed as UTF32*) instead of a char* pointer to the key. Formats like NT and Oracle will set it. For e.g. Incremental mode using UTF-32 internally, this means we don't have to go through (slow) UTF-8 conversion just to satisfy the legacy set_key() prototype.