Add FMT_UTF16 format flag #1631

magnumripper · 2015-08-09T22:23:38Z

This flag indicates that set_key() is to take a ~~UTF32*~~ UTF16* instead of a char* pointer to key. Formats like NT and Oracle will set it.

For e.g. Incremental mode, which will use UTF-32 internally, this means we don't have to go through a (slow) UTF-8 conversion just to satisfy the legacy set_key() prototype.


magnumripper · 2015-08-09T22:28:33Z

See ML discussion: http://www.openwall.com/lists/john-dev/2015/08/09/4

jfoug · 2015-08-09T22:35:00Z

I thought everything in jtr was UTF-16 (actually UCS2 in many places). Why the UTF-32 ?

magnumripper · 2015-08-09T23:56:10Z

UTF-32 is the only sane choice. We definitely don't want to handle UTF-16 surrogates within incremental or rules, that's almost as bad as actually processing UTF-8.

The UTF-32 -> UTF-16 conversion (eg. in NT format) will be very easy and very fast. If we (optionally) just support UCS-2 for speed, it's merely an int->short downcast similar to the original char->short upcast in speed.
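A minimal sketch of the two conversions mentioned above, assuming 0-terminated strings. The function names and signatures are hypothetical and are not taken from JtR's code; the first variant emits surrogate pairs, the second is the UCS-2-only downcast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the JtR API): convert 0-terminated UTF-32 to
 * 0-terminated UTF-16. Returns the number of UTF-16 units written (not
 * counting the terminator), or -1 on an invalid code point or lack of room.
 */
static int utf32_to_utf16(const uint32_t *src, uint16_t *dst, size_t dst_units)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dst_units > 0);

    while (*src) {
        uint32_t c = *src++;

        if (c < 0x10000) {
            if (c >= 0xD800 && c <= 0xDFFF)
                return -1;              /* a lone surrogate is not a code point */
            if (n + 1 >= dst_units)
                return -1;              /* keep room for the terminator */
            dst[n++] = (uint16_t)c;
        } else if (c <= 0x10FFFF) {
            if (n + 2 >= dst_units)
                return -1;
            c -= 0x10000;               /* 20 bits left, split 10 + 10 */
            dst[n++] = (uint16_t)(0xD800 | (c >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (c & 0x3FF));
        } else {
            return -1;                  /* beyond the Unicode range */
        }
    }
    dst[n] = 0;
    return (int)n;
}

/*
 * UCS-2-only variant: a plain int->short downcast per character.
 * Anything outside the BMP is rejected instead of being truncated.
 */
static int utf32_to_ucs2(const uint32_t *src, uint16_t *dst, size_t dst_units)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dst_units > 0);

    while (*src) {
        if (*src > 0xFFFF || n + 1 >= dst_units)
            return -1;
        dst[n++] = (uint16_t)*src++;
    }
    dst[n] = 0;
    return (int)n;
}
```

The UCS-2 loop has no per-character branching beyond the range check, which is the speed argument made above; the full UTF-16 path only costs extra for characters outside the BMP.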

frank-dittrich · 2015-08-10T11:13:25Z

> I thought everything in jtr was UTF-16 (actually UCS2 in many places). Why the UTF-32 ?

Word lists which mostly contain ascii would also get 4 times as large instead of just 2 times.

jfoug · 2015-08-10T13:14:57Z

> Word lists which mostly contain ascii would also get 4 times as large instead of just 2 times.

Only internally within the memory footprint of john. The external file would be the same.

I am still not 100% sold on everything running as UTF-32. I find that to be wasteful overkill if the file really is just 7-bit ascii, or 99% 7-bit ascii.

magnumripper · 2015-08-11T11:23:30Z

We could try to fit both 32-bit and 8-bit code in there simultaneously but it will be significantly harder and a LOT messier. I think the current code stretches the "don't go 32-bit" as far as sensible but the next step is to bite the bullet and go all in.

jfoug · 2015-08-11T12:21:45Z

We will still need some of the code page support, but hopefully it will be far less encompassing, and will be localized in the code for wordlist.c or loader. We will only need character mapping, NOT any of the classification or casing logic. Adding mapping may be VERY easy, at least for the code pages Perl knows about. Here is some code I wrote to grab the UTF-8 bytes for the code pages supported by Perl.

#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Usage: cpgen.pl <encoding-name>
# Decode each byte value 0x00..0xff from the given code page and print
# the resulting character as UTF-8, one character per line.
for (my $i = 0; $i < 0x100; ++$i) {
    my $s = chr($i);
    my $cp = decode($ARGV[0], $s);
    my $final = encode('UTF-8', $cp);
    # do not print the character if it is a control char (ord <= 31)
    if (defined($final) && length($final) > 0 && ord($final) > 31) {
        # Print the newline in a separate print statement. This is due to
        # EBCDIC, where \n is NOT in slot 0x0a. If we concatenated \n to
        # the final string and it came from an EBCDIC page, we would NOT
        # get a file split with one character on each line.
        print $final;
        print "\n";
    }
}
print STDERR "code page $ARGV[0] handled\n";

and here are 'most' of the code pages handled by Perl (including the EBCDIC ones, which are a bit ugly since \n is not where you think it should be, lol)

#!/bin/sh
rm -f cp
rm -f cp-chars-all
./cpgen.pl cp37 >> cp
./cpgen.pl cp424 >> cp
./cpgen.pl cp437 >> cp
./cpgen.pl cp500 >> cp
./cpgen.pl cp737 >> cp
./cpgen.pl cp775 >> cp
./cpgen.pl cp850 >> cp
./cpgen.pl cp852 >> cp
./cpgen.pl cp855 >> cp
./cpgen.pl cp856 >> cp
./cpgen.pl cp857 >> cp
./cpgen.pl cp858 >> cp
./cpgen.pl cp860 >> cp
./cpgen.pl cp861 >> cp
./cpgen.pl cp862 >> cp
./cpgen.pl cp863 >> cp
./cpgen.pl cp864 >> cp
./cpgen.pl cp865 >> cp
./cpgen.pl cp866 >> cp
./cpgen.pl cp869 >> cp
./cpgen.pl cp874 >> cp
./cpgen.pl cp875 >> cp
./cpgen.pl cp932 >> cp
./cpgen.pl cp936 >> cp
./cpgen.pl cp949 >> cp
./cpgen.pl cp950 >> cp
./cpgen.pl cp1006 >> cp
./cpgen.pl cp1026 >> cp
./cpgen.pl cp1047 >> cp
./cpgen.pl cp1250 >> cp
./cpgen.pl cp1251 >> cp
./cpgen.pl cp1252 >> cp
./cpgen.pl cp1253 >> cp
./cpgen.pl cp1254 >> cp
./cpgen.pl cp1255 >> cp
./cpgen.pl cp1256 >> cp
./cpgen.pl cp1257 >> cp
./cpgen.pl cp1258 >> cp
./cpgen.pl iso-8859-1 >> cp
./cpgen.pl iso-8859-2 >> cp
./cpgen.pl iso-8859-3 >> cp
./cpgen.pl iso-8859-4 >> cp
./cpgen.pl iso-8859-5 >> cp
./cpgen.pl iso-8859-6 >> cp
./cpgen.pl iso-8859-7 >> cp
./cpgen.pl iso-8859-8 >> cp
./cpgen.pl iso-8859-9 >> cp
./cpgen.pl iso-8859-10 >> cp
./cpgen.pl iso-8859-11 >> cp
./cpgen.pl iso-8859-13 >> cp
./cpgen.pl iso-8859-14 >> cp
./cpgen.pl iso-8859-15 >> cp
./cpgen.pl iso-8859-16 >> cp
./cpgen.pl ascii >> cp
./cpgen.pl US-ascii >> cp
./cpgen.pl ISO-646-US >> cp
./cpgen.pl ISO-646 >> cp
./cpgen.pl ascii-ctrl >> cp
./cpgen.pl latin1 >> cp
./cpgen.pl AdobeStandardEncoding >> cp
./cpgen.pl MacRoman >> cp
./cpgen.pl nextstep >> cp
./cpgen.pl hp-roman8 >> cp
./cpgen.pl MacCentralEurRoman >> cp
./cpgen.pl MacCroatian >> cp
./cpgen.pl MacRomanian >> cp
./cpgen.pl MacRumanian >> cp
./cpgen.pl Latin3 >> cp
./cpgen.pl Latin4 >> cp
./cpgen.pl MacCyrillic >> cp
./cpgen.pl MacUkrainian >> cp
./cpgen.pl Arabic >> cp
./cpgen.pl MacArabic >> cp
./cpgen.pl MacFarsi >> cp
./cpgen.pl Greek >> cp
./cpgen.pl MacGreek >> cp
./cpgen.pl Hebrew >> cp
./cpgen.pl MacHebrew >> cp
./cpgen.pl MacTurkish >> cp
./cpgen.pl MacIcelandic >> cp
./cpgen.pl MacSami >> cp
./cpgen.pl Thai >> cp
./cpgen.pl MacThai >> cp
./cpgen.pl Latin9 >> cp
./cpgen.pl Latin10 >> cp
./cpgen.pl viscii >> cp
./cpgen.pl koi8-f >> cp
./cpgen.pl koi8-r >> cp
./cpgen.pl koi8-u >> cp
./cpgen.pl gsm0338 >> cp
./cpgen.pl euc-cn >> cp
./cpgen.pl gbk >> cp
./cpgen.pl gb12345-raw >> cp
./cpgen.pl gb2312-raw >> cp
./cpgen.pl hz >> cp
./cpgen.pl iso-ir-165 >> cp
./cpgen.pl euc-jp >> cp
./cpgen.pl shiftjis >> cp
./cpgen.pl macJapanese >> cp
./cpgen.pl 7bit-jis >> cp
./cpgen.pl iso-2022-jp >> cp
./cpgen.pl iso-2022-jp-1 >> cp
./cpgen.pl jis0201-raw >> cp
./cpgen.pl jis0208-raw >> cp
./cpgen.pl jis0212-raw >> cp
./cpgen.pl euc-kr >> cp
./cpgen.pl iso-2022-kr >> cp
./cpgen.pl johab >> cp
./cpgen.pl ksc5601-raw >> cp
./cpgen.pl big5-eten >> cp
./cpgen.pl MacChineseTrad >> cp
./cpgen.pl big5 >> cp
./cpgen.pl big5-hkscs >> cp
./cpgen.pl posix-bc >> cp
./cpgen.pl symbol >> cp
./cpgen.pl dingbats >> cp
./cpgen.pl MacDingbats >> cp
./cpgen.pl AdobeZdingbat >> cp
./cpgen.pl AdobeSymbol >> cp
./cpgen.pl GB2312 >> cp
./cpgen.pl macarabic >> cp
./cpgen.pl macgreek >> cp
./cpgen.pl machebrew >> cp
./cpgen.pl macthai >> cp
./cpgen.pl macturkish >> cp
./cpgen.pl macjapanese >> cp
./cpgen.pl mackorean >> cp
./cpgen.pl Cyrillic >> cp
./cpgen.pl macCyrillic >> cp
./cpgen.pl ISO-8859-8 >> cp
./cpgen.pl macThai >> cp
./cpgen.pl US-ASCII >> cp
./cpgen.pl Shift_JIS >> cp
./cpgen.pl EUC-JP >> cp
./cpgen.pl ISO-2022-JP >> cp
./cpgen.pl ISO-2022-JP-1 >> cp
./cpgen.pl EUC-KR >> cp
./cpgen.pl Big5 >> cp
./cpgen.pl GB_2312-80 >> cp
./cpgen.pl EUC-CN >> cp
./cpgen.pl KOI8-U >> cp
./cpgen.pl KOI8-r >> cp
./cpgen.pl KS_C_5601-1987 >> cp
./cpgen.pl ISO-IR-165 >> cp
./cpgen.pl VISCII >> cp
./cpgen.pl UHC >> cp
./cpgen.pl x-windows-949 >> cp
./cpgen.pl GBK >> cp
./cpgen.pl SJIS >> cp
./cpgen.pl CP932 >> cp
./cpgen.pl Windows-31J >> cp
./cpgen.pl Symbol >> cp

run/unique -inp=cp cp-chars-all

magnumripper · 2015-08-11T13:10:59Z

I'm not quite following what you did with that perl script.

The code page support will be needed for reading files, and for any target encoding used. Example:

UTF-8 wordlist -> rules -> filters -> crk_set_key() -> LM format set_key() (using cp)

The UTF-8 will be converted to UTF-32 in wordlist.c, then stay UTF-32 all the way until cracker.c is about to call format's set_key(). Just before that, it needs to convert to eg. CP850. Note that in this very case, the current code is probably much more efficient.
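The first step of that pipeline (UTF-8 in, UTF-32 out) could look roughly like the sketch below. It is only an illustration: `utf8_to_utf32` is a hypothetical name, not the function JtR uses, and it assumes 0-terminated input and strict rejection of malformed sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode 0-terminated UTF-8 into 0-terminated UTF-32.
 * Returns the number of code points written, or -1 on malformed input
 * or if dst is too small.
 */
static int utf8_to_utf32(const unsigned char *src, uint32_t *dst, size_t dst_len)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dst_len > 0);

    while (*src) {
        uint32_t c = *src++;
        uint32_t min;   /* smallest code point legal for this sequence length */
        int extra;      /* number of continuation bytes still expected */

        if (c < 0x80) {
            extra = 0; min = 0;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; min = 0x80;    c &= 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; min = 0x800;   c &= 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; min = 0x10000; c &= 0x07;
        } else {
            return -1;  /* stray continuation byte or invalid lead byte */
        }

        while (extra--) {
            if ((*src & 0xC0) != 0x80)
                return -1;  /* truncated sequence (also catches the terminator) */
            c = (c << 6) | (*src++ & 0x3F);
        }

        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return -1;      /* overlong form, out of range, or surrogate */
        if (n + 1 >= dst_len)
            return -1;      /* keep room for the terminator */
        dst[n++] = c;
    }
    dst[n] = 0;
    return (int)n;
}
```

Once decoded, rules and filters can index characters directly, which is the point of staying in UTF-32 until just before set_key().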

frank-dittrich · 2015-08-11T14:17:05Z

We have to handle the case that the target code page doesn't include characters that might be read from a wordlist. OTOH, we do have that problem right now as well. How is that handled?

magnumripper · 2015-08-11T17:11:34Z

That is an issue now and will be no matter how we re-write this. It does, and always will, result in garbage candidates.

If we at all support reading non-UTF-8 wordlists, the same applies there: What if it's written in a non-supported codepage like ArmSCII-8? Like today, we'd need to run transparent. A sensible way of handling it would be "convert assuming ISO-8859-1", "process as ASCII" [i.e. ignore non-ASCII when case-toggling and so on] and "convert back assuming ISO-8859-1". This will work pretty much the same as current -enc=raw I think.
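The three "transparent" steps described above can be sketched as follows. All names are hypothetical; the only assumption beyond the text is that strings are 0-terminated and that over-long input is silently truncated to the buffer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "Convert assuming ISO-8859-1": each byte becomes the code point of the
 * same value. Input longer than the buffer is truncated. */
static size_t raw_to_utf32(const unsigned char *src, uint32_t *dst, size_t dst_len)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dst_len > 0);

    while (src[n] && n + 1 < dst_len) {
        dst[n] = src[n];
        n++;
    }
    dst[n] = 0;
    return n;
}

/* "Process as ASCII": toggle case for A-Z and a-z only; everything else,
 * including bytes >= 0x80 of unknown meaning, is left untouched. */
static void toggle_case_ascii(uint32_t *s)
{
    assert(s != NULL);

    for (; *s; s++)
        if ((*s >= 'A' && *s <= 'Z') || (*s >= 'a' && *s <= 'z'))
            *s ^= 0x20;
}

/* "Convert back assuming ISO-8859-1": plain downcast. Values are below
 * 0x100 by construction as long as only ASCII-aware processing ran. */
static size_t utf32_to_raw(const uint32_t *src, unsigned char *dst, size_t dst_len)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dst_len > 0);

    while (src[n] && n + 1 < dst_len) {
        assert(src[n] < 0x100);
        dst[n] = (unsigned char)src[n];
        n++;
    }
    dst[n] = 0;
    return n;
}
```

The bytes that come out are the same as the bytes that went in, except where an ASCII letter was changed, so a word in an unknown code page survives the round trip unharmed.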

jfoug · 2015-08-11T17:21:37Z

> If we at all support reading non-UTF-8 wordlists

?? I would think forcing users to use iconv to put their wordlists into utf8 would be one option, but I bet users would NOT like that much. The main problem is 'shit' wordlists, that are a hodge podge of mixed character sets. Those dirty wordlists still abound around the net, and people like to use them. Yes, if every word in there was properly converted to utf8, then wow, it would be SO much better. But how do we help get from point dirty to point utf8 clean ? Or do we simply not care, and tell users that the word lists need to be in utf8 so there is no ambiguity within JtR?

magnumripper · 2015-08-11T17:34:09Z

I think we should support it, but we could opt not to support it.

For mixed-encoding wordlists, that "transparent mode" concept is mandatory. However, it won't work well for Unicode hashes like NT. It never has and never will, in any cracker - it simply can't.

BTW a problem with "transparent mode" is your pot file entries will not be UTF-8. Ideally we should have a field in the .pot file stating this is the case, and for such entries -show would print eg.

Administrator:M.ller [4d 81 6c 6c 65 72]

That's "Müller" in CP850.

magnumripper · 2015-08-11T17:40:11Z

I'm digressing now, but ideally the pot file format would always be

<hash> : <encoding> : <hexdump of plain in target encoding>

That would work with tabs, colons and whatever, and with any encoding including transparent [== raw == unknown] encoding. It would always be totally reproducible.
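To make the proposed layout concrete, here is a small sketch of a writer for such a line. The function name, the absence of whitespace around the colons, and lowercase hex are all assumptions made for the example; the thread only fixes the three fields and their order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical writer for "<hash>:<encoding>:<hex of plain>".
 * Returns the length of the line written (without the terminator),
 * or -1 if the buffer is too small.
 */
static int format_pot_line(char *out, size_t out_size, const char *hash,
                           const char *encoding,
                           const unsigned char *plain, size_t plain_len)
{
    static const char hexdigits[] = "0123456789abcdef";
    size_t need, pos, i;

    assert(out != NULL && hash != NULL && encoding != NULL);
    assert(plain != NULL || plain_len == 0);

    /* hash + ':' + encoding + ':' + two hex digits per byte + NUL */
    need = strlen(hash) + 1 + strlen(encoding) + 1 + 2 * plain_len + 1;
    if (need > out_size)
        return -1;

    pos = (size_t)snprintf(out, out_size, "%s:%s:", hash, encoding);
    for (i = 0; i < plain_len; i++) {
        out[pos++] = hexdigits[plain[i] >> 4];
        out[pos++] = hexdigits[plain[i] & 0x0F];
    }
    out[pos] = '\0';
    return (int)pos;
}
```

Because neither the encoding name nor the hex field can contain a colon, a reader could split such a line from the right, which keeps it unambiguous even if the hash field itself contains colons.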

jfoug · 2015-08-11T17:41:04Z

We may want to have the .pot file enhanced if we allow anything OTHER than utf8 to be stored in it. The enhancement would be to somehow deliver across information about the encoding.

Lol, you beat me to the punch

magnumripper · 2015-08-11T17:48:00Z

I have considered various ideas over time:

1. Simply change the pot file format, adding fields and functionality. Possibly also start defaulting to using TAB as separator. Or simply use XML or CSV (but quoting will be hell). Actually, the "hex dump" approach as above is the safest and simplest (and we could add more fields to it if we wanted).
2. Use a different pot file for raw mode. This will be transparent for the user, she won't notice the difference. Known encodings will be (by default) in john.pot and raw stuff will be in john.raw.pot - or something like that. When you use -show, it will read both files.
3. A variation of (2): Keep current john.pot as today but add more information in a second file. This is tricky and error prone.

jfoug · 2015-08-11T17:48:19Z

I actually REALLY like that .pot (.pv2 ?) file layout. It is absolutely concise. If we changed it to this:

hash : \t $V2$ \t encoding \t plain-in-hex : plain

Then we would not have to change the .pot file at all. We could still read the legacy stuff, but if the line contains $V2$ as the found password, and has \tsome-valid-encoding\t following that, then we know this is a V2 line, and handle it appropriately and THEN the .pot file actually could have 'mixed' cp data, have the "REAL" data in the file, etc. Actually, would we really 'need' the plain-in-hex ?

magnumripper · 2015-08-11T17:49:28Z

For transparent mode, the plain hex is the only thing that will ~~ever~~ always be proper.

jfoug · 2015-08-11T17:52:23Z

I would really poo-poo XML. Yes, you can do anything with it, BUT it is slow, and is a nightmare for stuff like this, where quoting will have to be done all over the place.

CSV is no better (we already are CSV, but with 2 fields, and the separator being the 'first' ':' seen on the line).

magnumripper · 2015-08-11T17:53:02Z

Also, consider the case where a user fed a CP123 wordlist but stated CP234, and by coincidence some LM hashes were cracked. That plaintext-as-hex will show what the password ACTUALLY was, even though the recorded encoding was incorrect (and as a result, a Unicode print from -show will be incorrect too).

Yes, I hate XML. And the current format is just a variation of CSV.

jfoug · 2015-08-11T17:53:12Z

Should these last few comments be moved to their own topic, possibly as an RFC?

magnumripper · 2015-08-11T17:53:39Z

We should probably have discussed all this on john-dev instead...

magnumripper · 2015-08-11T17:54:03Z

I wish john-dev was a web forum, with GitHub markup 😄

jfoug · 2015-08-11T17:54:24Z

It is not bad to get things pie in the sky talked about offline, then bring to john-dev, but usually only crickets chirp there.

jfoug · 2015-08-11T17:55:31Z

I hate the email lists. So difficult to find anything. Yes, the github is also a cluster, trying to find old stuff, BUT for doing 'hot in the trenches' stuff, it makes following along very easy

magnumripper · 2015-08-11T23:55:42Z

This new flag should be named FMT_UTF32 and not only set_key() should be affected but also get_key().

magnumripper · 2015-08-11T23:56:58Z

I think this also obsoletes both FMT_UNICODE and FMT_UTF8 flags.

magnumripper · 2015-08-12T00:53:05Z

So, if set_key can be either of

static void set_key(char *key, int index);

and

static void set_key(UTF32 *key, int index);

What is the canonical way to declare it? I see at least two solutions. One is to declare it as void pointer in formats.h:

static void set_key(void *key, int index);

and the other is to actually add a function to the struct,

static void set_key(char *key, int index);
static void set_key32(UTF32 *key, int index);

If we go with the latter, any format will probably have one of them as NULL and the other one defined. Actually that would mean we don't strictly need the FMT_UTF32 flag, we could just test if fmt->methods.set_key32 is NULL or not.
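A minimal sketch of that second option, where the presence of set_key32 replaces the flag test. The struct below is a simplified stand-in and not the real layout from formats.h; `UTF32` is assumed to be a 32-bit unsigned type, and the `demo_*` functions exist only so the dispatch can be exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t UTF32;     /* assumption: one code point per element */

/* Simplified stand-in for the methods part of a format struct. */
struct fmt_methods_sketch {
    void (*set_key)(char *key, int index);      /* legacy 8-bit entry point */
    void (*set_key32)(UTF32 *key, int index);   /* proposed UTF-32 entry point */
};

/* Exactly the NULL test mentioned above: no format flag needed. */
static int fmt_wants_utf32(const struct fmt_methods_sketch *m)
{
    assert(m != NULL && (m->set_key != NULL || m->set_key32 != NULL));
    return m->set_key32 != NULL;
}

/* Hypothetical dispatch, roughly what the caller in cracker.c would do. */
static void crk_set_key_sketch(const struct fmt_methods_sketch *m,
                               UTF32 *key32, char *key8, int index)
{
    if (fmt_wants_utf32(m))
        m->set_key32(key32, index);
    else
        m->set_key(key8, index);
}

/* Demo formats: they only record which entry point was called. */
static int last_call;

static void demo_set_key(char *key, int index)
{
    (void)key; (void)index;
    last_call = 8;
}

static void demo_set_key32(UTF32 *key, int index)
{
    (void)key; (void)index;
    last_call = 32;
}
```

Compared with a single `void *key` prototype, the two-pointer variant keeps type checking intact in every format, at the cost of one more struct member.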

magnumripper · 2015-08-12T01:11:15Z

> > We have to handle the case that the target code page doesn't include characters that might be read from a wordlist. OTOH, we do have that problem right now as well. How is that handled?
>
> That is an issue now and will be no matter how we re-write this. It does, and always will, result in garbage candidates.

On a second thought: With the proposed changes, we'd detect in cracker.c that the target encoding can't hold the needed characters - so we could reject the candidate and never send it to the format. This is a good thing.
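A sketch of how such a rejection could work when converting from UTF-32 to the target code page. The function name is hypothetical, and the table is a deliberately tiny excerpt of CP850's upper half, just enough to show the idea; a real table would cover all 128 upper slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cp_pair {
    uint32_t ucs;           /* Unicode code point */
    unsigned char byte;     /* its slot in the code page */
};

/* Partial excerpt of CP850 (0x80..0x87), for illustration only. */
static const struct cp_pair cp850_excerpt[] = {
    { 0x00C7, 0x80 }, { 0x00FC, 0x81 }, { 0x00E9, 0x82 }, { 0x00E2, 0x83 },
    { 0x00E4, 0x84 }, { 0x00E0, 0x85 }, { 0x00E5, 0x86 }, { 0x00E7, 0x87 },
};

/*
 * Hypothetical converter: returns the length in bytes, or -1 if any
 * character has no slot in the target code page (or dst is too small),
 * in which case the caller would drop the candidate.
 */
static int utf32_to_cp(const uint32_t *src, unsigned char *dst, size_t dst_len)
{
    const size_t entries = sizeof(cp850_excerpt) / sizeof(cp850_excerpt[0]);
    size_t n = 0, i;

    assert(src != NULL && dst != NULL && dst_len > 0);

    for (; *src; src++) {
        if (n + 1 >= dst_len)
            return -1;              /* keep room for the terminator */
        if (*src < 0x80) {
            dst[n++] = (unsigned char)*src;     /* ASCII maps to itself */
            continue;
        }
        for (i = 0; i < entries; i++)
            if (cp850_excerpt[i].ucs == *src)
                break;
        if (i == entries)
            return -1;              /* not representable: reject candidate */
        dst[n++] = cp850_excerpt[i].byte;
    }
    dst[n] = 0;
    return (int)n;
}
```

With this shape, "Müller" converts to 4d 81 6c 6c 65 72 while a candidate containing, say, a euro sign is rejected before it ever reaches the format.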

jfoug · 2015-08-12T03:49:20Z

> So, if set_key can be either of

I think we need to step back a bit. We might want to totally relook at the interface. It may be time to provide an interface where the conversion code dumps data right into the key space (on set_key). Get_key IMHO is not hot code. Finding a crack is NOT expected, it is an exception. But setting keys IS hot code.

> What is the canonical way to declare it?

Switch to a real language 😈 where issues like this are simply handled by the compiler.

jfoug · 2015-08-12T04:00:20Z

> We might want to totally relook at the interface.

For example:

Within the converter (that marshals data from the crack type into the format), the format during init() would simply make 'setter' type calls into the converter, providing it with all the information that is required to properly load the data (number of buffers, layout of buffers, size of buffers, byte ordering of buffers, whether to MD the buffers, provided alignment, buffer cleaning required, etc). We would likely still have to provide the existing interface, where 1 word at a time is sent to the format to have it load into its memory layout, for formats which have complex (or extremely simple) data layouts. Formats that simply work on a single password at a time could likely keep the existing naive set_key() layout.

magnumripper · 2015-08-12T08:59:15Z

> Get_key IMHO is not hot code.

No, but it's logical to return a key in the same format you got it in. And it simplifies format code: the conversion back to UTF-8 would be in cracker.c instead of duplicated in all Unicode formats.

magnumripper · 2015-08-12T09:06:53Z

BTW for putting code where most effective, we should actually have it as FMT_UTF16. This means set_key() and get_key() will set/get UTF-16 with surrogates - BUT all of core should still have it as UTF-32. This will mean really simple code in formats.

On another note, while doing all this we'll ensure the key delivered from cracker.c is aligned properly for any use (ie. aligned to 8) no matter if it's char* or UTF16*.

magnumripper · 2015-08-12T09:12:59Z

> > What is the canonical way to declare it?
>
> Switch to a real language 😈 where issues like this are simply handled by the compiler.

Real code don't need a compiler, only an assembler 😈

frank-dittrich · 2015-08-12T09:21:38Z

Why an assembler? http://www.catb.org/jargon/html/story-of-mel.html

magnumripper added enhancement non-trivial labels Aug 9, 2015

This was referenced Aug 9, 2015

Add Unicode support to incremental #1627

Open

Add Unicode support to rules #1628

Open

Add Unicode support to MASK mode #1629

Open

magnumripper mentioned this issue Aug 9, 2015

Full UTF-32 support through-out Jumbo #1632

Open

9 tasks

magnumripper changed the title ~~Add FMT_SET_KEY_32 format flag~~ Add FMT_UTF32 format flag Aug 11, 2015

magnumripper changed the title ~~Add FMT_UTF32 format flag~~ Add FMT_UTF16 format flag Aug 12, 2015

magnumripper mentioned this issue Dec 15, 2018

Format encoding flags revision #3509

Closed
