Add FMT_UTF16 format flag #1631

Open
magnumripper opened this issue Aug 9, 2015 · 34 comments

Comments

@magnumripper
Member

This flag indicates that set_key() is to take a UTF16* (originally proposed as UTF32*) instead of a char* pointer to the key. Formats like NT and Oracle will set it.

For modes such as Incremental that use UTF-32 internally, this means we don't have to go through a (slow) UTF-8 conversion just to satisfy the legacy set_key() prototype.

@magnumripper
Member Author

See ML discussion: http://www.openwall.com/lists/john-dev/2015/08/09/4

@jfoug
Collaborator

jfoug commented Aug 9, 2015

I thought everything in JtR was UTF-16 (actually UCS-2 in many places). Why UTF-32?

@magnumripper
Member Author

UTF-32 is the only sane choice. We definitely don't want to handle UTF-16 surrogates within incremental or rules, that's almost as bad as actually processing UTF-8.

The UTF-32 -> UTF-16 conversion (eg. in NT format) will be very easy and very fast. If we (optionally) just support UCS-2 for speed, it's merely an int->short downcast similar to the original char->short upcast in speed.
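A minimal sketch of the two conversions mentioned above, assuming 0-terminated strings. The function names `utf32_to_utf16` and `utf32_to_ucs2` are hypothetical and are not taken from the JtR source; the surrogate arithmetic is the standard UTF-16 encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not JtR's actual API.
 * Converts a 0-terminated UTF-32 string to UTF-16, emitting a surrogate
 * pair for every code point above U+FFFF. Returns the number of 16-bit
 * units written (terminator excluded), or -1 on invalid input or if
 * dst is too small. */
static int utf32_to_utf16(uint16_t *dst, size_t dst_units, const uint32_t *src)
{
    size_t n = 0;

    assert(dst != NULL && src != NULL);
    for (; *src; src++) {
        uint32_t c = *src;

        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return -1;                  /* not a valid scalar value */
        if (c < 0x10000) {
            if (n + 1 >= dst_units)     /* need room for unit + terminator */
                return -1;
            dst[n++] = (uint16_t)c;
        } else {
            if (n + 2 >= dst_units)     /* need room for pair + terminator */
                return -1;
            c -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (c >> 10));   /* high surrogate */
            dst[n++] = (uint16_t)(0xDC00 | (c & 0x3FF)); /* low surrogate */
        }
    }
    if (n >= dst_units)
        return -1;
    dst[n] = 0;
    return (int)n;
}

/* UCS-2 fast path: a plain 32->16 bit downcast. The caller must guarantee
 * that every code point is below 0x10000; output is silently truncated
 * if dst is too small. */
static int utf32_to_ucs2(uint16_t *dst, size_t dst_units, const uint32_t *src)
{
    size_t n = 0;

    assert(dst != NULL && src != NULL && dst_units > 0);
    while (src[n] && n + 1 < dst_units) {
        dst[n] = (uint16_t)src[n];
        n++;
    }
    dst[n] = 0;
    return (int)n;
}
```

The UCS-2 variant has no branches on the character value, which is what makes it comparable in cost to the existing char->short upcast.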

@frank-dittrich
Collaborator

frank-dittrich commented Aug 10, 2015 via email

@jfoug
Collaborator

jfoug commented Aug 10, 2015

> Word lists which mostly contain ascii would also get 4 times as large instead of just 2 times.

Only internally within the memory footprint of john. The external file would be the same.

I am still not 100% sold on everything running as UTF-32. I find that to be wasteful overkill if the file really is just 7-bit ASCII, or 99% 7-bit ASCII.

@magnumripper
Member Author

We could try to fit both 32-bit and 8-bit code in there simultaneously but it will be significantly harder and a LOT messier. I think the current code stretches the "don't go 32-bit" as far as sensible but the next step is to bite the bullet and go all in.

@jfoug
Collaborator

jfoug commented Aug 11, 2015

We will still need some of the code page support, but hopefully it will be far less encompassing, and will be localized in the code for wordlist.c or loader. We will only need to have character mapping. NOT any of the classification / or casing logic. But adding mapping may be VERY easy, at least for code pages in perl. Here is some code I did to grab the utf8 bytes from code pages supported by perl.

#!/usr/bin/perl
use Encode;
my $s;
my $i = 0;
for (; $i < 0x100; ++$i) {
    $s = chr($i);
    my $cp = decode($ARGV[0], $s);
    my $final = encode('UTF-8', $cp);
    # Do not print the character if it is a control char (ord < 32).
    if (defined($final) && length($final)>0 && ord($final) > 31) {
        # Print the newline in a separate print statement. This is because
        # of EBCDIC, where \n is NOT in slot 0x0a. If we concatenated \n
        # to the final string and it were an EBCDIC string, we would NOT
        # get a file with one character on each line.
        print $final;
        print "\n";
    }
}
print STDERR "code page $ARGV[0] handled\n";

and here are 'most' of the code pages handled by Perl (including the EBCDIC ones, which are a bit ugly since \n is not where you think it should be, lol)

#!/bin/sh
rm -f cp
rm -f cp-chars-all
./cpgen.pl cp37 >> cp
./cpgen.pl cp424 >> cp
./cpgen.pl cp437 >> cp
./cpgen.pl cp500 >> cp
./cpgen.pl cp737 >> cp
./cpgen.pl cp775 >> cp
./cpgen.pl cp850 >> cp
./cpgen.pl cp852 >> cp
./cpgen.pl cp855 >> cp
./cpgen.pl cp856 >> cp
./cpgen.pl cp857 >> cp
./cpgen.pl cp858 >> cp
./cpgen.pl cp860 >> cp
./cpgen.pl cp861 >> cp
./cpgen.pl cp862 >> cp
./cpgen.pl cp863 >> cp
./cpgen.pl cp864 >> cp
./cpgen.pl cp865 >> cp
./cpgen.pl cp866 >> cp
./cpgen.pl cp869 >> cp
./cpgen.pl cp874 >> cp
./cpgen.pl cp875 >> cp
./cpgen.pl cp932 >> cp
./cpgen.pl cp936 >> cp
./cpgen.pl cp949 >> cp
./cpgen.pl cp950 >> cp
./cpgen.pl cp1006 >> cp
./cpgen.pl cp1026 >> cp
./cpgen.pl cp1047 >> cp
./cpgen.pl cp1250 >> cp
./cpgen.pl cp1251 >> cp
./cpgen.pl cp1252 >> cp
./cpgen.pl cp1253 >> cp
./cpgen.pl cp1254 >> cp
./cpgen.pl cp1255 >> cp
./cpgen.pl cp1256 >> cp
./cpgen.pl cp1257 >> cp
./cpgen.pl cp1258 >> cp
./cpgen.pl iso-8859-1 >> cp
./cpgen.pl iso-8859-2 >> cp
./cpgen.pl iso-8859-3 >> cp
./cpgen.pl iso-8859-4 >> cp
./cpgen.pl iso-8859-5 >> cp
./cpgen.pl iso-8859-6 >> cp
./cpgen.pl iso-8859-7 >> cp
./cpgen.pl iso-8859-8 >> cp
./cpgen.pl iso-8859-9 >> cp
./cpgen.pl iso-8859-10 >> cp
./cpgen.pl iso-8859-11 >> cp
./cpgen.pl iso-8859-13 >> cp
./cpgen.pl iso-8859-14 >> cp
./cpgen.pl iso-8859-15 >> cp
./cpgen.pl iso-8859-16 >> cp
./cpgen.pl ascii >> cp
./cpgen.pl US-ascii >> cp
./cpgen.pl ISO-646-US >> cp
./cpgen.pl ISO-646 >> cp
./cpgen.pl ascii-ctrl >> cp
./cpgen.pl latin1 >> cp
./cpgen.pl AdobeStandardEncoding >> cp
./cpgen.pl MacRoman >> cp
./cpgen.pl nextstep >> cp
./cpgen.pl hp-roman8 >> cp
./cpgen.pl MacCentralEurRoman >> cp
./cpgen.pl MacCroatian >> cp
./cpgen.pl MacRomanian >> cp
./cpgen.pl MacRumanian >> cp
./cpgen.pl Latin3 >> cp
./cpgen.pl Latin4 >> cp
./cpgen.pl MacCyrillic >> cp
./cpgen.pl MacUkrainian >> cp
./cpgen.pl Arabic >> cp
./cpgen.pl MacArabic >> cp
./cpgen.pl MacFarsi >> cp
./cpgen.pl Greek >> cp
./cpgen.pl MacGreek >> cp
./cpgen.pl Hebrew >> cp
./cpgen.pl MacHebrew >> cp
./cpgen.pl MacTurkish >> cp
./cpgen.pl MacIcelandic >> cp
./cpgen.pl MacSami >> cp
./cpgen.pl Thai >> cp
./cpgen.pl MacThai >> cp
./cpgen.pl Latin9 >> cp
./cpgen.pl Latin10 >> cp
./cpgen.pl viscii >> cp
./cpgen.pl koi8-f >> cp
./cpgen.pl koi8-r >> cp
./cpgen.pl koi8-u >> cp
./cpgen.pl gsm0338 >> cp
./cpgen.pl euc-cn >> cp
./cpgen.pl gbk >> cp
./cpgen.pl gb12345-raw >> cp
./cpgen.pl gb2312-raw >> cp
./cpgen.pl hz >> cp
./cpgen.pl iso-ir-165 >> cp
./cpgen.pl euc-jp >> cp
./cpgen.pl shiftjis >> cp
./cpgen.pl macJapanese >> cp
./cpgen.pl 7bit-jis >> cp
./cpgen.pl iso-2022-jp >> cp
./cpgen.pl iso-2022-jp-1 >> cp
./cpgen.pl jis0201-raw >> cp
./cpgen.pl jis0208-raw >> cp
./cpgen.pl jis0212-raw >> cp
./cpgen.pl euc-kr >> cp
./cpgen.pl iso-2022-kr >> cp
./cpgen.pl johab >> cp
./cpgen.pl ksc5601-raw >> cp
./cpgen.pl big5-eten >> cp
./cpgen.pl MacChineseTrad >> cp
./cpgen.pl big5 >> cp
./cpgen.pl big5-hkscs >> cp
./cpgen.pl posix-bc >> cp
./cpgen.pl symbol >> cp
./cpgen.pl dingbats >> cp
./cpgen.pl MacDingbats >> cp
./cpgen.pl AdobeZdingbat >> cp
./cpgen.pl AdobeSymbol >> cp
./cpgen.pl GB2312 >> cp
./cpgen.pl macarabic >> cp
./cpgen.pl macgreek >> cp
./cpgen.pl machebrew >> cp
./cpgen.pl macthai >> cp
./cpgen.pl macturkish >> cp
./cpgen.pl macjapanese >> cp
./cpgen.pl mackorean >> cp
./cpgen.pl Cyrillic >> cp
./cpgen.pl macCyrillic >> cp
./cpgen.pl ISO-8859-8 >> cp
./cpgen.pl macThai >> cp
./cpgen.pl US-ASCII >> cp
./cpgen.pl Shift_JIS >> cp
./cpgen.pl EUC-JP >> cp
./cpgen.pl ISO-2022-JP >> cp
./cpgen.pl ISO-2022-JP-1 >> cp
./cpgen.pl EUC-KR >> cp
./cpgen.pl Big5 >> cp
./cpgen.pl GB_2312-80 >> cp
./cpgen.pl EUC-CN >> cp
./cpgen.pl KOI8-U >> cp
./cpgen.pl KOI8-r >> cp
./cpgen.pl KS_C_5601-1987 >> cp
./cpgen.pl ISO-IR-165 >> cp
./cpgen.pl VISCII >> cp
./cpgen.pl UHC >> cp
./cpgen.pl x-windows-949 >> cp
./cpgen.pl GBK >> cp
./cpgen.pl SJIS >> cp
./cpgen.pl CP932 >> cp
./cpgen.pl Windows-31J >> cp
./cpgen.pl Symbol >> cp

run/unique -inp=cp cp-chars-all

@magnumripper
Member Author

I'm not quite following what you did with that perl script.

The code page support will be needed for reading files, and for any target encoding used. Example

UTF-8 wordlist -> rules -> filters -> crk_set_key() -> LM format set_key() (using cp)

The UTF-8 will be converted to UTF-32 in wordlist.c, then stay UTF-32 all the way until cracker.c is about to call format's set_key(). Just before that, it needs to convert to eg. CP850. Note that in this very case, the current code is probably much more efficient.
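The first hop of that pipeline (UTF-8 in, UTF-32 out) could look roughly like the sketch below. `utf8_to_utf32` is a hypothetical name, not an existing function in wordlist.c, and the strict rejection of malformed input is an assumption about the desired behaviour rather than something decided in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. Decodes a 0-terminated UTF-8 string into UTF-32.
 * Returns the number of code points written (terminator excluded), or -1
 * on malformed input (bad lead byte, truncated or overlong sequence,
 * surrogate, value above U+10FFFF) or if dst is too small. */
static int utf8_to_utf32(uint32_t *dst, size_t dst_units,
                         const unsigned char *src)
{
    size_t n = 0;

    assert(dst != NULL && src != NULL && dst_units > 0);
    while (*src) {
        uint32_t c = *src++;
        uint32_t min;   /* smallest value allowed for this sequence length */
        int extra;      /* number of continuation bytes expected */

        if (c < 0x80) {
            extra = 0; min = 0;
        } else if ((c & 0xE0) == 0xC0) {
            c &= 0x1F; extra = 1; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            c &= 0x0F; extra = 2; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            c &= 0x07; extra = 3; min = 0x10000;
        } else {
            return -1;  /* stray continuation byte or invalid lead byte */
        }

        while (extra--) {
            if ((*src & 0xC0) != 0x80)  /* also catches the terminator */
                return -1;
            c = (c << 6) | (uint32_t)(*src++ & 0x3F);
        }
        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return -1;
        if (n + 1 >= dst_units)
            return -1;
        dst[n++] = c;
    }
    dst[n] = 0;
    return (int)n;
}
```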

@frank-dittrich
Collaborator

We have to handle the case that the target code page doesn't include characters that might be read from a wordlist. OTOH, we do have that problem right now as well. How is that handled?

@magnumripper
Member Author

That is an issue now and will be no matter how we re-write this. It does, and always will, result in garbage candidates.

If we at all support reading non-UTF-8 wordlists, the same applies there: What if it's written in a non-supported codepage like ArmSCII-8? Like today, we'd need to run transparent. A sensible way of handling it would be "convert assuming ISO-8859-1", "process as ASCII" [i.e. ignore non-ASCII when case-toggling and so on] and "convert back assuming ISO-8859-1". This will work pretty much the same as current -enc=raw I think.
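The three steps of that transparent handling can be sketched as follows. All names are hypothetical, and case toggling stands in for whatever rule or filter would run in the middle. Widening a byte to the same numeric code point is exactly what "assuming ISO-8859-1" means, so the round trip is lossless for any 8-bit input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step 1: "convert assuming ISO-8859-1". Each byte becomes the code
 * point with the same numeric value. */
static void raw_to_utf32(uint32_t *dst, const unsigned char *src, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        dst[i] = src[i];
}

/* Step 2: "process as ASCII". Only A-Z and a-z are toggled; everything
 * at 0x80 and above is left untouched because its meaning is unknown. */
static void toggle_case_ascii(uint32_t *s, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= 'a' - 'A';
        else if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
    }
}

/* Step 3: "convert back assuming ISO-8859-1". The ASCII-only processing
 * above can never produce a value over 0xFF, which the assert documents. */
static void utf32_to_raw(unsigned char *dst, const uint32_t *src, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        assert(src[i] <= 0xFF);
        dst[i] = (unsigned char)src[i];
    }
}
```

Feeding the CP850 bytes for "Müller" through these steps toggles the five ASCII letters and passes 0x81 through unchanged.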

@jfoug
Collaborator

jfoug commented Aug 11, 2015

> If we at all support reading non-UTF-8 wordlists

?? I would think forcing users to use iconv to put their wordlists into utf8 would be one option, but I bet users would NOT like that much. The main problem is 'shit' wordlists, that are a hodge podge of mixed character sets. Those dirty wordlists still abound around the net, and people like to use them. Yes, if every word in there was properly converted to utf8, then wow, it would be SO much better. But how do we help get from point dirty to point utf8 clean ? Or do we simply not care, and tell users that the word lists need to be in utf8 so there is no ambiguity within JtR?

@magnumripper
Member Author

I think we should support it, but we could opt not to support it.

For mixed-encoding wordlists, that "transparent mode" concept is mandatory. However, it won't work well for Unicode hashes like NT. It never has and never will, in any cracker - it simply can't.

BTW a problem with "transparent mode" is your pot file entries will not be UTF-8. Ideally we should have a field in the .pot file stating this is the case, and for such entries -show would print eg.

Administrator:M.ller [4d 81 6c 6c 65 72]

That's "Müller" in CP850.
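A sketch of how such a -show line could be rendered for a raw-mode entry. `format_raw_plain` is a hypothetical helper; replacing every non-printable-ASCII byte with '.' is an assumption based on the example above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper. Renders a raw plaintext as printable text followed
 * by a hex dump, e.g. "M.ller [4d 81 6c 6c 65 72]". Bytes outside
 * printable ASCII are shown as '.'. Returns the length of the resulting
 * string, or -1 if out is too small. */
static int format_raw_plain(char *out, size_t out_size,
                            const unsigned char *plain, size_t len)
{
    /* text + " [" + "xx" per byte with single spaces between + "]" + NUL */
    size_t need = len + 2 + (len ? 3 * len - 1 : 0) + 1 + 1;
    char *p = out;
    size_t i;

    assert(out != NULL && plain != NULL);
    if (out_size < need)
        return -1;

    for (i = 0; i < len; i++)
        *p++ = (plain[i] >= 0x20 && plain[i] < 0x7F) ? (char)plain[i] : '.';

    memcpy(p, " [", 2);
    p += 2;
    for (i = 0; i < len; i++)
        p += snprintf(p, 4, i ? " %02x" : "%02x", (unsigned)plain[i]);
    *p++ = ']';
    *p = '\0';
    return (int)(p - out);
}
```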

@magnumripper
Member Author

I'm digressing now, but ideally the pot file format would always be

<hash> : <encoding> : <hexdump of plain in target encoding>

That would work with tabs, colons and whatever, and with any encoding including transparent [== raw == unknown] encoding. It would always be totally reproducible.
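A rough sketch of writing and reading such a line. The function names are hypothetical, the field separator is assumed to be a bare ':' with no surrounding spaces, and encoding names are assumed never to contain a colon. Because the hex field cannot contain a colon either, the reader can split from the right.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. Builds "hash:encoding:hex" in out. Returns the
 * line length, or -1 if out is too small. */
static int pot_make_line(char *out, size_t out_size, const char *hash,
                         const char *encoding,
                         const unsigned char *plain, size_t len)
{
    static const char digits[] = "0123456789abcdef";
    size_t hlen, elen, i;
    char *p = out;

    assert(out != NULL && hash != NULL && encoding != NULL && plain != NULL);
    hlen = strlen(hash);
    elen = strlen(encoding);
    if (out_size < hlen + elen + 2 * len + 3)   /* two ':' and the NUL */
        return -1;

    memcpy(p, hash, hlen);     p += hlen; *p++ = ':';
    memcpy(p, encoding, elen); p += elen; *p++ = ':';
    for (i = 0; i < len; i++) {
        *p++ = digits[plain[i] >> 4];
        *p++ = digits[plain[i] & 15];
    }
    *p = '\0';
    return (int)(p - out);
}

static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper. Decodes the hex field after the LAST ':' of a
 * line into plain. Returns the byte count, or -1 on a malformed line
 * or if plain is too small. */
static int pot_parse_plain(const char *line, unsigned char *plain,
                           size_t plain_size)
{
    const char *hex;
    size_t n = 0;

    assert(line != NULL && plain != NULL);
    hex = strrchr(line, ':');
    if (!hex)
        return -1;
    for (hex++; hex[0] && hex[1]; hex += 2) {
        int hi = hex_val(hex[0]), lo = hex_val(hex[1]);

        if (hi < 0 || lo < 0 || n >= plain_size)
            return -1;
        plain[n++] = (unsigned char)(hi << 4 | lo);
    }
    return hex[0] ? -1 : (int)n;    /* odd digit count is an error */
}
```

Since the stored bytes are never interpreted, a plaintext containing tabs, colons or bytes in an unknown encoding survives the round trip unchanged.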

@jfoug
Collaborator

jfoug commented Aug 11, 2015

We may want to have the .pot file enhanced if we allow anything OTHER than utf8 to be stored in it. The enhancement would be to somehow deliver across information about the encoding.

Lol, you beat me to the punch

@magnumripper
Member Author

I have considered various ideas over time:

  1. Simply change the pot file format, adding fields and functionality. Possibly also start defaulting to using TAB as separator. Or simply use XML or CSV (but quoting will be hell). Actually, the "hex dump" approach as above is the safest and simplest (and we could add more fields to it if we wanted).
  2. Use a different pot file for raw mode. This will be transparent for the user, she won't notice the difference. Known encodings will be (by default) in john.pot and raw stuff will be in john.raw.pot - or something like that. When you use -show, it will read both files.
  3. A variation of (2): Keep current john.pot as today but add more information in a second file. This is tricky and error prone.

@jfoug
Collaborator

jfoug commented Aug 11, 2015

I actually REALLY like that .pot (.pv2 ?) file layout. It is absolutely concise. If we changed it to this:

hash : \t $V2$ \t encoding \t plain-in-hex : plain

Then we would not have to change the .pot file at all. We could still read the legacy stuff, but if the line contains $V2$ as the found password, and has \tsome-valid-encoding\t following that, then we know this is a V2 line, and handle it appropriately and THEN the .pot file actually could have 'mixed' cp data, have the "REAL" data in the file, etc. Actually, would we really 'need' the plain-in-hex ?
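A sketch of how a reader might tell a V2 line from a legacy one. It assumes the spaces in the layout above are only for readability, so that the legacy "password" field literally begins with a tab, `$V2$`, and another tab. `pot_is_v2` is a hypothetical name, and checking the encoding against a list of known names is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define V2_MARK "\t$V2$\t"

/* Hypothetical helper. `field` is everything after the first ':' of a
 * pot line, i.e. what a legacy reader treats as the plaintext. Returns 1
 * and copies the encoding name into enc if the field looks like a V2
 * record; returns 0 for a legacy line. */
static int pot_is_v2(const char *field, char *enc, size_t enc_size)
{
    const char *start, *end;
    size_t len;

    assert(field != NULL && enc != NULL);
    if (strncmp(field, V2_MARK, strlen(V2_MARK)) != 0)
        return 0;                   /* no marker: legacy line */

    start = field + strlen(V2_MARK);
    end = strchr(start, '\t');      /* the encoding name ends at next tab */
    if (!end || end == start)
        return 0;
    len = (size_t)(end - start);
    if (len >= enc_size)
        return 0;

    memcpy(enc, start, len);
    enc[len] = '\0';
    return 1;
}
```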

@magnumripper
Member Author

For transparent mode, the plain hex is the only thing that will ever always be proper.

@jfoug
Collaborator

jfoug commented Aug 11, 2015

I would really poo-poo XML. Yes, you can do anything with it, BUT it is slow, and it is a nightmare for stuff like this, where quoting will have to be done all over the place.

CSV is no better (we already are CSV, but with 2 fields, and the separator being the 'first' ':' seen on the line).

@magnumripper
Member Author

Also, consider the case where a user fed a CP123 wordlist but stated CP234, and by coincidence some LM hashes were cracked. That plaintext-as-hex will show what the password ACTUALLY was, even though the encoding, as recorded, was incorrect (and as a result, a Unicode print from -show will be incorrect too).

Yes, I hate XML. And the current format is just a variation of CSV.

@jfoug
Collaborator

jfoug commented Aug 11, 2015

Should these last few comments be moved to their own topic, possibly as an RFC?

@magnumripper
Member Author

We should probably have discussed all this on john-dev instead...

@magnumripper
Member Author

I wish john-dev was a web forum, with GitHub markup 😄

@jfoug
Collaborator

jfoug commented Aug 11, 2015

It is not bad to get pie-in-the-sky things talked about offline and then bring them to john-dev, but usually only crickets chirp there.

@jfoug
Collaborator

jfoug commented Aug 11, 2015

I hate the email lists. It is so difficult to find anything. Yes, GitHub is also a cluster when trying to find old stuff, BUT for doing 'hot in the trenches' stuff, it makes following along very easy.

@magnumripper magnumripper changed the title Add FMT_SET_KEY_32 format flag Add FMT_UTF32 format flag Aug 11, 2015
@magnumripper
Member Author

This new flag should be named FMT_UTF32 and not only set_key() should be affected but also get_key().

@magnumripper
Member Author

I think this also obsoletes both FMT_UNICODE and FMT_UTF8 flags.

@magnumripper
Member Author

So, if set_key can be either of

static void set_key(char *key, int index);

and

static void set_key(UTF32 *key, int index);

What is the canonical way to declare it? I see at least two solutions. One is to declare it as void pointer in formats.h:

static void set_key(void *key, int index);

and the other is to actually add a function to the struct,

static void set_key(char *key, int index);
static void set_key32(UTF32 *key, int index);

If we go with the latter, any format will probably have one of them as NULL and the other one defined. Actually that would mean we don't strictly need the FMT_UTF32 flag, we could just test if fmt->methods.set_key32 is NULL or not.
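The second option, with the NULL test replacing the flag, might look like the sketch below. `struct fmt_methods_sketch`, `deliver_key` and the two dummy formats are made up for illustration, and `UTF32` is a local stand-in typedef; the real struct in formats.h has many more members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t UTF32;     /* stand-in for JtR's own typedef */

/* Hypothetical, cut-down method table with both setters. */
struct fmt_methods_sketch {
    void (*set_key)(char *key, int index);
    void (*set_key32)(UTF32 *key, int index);
};

/* No FMT_UTF32 flag needed: a non-NULL set_key32 says it all. */
static int fmt_wants_utf32(const struct fmt_methods_sketch *m)
{
    return m->set_key32 != NULL;
}

/* What the caller (cracker.c in the discussion) would do per candidate. */
static void deliver_key(const struct fmt_methods_sketch *m,
                        char *key8, UTF32 *key32, int index)
{
    if (fmt_wants_utf32(m)) {
        m->set_key32(key32, index);
    } else {
        assert(m->set_key != NULL);     /* exactly one must be defined */
        m->set_key(key8, index);
    }
}

/* Two dummy formats that only count calls, to show the dispatch. */
static int calls8, calls32;

static void legacy_set_key(char *key, int index)
{
    (void)key; (void)index;
    calls8++;
}

static void wide_set_key32(UTF32 *key, int index)
{
    (void)key; (void)index;
    calls32++;
}

static const struct fmt_methods_sketch legacy_fmt = { legacy_set_key, NULL };
static const struct fmt_methods_sketch wide_fmt = { NULL, wide_set_key32 };
```

Compared with the void pointer variant, this keeps type checking on both paths, at the cost of one more member in the struct.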

@magnumripper
Member Author

> We have to handle the case that the target code page doesn't include characters that might be read from a wordlist. OTOH, we do have that problem right now as well. How is that handled?

> That is an issue now and will be no matter how we re-write this. It does, and always will, result in garbage candidates.

On a second thought: With the proposed changes, we'd detect in cracker.c that the target encoding can't hold the needed characters - so we could reject the candidate and never send it to the format. This is a good thing.
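A sketch of that rejection, using ISO-8859-1 as the target because its mapping is simply "code points up to 0xFF, unchanged"; a real code page such as CP850 would need a lookup table instead of the single comparison. `utf32_to_latin1` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. Converts a 0-terminated UTF-32 candidate to
 * ISO-8859-1. Returns 1 on success. Returns 0 if any character cannot
 * be represented in the target encoding (or dst is too small), in which
 * case the caller should drop the candidate instead of calling the
 * format's set_key(). */
static int utf32_to_latin1(unsigned char *dst, size_t dst_size,
                           const uint32_t *src)
{
    size_t n = 0;

    assert(dst != NULL && src != NULL && dst_size > 0);
    for (; *src; src++) {
        if (*src > 0xFF)            /* not in ISO-8859-1: reject */
            return 0;
        if (n + 1 >= dst_size)
            return 0;
        dst[n++] = (unsigned char)*src;
    }
    dst[n] = 0;
    return 1;
}
```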

@jfoug
Collaborator

jfoug commented Aug 12, 2015

> So, if set_key can be either of

I think we need to step back a bit. We might want to totally relook at the interface. It may be time to provide an interface where the conversion code dumps data right into the key space (on set_key). Get_key IMHO is not hot code. Finding a crack is NOT expected, it is an exception. But setting keys IS hot code.

> What is the canonical way to declare it?

Switch to a real language 😈 where issues like this are simply handled by the compiler.

@jfoug
Collaborator

jfoug commented Aug 12, 2015

> We might want to totally relook at the interface.

For example:

Within a converter (which marshals data from the crack type into the format), the format's init() would simply make 'setter'-type calls into the converter, providing it with all the information required to load the data properly (number of buffers, layout of buffers, size of buffers, byte ordering of buffers, whether to MD the buffers, provided alignment, buffer cleaning required, etc.). We would likely still have to provide the existing interface, where one word at a time is sent to the format to load into its own memory layout, for formats that have complex (or extremely simple) data layouts. Formats that simply work on a single password at a time could likely keep the existing naive set_key() layout.

@magnumripper
Member Author

> Get_key IMHO is not hot code.

No, but it's logical to return a key in the same format you got it. And it simplifies format code: the conversion back to UTF-8 would be in cracker.c instead of being duplicated in all Unicode formats.

@magnumripper
Member Author

BTW for putting code where most effective, we should actually have it as FMT_UTF16. This means set_key() and get_key() will set/get UTF-16 with surrogates - BUT all of core should still have it as UTF-32. This will mean really simple code in formats.

On another note, while doing all this we'll ensure the key delivered from cracker.c is aligned properly for any use (ie. aligned to 8) no matter if it's char* or UTF16*.
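One way to get that alignment guarantee, sketched with C11 `alignas`. The type name `key_buf` and the buffer size are hypothetical; the point is only that a single 8-byte-aligned buffer can be handed out as either char* or UTF16*.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define KEY_BUF_BYTES 512       /* hypothetical size, not a JtR constant */

/* One buffer, two views. The alignas on the first member raises the
 * alignment of the whole union to 8, and every union member starts at
 * offset 0, so both views share that alignment. */
typedef union {
    alignas(8) unsigned char bytes[KEY_BUF_BYTES];
    char as_char[KEY_BUF_BYTES];
    uint16_t as_utf16[KEY_BUF_BYTES / 2];
} key_buf;

/* Returns 1 if p is aligned to an 8-byte boundary. */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7) == 0;
}
```

A format could then read the key with 64-bit loads without first checking or fixing up the pointer.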

@magnumripper magnumripper changed the title Add FMT_UTF32 format flag Add FMT_UTF16 format flag Aug 12, 2015
@magnumripper
Member Author

> What is the canonical way to declare it?

> Switch to a real language 😈 where issues like this are simply handled by the compiler.

Real code don't need a compiler, only an assembler 😈

@frank-dittrich
Collaborator

Why an assembler? http://www.catb.org/jargon/html/story-of-mel.html
