Status Checkup #24

Open
clayrisser opened this issue Sep 5, 2017 · 21 comments

Comments

@clayrisser

This project is sooo historically cool. I would love to know the status of the project. I haven't seen any activity for several months. I'm also willing to contribute if you need more manpower.

@WardCunningham
Owner

Most of my attention goes into my new wiki, federated wiki, which is well positioned to be historically cool in a few decades. See http://about.fed.wiki.

Still, thanks for asking. I continue to pay co-location fees because I don't want to lose the few hundred pages that I wasn't able to recover mostly due to mixed character encoding problems.

I'm not looking for advice on how to become a better programmer. But I would appreciate some help. I could put together some tar files with troublesome pages and various backups. If you or anyone else had a good approach for converting these to utf-8 I'd love to see this work done.

@clayrisser
Author

What encoding are the troublesome files in?

@clayrisser
Author

clayrisser commented Sep 6, 2017

So, is http://wiki.c2.com/ going to be permanently frozen, or are there plans on opening it up again?

@clayrisser
Author

I read through the explanations of federated wiki, but they're pretty dense, and I don't fully understand its purpose.

@WardCunningham
Owner

WardCunningham commented Sep 7, 2017

The troublesome files are in mixed encoding, having been edited by a variety of browsers at a time when utf-8 was uncommon. Federated wiki offers an alternative (and editable) view into historic wiki pages. This javascript is more faithful to the original perl code. Wiki is more a medium, like paper, than a tool with a purpose, like a stapler. Federated wiki is a medium for doing work as well as talking about doing work. I do have trouble devoting energy here, in the past, but I would be glad to work on it with others.

@maxlybbert

If you’re still interested in fixing the “troublesome files,” it sounds like an interesting problem. I’m not aware of any existing tool to autodetect the encoding of a part of a file, but I’m optimistic. Would you have time to get a few samples?

@WardCunningham
Owner

WardCunningham commented Dec 1, 2018

I will prepare a sampling of troublesome pages and post a link to a tar file here. This repo has the ruby and c programs I used to convert most pages to json. The c program, json.c => a.out, converts troublesome characters to something that can be recognized by the ruby program, json.rb. The one character I had to convert to get anything working was the ASCII GS (group separator) character that I had used in my original perl code to separate groups. I suspect a large number of troublesome files can be handled by adding more cases to json.c. But what are the cases? That is the question that slowed down my conversion.

Here I resort again to perl to count byte occurrences in a known good file.

cat wiki.wdb/WardCunningham | \
  perl -e 'while(read STDIN,$text,1024){@bytes=unpack"C*",$text;for(@bytes){printf"%03o\n",$_}}' | \
  sort | uniq -c

Where for this file I get these counts:

      8 011
    975 012
    975 015
  10451 040
     14 041
    122 042
      4 045
     24 046
    930 047
     75 050
     81 051
    202 052
      2 053
    415 054
    707 055
    945 056
    277 057
    150 060
     82 061
    119 062
     46 063
     48 064
     58 065
     44 066
     37 067
     48 070
     46 071
    121 072
     12 073
     26 075
    126 077
      4 100
    142 101
    120 102
    165 103
     86 104
    120 105
     77 106
     43 107
     94 110
    455 111
     46 112
     24 113
     88 114
    125 115
     65 116
    106 117
    146 120
     12 121
     72 122
    261 123
    227 124
     36 125
     14 126
    392 127
     10 130
     14 131
      2 132
     10 133
     10 135
     26 137
   3852 141
    751 142
   1518 143
   1855 144
   5889 145
    945 146
   1259 147
   2160 150
   3900 151
     70 152
    712 153
   2133 154
   1397 155
   3585 156
   4340 157
   1159 160
     44 161
   3226 162
   3409 163
   4668 164
   1582 165
    518 166
   1091 167
    198 170
    887 171
     41 172
      2 176
     19 263

Below 040 are ASCII control codes. Here we see TAB, LF and CR.
Above 177 are 8-bit codes: 7 bits plus the high bit, 200 octal.
I see here that I'm using octal code 263 as group separator.
I vaguely remember switching to this unlikely code but don't remember why.
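For anyone who would rather not read the perl one-liner, here is a minimal sketch of the same byte count in Python. The function names and the sample bytes are made up for illustration; only the octal formatting and the 263 separator come from the output above.

```python
from collections import Counter


def byte_histogram(data: bytes) -> dict:
    """Count each byte value, keyed by 3-digit octal like the perl output."""
    counts = Counter(data)
    return {f"{value:03o}": counts[value] for value in sorted(counts)}


def high_bytes(hist: dict) -> dict:
    """Keep only codes above 177 octal, i.e. bytes with the high bit set."""
    return {code: n for code, n in hist.items() if int(code, 8) > 0o177}


# Hypothetical sample: two fields, each ended by the 263 (0xB3) separator.
sample = b"Ward\xb3Cunningham\r\n\tWiki\xb3"
hist = byte_histogram(sample)
for code, n in hist.items():
    print(f"{n:7d} {code}")
```

Filtering with `high_bytes` leaves only the codes that need a decision, which for a known good file should be the separator alone.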

@WardCunningham
Owner

I've put together a list of troublesome pages.

http://c2.com/wiki/remodel/trouble.txt

I'm also serving these in their original (not html) format which would feed into the json.c and json.rb scripts in this repo. I've picked one largish example from this list for character distribution analysis.

http://wiki.c2.com/?XpAsTroubleDetector

Here I run the perl script from above on this using the original format file as input:

curl -s http://c2.com/wiki/remodel/trouble/XpAsTroubleDetector | \
  perl -e 'while(read STDIN,$text,1024){@bytes=unpack"C*",$text;for(@bytes){printf"%03o\n",$_}}' | \
  sort | uniq -c

Where for this input I get these counts:

  10 011
  73 012
  73 015
3627 040
   1 041
  42 042
 139 047
  12 050
  12 051
  49 054
  32 055
7872 056
10335 057
2146 060
8515 061
2874 062
 979 063
1234 064
3384 065
3407 066
1013 067
1115 070
 888 071
2599 072
   3 073
  15 077
   6 101
   6 102
   5 103
   6 104
   2 105
   4 106
   4 107
   6 110
  35 111
   1 112
   2 113
   3 114
   5 115
   2 116
   3 117
  28 120
  11 123
  18 124
   5 125
   1 126
  11 127
  26 130
   3 131
2588 133
2588 135
2907 141
  61 142
 198 143
2674 144
3123 145
 108 146
2855 147
5547 150
5356 151
  24 152
  39 153
2790 154
2752 155
7891 156
 522 157
2825 160
   6 161
 281 162
 342 163
10810 164
 154 165
  36 166
 252 167
   9 170
2613 171
   3 172
   1 242
   1 245
   1 250
  19 260
   2 262
  15 263
   1 264
  20 265
   1 266
   9 267
   1 270
   2 273
   2 276
   2 277
   4 302
   6 303
   1 305
   3 312
   2 314
   2 315
   7 317
   1 320
   8 321
  19 323
   7 324
   3 325
   1 326
   1 327
   2 330
   5 332
   1 333
   5 337
   1 341
   6 342
  20 347
   1 354
   1 355
   3 370
   1 372

It's possible that this is a particularly tough case. Some sort of systematic study is in order ranking troublesome page names by, say, the number of unexpected character codes.
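One way such a ranking could look, as a rough Python sketch. It assumes "unexpected" means any byte with the high bit set other than the 263 separator, which is only a guess at a useful definition. The page names and contents below are invented; real input would be the troublesome files themselves.

```python
SEP = 0xB3  # octal 263, the separator byte seen in the counts above


def unexpected_count(data: bytes) -> int:
    """Count bytes with the high bit set, ignoring the separator."""
    return sum(1 for b in data if b >= 0x80 and b != SEP)


def rank_pages(pages: dict) -> list:
    """Return (name, count) pairs, worst page first, ties broken by name."""
    scored = [(name, unexpected_count(data)) for name, data in pages.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical in-memory sample standing in for files read from disk.
pages = {
    "PageA": b"plain ascii\xb3only",
    "PageB": b"caf\xe9 and \xa3 5",
    "PageC": b"na\xefve",
}
ranking = rank_pages(pages)
for name, count in ranking:
    print(f"{count:6d} {name}")
```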

To aid in such a study I have assembled all troublesome files in one compressed tar file.

http://c2.com/wiki/remodel/trouble.tgz

I would be pleased to see some progress on any substantial number of these files.

@WardCunningham
Owner

Some progress in this pull request: #32

@maxlybbert

I’m sorry I didn’t look at this over the weekend. Even so, you’ve made a lot of progress pretty quickly. Hopefully I’ll be able to do something helpful before you’ve solved the problem.

@WardCunningham
Owner

The thing I had missed was the Chinese spam. Most often it had been reverted, but since I kept a copy of the last version in the same file, the characters there killed my ruby program. The other insight I was missing was that I had line-oriented files and could narrow my problem characters down to one line. There are still plenty of random characters from pre-utf-8 character encodings.
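The line-oriented narrowing can be sketched in a few lines of Python. This is only an illustration with a made-up sample: it reports which lines fail a strict utf-8 decode, and it assumes the 263 separator has already been dealt with, since that byte on its own would also make a line fail.

```python
def bad_lines(data: bytes) -> list:
    """Return 1-based numbers of lines that are not valid utf-8."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")  # strict by default, raises on bad bytes
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical sample: line 2 is Latin-1, line 3 is the same text in utf-8.
sample = b"first line\r\ncaf\xe9 in Latin-1\r\ncaf\xc3\xa9 in UTF-8\r\n"
print(bad_lines(sample))
```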

@maxlybbert

maxlybbert commented Dec 6, 2018

I've played around with Perl's Encode::Guess module, and the early results are promising. I used the following script; most of the non-utf8 portions are in the Windows version of Latin1 (WinLatin1, i.e. cp1252), and most of the exceptions are the Chinese spam:

#!/usr/bin/env perl

# improved Unicode support starting with 5.14
use v5.14;
use warnings;

use constant codepages => qw{WinLatin1 latin1 euc-jp shiftjis 7bit-jis};

use Encode;
use Encode::Guess codepages;
use List::Util 'all';

Encode::Guess->set_suspects(codepages);

binmode STDOUT, ':utf8';

for my $filename (map { glob } @ARGV) {
	my $fh;
	if (!open $fh, '<', $filename) {
		warn "cannot open $filename: $!\n";
		next;
	}
	while (my $line = <$fh>) {
		chomp $line;
		# Even though utf8 is defined so that 7-bit ASCII is valid utf8,
		# utf8::is_utf8 returns false when given a string of just
		# 7-bit ASCII.  So test for 7-bit ASCII separately (and assume
		# it's encoded correctly).
		if (all { ord($_) < 128 } split //, $line) {
			next;
		}
		# It is possible to get a false negative (e.g., Latin1 text
		# which happens to have all characters with values above 127
		# followed by characters with values of 127 or less), but
		# it's very unlikely.
		if (utf8::is_utf8($line)) {
			next;
		}

		my $enc = guess_encoding($line);
		if (!defined $enc) {
			warn "cannot guess encoding for $line\n";
			next;
		}
		if (ref $enc) {
			say "$filename:$. (" . $enc->name . ")\t"
				. $enc->decode($line);
			next;
		}
		for (split /\s+or\s+/, $enc) {
			say "$filename:$. ($_)\t" . Encode::decode($_, $line);
		}
	}
}

I believe it wouldn't be hard to write a script to filter out the spam and correct the encodings (convert to Windows Latin1 by default, but mark specific files that need a different conversion). I'll do that next.
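A minimal sketch of what the "convert to Windows Latin1 by default" step might look like, written in Python for brevity. The function name and the `default` parameter are hypothetical; the latin-1 fallback is an assumption added here because Python's cp1252 codec rejects a handful of undefined bytes.

```python
def to_utf8(line: bytes, default: str = "cp1252") -> bytes:
    """Return the line as utf-8 bytes, re-encoding only when needed."""
    try:
        line.decode("utf-8")
        return line  # already valid utf-8, which includes plain ASCII
    except UnicodeDecodeError:
        pass
    try:
        text = line.decode(default)
    except UnicodeDecodeError:
        # cp1252 leaves a few bytes undefined (0x81, for one);
        # latin-1 maps every byte, so it cannot fail.
        text = line.decode("latin-1")
    return text.encode("utf-8")


print(to_utf8(b"caf\xe9"))
print(to_utf8(b"\x93quoted\x94"))
```

A file that needs a different conversion would pass another codec name as `default`.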

@maxlybbert

Oops. utf8::is_utf8 doesn’t do what I thought. The script should be:

#!/usr/bin/env perl

# improved Unicode support starting with 5.14
use v5.14;
use warnings;

use constant codepages => qw{WinLatin1 latin1 euc-jp shiftjis 7bit-jis};

use Encode;
use Encode::Guess codepages;

Encode::Guess->set_suspects(codepages);

binmode STDOUT, ':utf8';

for my $filename (map { glob } @ARGV) {
	my $fh;
	if (!open $fh, '<', $filename) {
		warn "cannot open $filename: $!\n";
		next;
	}
	while (my $line = <$fh>) {
		chomp $line;
		my $copy = $line;
		if (utf8::decode($copy)) {
			next;
		}

		my $enc = guess_encoding($line);
		if (!defined $enc) {
			warn "cannot guess encoding for $line\n";
			next;
		}
		if (ref $enc) {
			say "$filename:$. (" . $enc->name . ")\t"
				. $enc->decode($line);
			next;
		}
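		# On failure $enc is a message. When it lists candidates they
		# are separated by " or ". Skip anything that is not an
		# encoding name so Encode::decode does not die.
		for (split /\s+or\s+/, $enc) {
			next unless Encode::find_encoding($_);
			say "$filename:$. ($_)\t" . Encode::decode($_, $line);
		}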
	}
}

@WardCunningham
Owner

WardCunningham commented Dec 6, 2018

This is an amazingly helpful script. I thought it might be possible but didn't know enough about encoding to even begin.

I added a substitution for the $SEP character I used in my serializations. I know it won't collide with any other alphabet because I removed them from submitted text on save before I serialize.

my $SEP = "\263";

Can I assume that the result of $enc->decode($line); is utf-8? If so, it seems like I have all of the pieces I need to convert 99% of my files.

Aside: Wikipedia has been helpful explaining each of the encodings suggested by your script.
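To show why the separator deserves separate handling, here is a small Python sketch. It assumes fields are simply split on the 263 byte before decoding; the function name and the sample record are hypothetical. Decoded as Windows Latin1, the separator would otherwise turn into a superscript three.

```python
SEP = b"\xb3"  # the same byte as "\263" in the perl above


def decode_fields(record: bytes, encoding: str = "cp1252") -> list:
    """Split on the separator first, then decode each field on its own."""
    return [field.decode(encoding) for field in record.split(SEP)]


# Hypothetical record with two fields and a trailing separator.
record = b"title\xb3caf\xe9\xb3"
print(decode_fields(record))
print(repr(SEP.decode("cp1252")))  # what the separator becomes if left in
```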

@maxlybbert

maxlybbert commented Dec 6, 2018

I re-read the documentation to be sure about whether $enc->decode($line) always returns utf-8. It does, with the caveat that $enc can be either an object that can convert to utf-8 or an error message. (Strictly, decode returns a Perl character string, and it is the binmode STDOUT, ':utf8' line that writes it out as utf-8.) I got that wrong: I thought it was a list of candidates, which is why I have the split. I really got that wrong, because presumably there is some text before the first encoding name, and I don’t strip it out. But I already have a list of the code pages I asked for, so there’s no need to try to figure out that list from $enc.

As it currently exists, the script has serious problems. But I am glad that it provides a decent starting point for an actual conversion script.

@maxlybbert

I checked what $enc has on error, and it does get an “or”-separated list of candidates. Which is nice, since Encode::Guess figures out the encoding only 127 times, compared to 23,000 times where more than one code page could be right.
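The lopsided numbers make sense once you try the candidates by hand. The Python sketch below is only an illustration: the codec names are rough Python counterparts of the suspects in the perl script, not the same implementations. Latin-1 accepts every possible byte, so any line that Windows Latin1 can decode has at least two plausible answers.

```python
# Rough Python counterparts of the perl suspects (an assumption).
CANDIDATES = ["cp1252", "latin-1", "euc_jp", "shift_jis", "iso2022_jp"]


def plausible_encodings(line: bytes) -> list:
    """Return every candidate that decodes the line without an error."""
    plausible = []
    for name in CANDIDATES:
        try:
            line.decode(name)
        except UnicodeDecodeError:
            continue
        plausible.append(name)
    return plausible


print(plausible_encodings(b"caf\xe9 au lait"))
```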

@maxlybbert

maxlybbert commented Jan 17, 2019

I recently discovered that ICU ( http://icu-project.org ) supports encoding detection, so I wrote a short C++ program that detects the encoding, line by line, and actually performs the conversion. Unfortunately, some encodings that ICU detects aren't properly set up on my computer (e.g., IBM424_rtl), so actually trying those encodings fails when I run my program. Those encodings seem to show up mainly in spam links, so getting them properly decoded may not be such a big issue. It so happens that falling back to reading that text as UTF-8 gives me mojibake, but doesn't throw an error. You may have better luck on a different computer.

GitHub won't allow me to attach a tarball of the processed files. I would be happy to email it to you, or send it some other way. I have attached the C++ program (as a .txt, because GitHub won't accept it with a .cc extension). It's not an efficient program (it uses functions that ICU refers to as inefficient convenience functions), but it runs fast enough for me.
icudet.txt

@maxlybbert

I have some changes I want to make to my C++ program. I think I’m wrong about getting mojibake when I fall back to decoding as UTF-8. Instead, I think I’m getting “invalid conversion” characters.

I won’t be able to fix the program until tonight at the earliest. If you want to make the changes: I plan to ask ICU to give me a list of candidates (instead of just the best candidate) and exhaust those before I fall back to just trying everything, plus I plan to change the check for whether something was successfully decoded.
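The planned control flow, sketched in Python rather than C++ and without ICU, so treat it as an outline only. The function name is hypothetical, the candidate list stands in for whatever the detector returns, and "successfully decoded" is assumed to mean no error and no U+FFFD replacement character in the result.

```python
REPLACEMENT = "\ufffd"  # the usual "invalid conversion" character


def decode_first_clean(line: bytes, candidates: list, fallbacks: list):
    """Return (encoding, text) for the first clean decode, else (None, None)."""
    for name in list(candidates) + list(fallbacks):
        try:
            text = line.decode(name)  # strict mode raises on invalid input
        except (UnicodeDecodeError, LookupError):
            continue  # bad bytes, or a codec not available on this machine
        if REPLACEMENT not in text:
            return name, text
    return None, None


# "no-such-codec" is a made-up name standing in for an encoding that is
# detected but not installed.
print(decode_first_clean(b"caf\xe9", ["utf-8", "no-such-codec"], ["cp1252"]))
```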

@WardCunningham
Owner

Thank you for your continued effort here.

There are often two copies of a page in each file. If the spam-associated encodings are in one version only, that would indicate a preference for the other. This test, "can ruby read it?", was my first discrimination between copies and seemed to handle a lot of cases. This might be asking a lot of your program unless it is already unpacking the parts.
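A rough Python sketch of that preference, assuming the two copies have already been unpacked into separate byte strings (how they are stored in the file is not shown here, so that part is left out). The names and the tie-breaking rule are hypothetical.

```python
def suspect_bytes(copy: bytes, sep: int = 0xB3) -> int:
    """Count high-bit bytes other than the separator."""
    return sum(1 for b in copy if b >= 0x80 and b != sep)


def preferred_copy(current: bytes, previous: bytes) -> bytes:
    """Pick the copy with fewer suspect bytes; ties keep the current copy."""
    if suspect_bytes(previous) < suspect_bytes(current):
        return previous
    return current


# Hypothetical pair: the current copy carries spam, the previous one is clean.
spammed = b"good text \xc4\xe3\xba\xc3 spam link"
clean = b"good text"
print(preferred_copy(spammed, clean))
```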

@maxlybbert

I’m currently only going line-by-line. I don’t think it would be hard to process just the de-spammed portion of each file, though.

@marnen

marnen commented Nov 13, 2019

I would very much like to help get the remaining wiki pages operational, but the tarball is hosted on c2.com, which now seems to be down. Can we make a GitHub repo with the remaining page content, and use the pull request workflow to facilitate the cleanup?
