Status Checkup #24
Most of my attention goes into my new wiki, federated wiki, which is well positioned to be historically cool in a few decades. See http://about.fed.wiki. Still, thanks for asking. I continue to pay co-location fees because I don't want to lose the few hundred pages that I wasn't able to recover, mostly due to mixed character encoding problems. I'm not looking for advice on how to become a better programmer. But I would appreciate some help. I could put together some tar files with troublesome pages and various backups. If you or anyone else had a good approach for converting these to utf-8, I'd love to see this work done.
What encoding are the troublesome files in?
So, is http://wiki.c2.com/ going to be permanently frozen, or are there plans to open it up again?
I read through the explanations of the Federated wiki, but it's pretty dense, and I don't fully understand its purpose.
The troublesome files are in mixed encoding, having been edited by a variety of browsers at a time when utf-8 was uncommon. Federated wiki offers an alternative (and editable) view into historic wiki pages. This javascript is more faithful to the original perl code. Wiki is more a medium, like paper, than a tool with a purpose, like a stapler. Federated wiki is a medium for doing work as well as talking about doing work. I have had trouble devoting energy here in the past, but I would be glad to work on it with others.
If you’re still interested in fixing the “troublesome files,” it sounds like an interesting problem. I’m not aware of any existing tool to autodetect the encoding of part of a file, but I’m optimistic. Would you have time to get a few samples?
I will prepare a sampling of troublesome pages and post a link to a tar file here. This repo has the ruby and c programs I used to convert most pages to json. The c program, json.c => a.out, converts troublesome characters to something that can be recognized by the ruby program, json.rb. The one character I had to convert to get anything working was the ASCII GS (group separator) character that I had used in my original perl code to separate groups. I suspect a large number of troublesome files can be handled by adding more cases to json.c. But what are the cases? That is the question that slowed down my conversion. Here I resort again to perl to count byte occurrences in a known good file.
Where for this file I get these counts:
Below 040 are ASCII control codes. Here we see TAB, LF and CR.
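For anyone who wants to reproduce that byte census without the Perl one-liner, a rough equivalent in C++ (an illustration only, not the script used above) tallies every byte value and prints the non-zero counts in octal:

```cpp
// bytecount.cc -- rough C++ stand-in for the byte census described above.
// Prints each byte value that occurs (in octal) together with its count.
#include <cstdio>
#include <fstream>
#include <iostream>

int main(int argc, char **argv) {
    if (argc != 2) { std::cerr << "usage: bytecount FILE\n"; return 1; }
    std::ifstream in(argv[1], std::ios::binary);
    long counts[256] = {0};
    char c;
    while (in.get(c)) ++counts[static_cast<unsigned char>(c)];
    for (int b = 0; b < 256; ++b)
        if (counts[b]) std::printf("%03o %ld\n", b, counts[b]);
    return 0;
}
```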
I've put together a list of troublesome pages. http://c2.com/wiki/remodel/trouble.txt I'm also serving these in their original (not html) format which would feed into the json.c and json.rb scripts in this repo. I've picked one largish example from this list for character distribution analysis. http://wiki.c2.com/?XpAsTroubleDetector Here I run the perl script from above on this using the original format file as input:
Where for this input I get these counts:
It's possible that this is a particularly tough case. Some sort of systematic study is in order, ranking troublesome page names by, say, the number of unexpected character codes. To aid in such a study I have assembled all troublesome files in one compressed tar file. http://c2.com/wiki/remodel/trouble.tgz I would be pleased to see some progress on any substantial number of these files.
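One possible shape for that ranking, assuming "unexpected" means anything outside printable ASCII plus TAB, LF, CR, and the GS separator (035), is a small scanner that counts such bytes per file:

```cpp
// rank.cc -- sketch of ranking pages by their count of unexpected bytes.
// "Expected" here means printable ASCII plus TAB, LF, CR, and GS (035);
// that definition is an assumption, not something fixed in this thread.
#include <fstream>
#include <iostream>

int main(int argc, char **argv) {
    for (int i = 1; i < argc; ++i) {
        std::ifstream in(argv[i], std::ios::binary);
        long odd = 0;
        char c;
        while (in.get(c)) {
            unsigned char b = static_cast<unsigned char>(c);
            bool expected = (b >= 0x20 && b < 0x7f) ||
                            b == '\t' || b == '\n' || b == '\r' || b == 0x1d;
            if (!expected) ++odd;
        }
        std::cout << odd << "\t" << argv[i] << "\n";
    }
    return 0;
}
```

Piping the output through `sort -rn` would put the worst pages first.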
Some progress in this pull request: #32
I’m sorry I didn’t look at this over the weekend. Even so, you’ve made a lot of progress pretty quickly. Hopefully I’ll be able to do something helpful before you’ve solved the problem.
The thing I had missed was the Chinese spam. Most often it had been reverted, but since I kept a copy of the last version in the same file, the characters there killed my ruby program. The other insight I was missing was that I had line-oriented files and could narrow my problem characters down to one line. Still plenty of random characters from pre-utf-8 character encodings.
I've played around with Perl's
I believe it wouldn't be hard to write a script to filter out the spam and correct the encodings (convert to Windows Latin1 by default, but mark specific files that need a different conversion). I'll do that next. |
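A minimal sketch of the conversion half of that plan (the spam filtering and the per-file exceptions are left out, and the real script may look nothing like this): keep lines that already parse as UTF-8 and re-encode everything else from Windows-1252.

```cpp
// fix1252.cc -- sketch only: pass through lines that already look like UTF-8,
// re-encode everything else from Windows-1252 ("Windows Latin 1") to UTF-8.
#include <iostream>
#include <string>

// Loose well-formedness check (does not reject overlong forms); enough to
// separate "already UTF-8" lines from single-byte-encoded ones.
static bool looks_like_utf8(const std::string &s) {
    size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        int extra;
        if (b < 0x80) extra = 0;
        else if ((b & 0xE0) == 0xC0) extra = 1;
        else if ((b & 0xF0) == 0xE0) extra = 2;
        else if ((b & 0xF8) == 0xF0) extra = 3;
        else return false;
        if (i + extra >= s.size()) return false;
        for (int k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        i += extra + 1;
    }
    return true;
}

// Windows-1252 maps bytes 0x80-0x9F to these code points; every other byte
// matches ISO-8859-1 (the byte value is the code point).
static const unsigned short cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};

static std::string cp1252_to_utf8(const std::string &s) {
    std::string out;
    for (unsigned char b : s) {
        unsigned int cp = (b >= 0x80 && b <= 0x9F) ? cp1252_high[b - 0x80] : b;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                      // all CP1252 targets fit in three bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

int main() {
    std::string line;
    while (std::getline(std::cin, line))
        std::cout << (looks_like_utf8(line) ? line : cp1252_to_utf8(line)) << "\n";
    return 0;
}
```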
Oops.
This is an amazingly helpful script. I thought it might be possible but didn't know enough about encoding to even begin. I added a substitution for the $SEP character I used in my serializations. I know it won't collide with any other alphabet because I removed them from submitted text on save before I serialize.
Can I assume that the result of
Aside: Wikipedia has been helpful explaining each of the encodings suggested by your script.
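One way that $SEP substitution might be sketched, with a made-up placeholder that cannot occur in page text (this is only an illustration of the idea, not the change actually made to the script):

```cpp
// Sketch: protect the GS separator byte (035) across an encoding conversion
// by swapping it for a placeholder string, then restoring it afterwards.
// The placeholder "{{GS}}" is a hypothetical choice for this example.
#include <string>

static void replace_all(std::string &s, const std::string &from, const std::string &to) {
    for (size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size())
        s.replace(pos, from.size(), to);
}

// Before conversion:  replace_all(text, "\x1d", "{{GS}}");
// After conversion:   replace_all(text, "{{GS}}", "\x1d");
```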
I re-read the documentation to be sure about whether
As it currently exists, the script has serious problems. But I am glad that it provides a decent starting point for an actual conversion script.
I checked what
I recently discovered that ICU ( http://icu-project.org ) supports encoding detection, so I wrote a short C++ program that detects the encoding, line-by-line, and actually performs the conversion. Unfortunately, some encodings that ICU detects aren't properly set up on my computer (e.g., IBM424_rtl and IBM424_ltr), so actually trying those encodings fails when I run my program. Those encodings seem to show up mainly in spam links, so getting them properly decoded may not be such a big issue. It so happens that falling back to reading that text as UTF-8 gives me mojibake, but doesn't throw an error. You may have better luck on a different computer. GitHub won't allow me to attach a tarball of the processed files. I would be happy to email it to you, or send it some other way. I have attached the C++ program (as a .txt, because GitHub won't accept it with a .cc extension). It's not an efficient program (it uses functions that ICU refers to as inefficient convenience functions), but it runs fast enough for me.
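The attached program itself isn't shown here, but a minimal sketch of the same line-by-line approach, using ICU's charset detector and the UnicodeString convenience constructor it flags as inefficient, might look like this (linking against -licuuc and -licui18n):

```cpp
// fixenc.cc -- minimal sketch of line-by-line detection and conversion with
// ICU; not the program attached above.  Build roughly with:
//   g++ fixenc.cc -licuuc -licui18n
#include <unicode/ucsdet.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector *det = ucsdet_open(&status);
    if (U_FAILURE(status)) { std::cerr << "ucsdet_open failed\n"; return 1; }

    std::string line;
    while (std::getline(std::cin, line)) {
        status = U_ZERO_ERROR;
        ucsdet_setText(det, line.data(), static_cast<int32_t>(line.size()), &status);
        const UCharsetMatch *match = ucsdet_detect(det, &status);
        const char *name = (U_SUCCESS(status) && match != nullptr)
                               ? ucsdet_getName(match, &status)
                               : "UTF-8";   // fall back to reading as UTF-8

        // The "inefficient convenience" route: build a UnicodeString from the
        // detected codepage, then write it back out as UTF-8.
        icu::UnicodeString u(line.c_str(), static_cast<int32_t>(line.size()), name);
        std::string utf8;
        u.toUTF8String(utf8);
        std::cout << utf8 << "\n";
    }
    ucsdet_close(det);
    return 0;
}
```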
I have some changes I want to make to my C++ program. I think I’m wrong about getting mojibake when I fall back to encoding by UTF-8. Instead, I think I’m getting “invalid conversion” characters. I won’t be able to fix the program until tonight at the earliest. If you want to make the changes: I plan to ask ICU to give me a list of candidates (instead of just the best candidate) and exhaust those before I fall back to just trying everything, plus I plan to change the check for whether something was successfully decoded.
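That candidate-list plan might take a shape like the following helper, again only a sketch, with "produces neither a bogus string nor U+FFFD substitution characters" standing in for whatever the real success check becomes:

```cpp
// Sketch of the "exhaust the candidates" variant: ask ICU for every candidate
// match and take the first conversion that is neither bogus nor full of
// U+FFFD substitution characters.  The success test is an assumption.
#include <unicode/ucsdet.h>
#include <unicode/unistr.h>
#include <string>

std::string convert_best(UCharsetDetector *det, const std::string &line) {
    UErrorCode status = U_ZERO_ERROR;
    ucsdet_setText(det, line.data(), static_cast<int32_t>(line.size()), &status);
    int32_t found = 0;
    const UCharsetMatch **matches = ucsdet_detectAll(det, &found, &status);
    for (int32_t i = 0; U_SUCCESS(status) && i < found; ++i) {
        const char *name = ucsdet_getName(matches[i], &status);
        icu::UnicodeString u(line.c_str(), static_cast<int32_t>(line.size()), name);
        if (!u.isBogus() && u.indexOf(static_cast<UChar32>(0xFFFD)) < 0) {
            std::string utf8;
            return u.toUTF8String(utf8);   // first clean conversion wins
        }
    }
    return line;   // last resort: pass the bytes through unchanged
}
```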
Thank you for your continued effort here. There are often two copies of a page in each file. If the spam-associated encodings are in one version only, that would indicate a preference for the other. This test, whether ruby can read it, was my first way of discriminating between copies, and it seemed to handle a lot of cases. This might be asking a lot of your program unless it is already unpacking the parts.
I’m currently only going line-by-line. I don’t think it would be hard to process just the de-spammed portion of each file, though.
I would very much like to help get the remaining wiki pages operational, but the tarball is hosted on c2.com, which now seems to be down. Can we make a GitHub repo with the remaining page content, and use the pull request workflow to facilitate the cleanup?
This project is sooo historically cool. I would love to know the status of the project. I haven't seen any activity for several months. I'm also willing to contribute if you need more manpower.