filter_the_hose
The text is encoded as utf-8, processed to normalise case, punctuation, and whitespace, then broken into byte trigrams. For example:
'I am Pat!' => '<i>', '<am', 'am>', '<pa', 'pat', 'at>'
The < and > represent word boundaries (different treatments can be chosen via the --trigram-mode argument, but they don't make much difference). The trigrams will break up long utf-8 character sequences so, for example, most trigrams of Chinese will be utterly meaningless. That is OK.
Numerals are discarded, unless they form part of a word. Most punctuation goes, though notably not the apostrophe, which is a useful marker of English. Words beginning with '@', '#', or 'http' are ignored, as is the token 'RT'.
Four or more repetitions of one or two character sequences (as in, respectively, "LOOOOOOOL!" and "hahahahahaha") are condensed to three repetitions ("LOOOL!", "hahaha") before the trigrams are calculated.
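As a rough sketch of the preprocessing and trigram extraction just described (the names here are invented; the real tokenising rules live in hashmapd.trigram and are fuller, e.g. around digits inside words):

```python
import re

def condense_repeats(text):
    # Four or more repeats of a one- or two-character sequence
    # become three ("LOOOOOOOL" -> "LOOOL", "hahahahahaha" -> "hahaha").
    return re.sub(r"(.{1,2}?)\1{3,}", r"\1\1\1", text)

def trigrams(text):
    grams = []
    for token in text.split():
        # @mentions, #hashtags, URLs, and the token 'RT' are ignored.
        if token.startswith(('@', '#', 'http')) or token == 'RT':
            continue
        token = condense_repeats(token.lower())
        # Keep word characters and apostrophes; punctuation goes,
        # and so do bare numerals.
        for word in re.findall(r"[\w']+", token):
            if word.isdigit():
                continue
            # < and > mark word boundaries; trigrams are taken over
            # the utf-8 bytes, so multi-byte characters get split up.
            padded = ('<' + word + '>').encode('utf-8')
            grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

print(trigrams('I am Pat!'))
# [b'<i>', b'<am', b'am>', b'<pa', b'pat', b'at>']
```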
The log odds for each trigram are:
log( P(trigram | English) / P(trigram | Non-English) )
where P(trigram | English) is:
(count + offset) / (total + offset * number of possible trigrams)
The offset gives unseen trigrams a chance. The result can be seen as the weight of evidence in favour of English, measured in bits, and the score for each tweet is the mean of these values over all its trigrams. It is fast because it ends up being just a dictionary look-up per trigram.
The non-English hypothesis uses the same formula. It is based on a blend of Spanish, Portuguese, German, Dutch, and Indonesian with a higher offset giving unseen trigrams a higher probability than the English model does. Thus trigrams that neither model knows (and which are probably non-ascii) will confirm non-English more than English.
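Put together, the scoring can be sketched like this (the names are mine, not the actual hashmapd API; trigrams() is from the sketch above, and the counts are assumed to be dicts mapping byte trigrams to frequencies):

```python
from math import log2

POSSIBLE = 256 ** 3   # the "possible" trigram count discussed below

def smoothed(counts, offset_factor):
    # P(trigram | model) = (count + offset) / (total + offset * POSSIBLE),
    # where offset = offset_factor * mean count over all possible trigrams.
    total = sum(counts.values())
    offset = offset_factor * total / POSSIBLE
    denom = total + offset * POSSIBLE
    return lambda trigram: (counts.get(trigram, 0) + offset) / denom

def make_scorer(en_counts, anti_counts,
                offset_factor=0.5, anti_offset_factor=1.0):
    p_en = smoothed(en_counts, offset_factor)
    p_anti = smoothed(anti_counts, anti_offset_factor)
    # Precompute a plain dict so scoring really is one look-up per
    # trigram; everything unseen by both models shares one default,
    # which is negative because the anti model has the higher offset.
    table = {t: log2(p_en(t) / p_anti(t))
             for t in set(en_counts) | set(anti_counts)}
    default = log2(p_en(b'\xff\xff\xff') / p_anti(b'\xff\xff\xff'))
    def score(tweet):
        grams = trigrams(tweet)
        if not grams:
            return 0.0
        # Mean log odds, in bits of evidence for English.
        return sum(table.get(t, default) for t in grams) / len(grams)
    return score
```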
The "number of possible trigrams" used is 256 ** 3, not the actual
number of possible 3 byte extracts from a valid UTF-8 string, which is
quite a bit lower. This affects the normalisation of the probabilities, and thus dents the claim that the scores represent "bits". But I don't think it affects the score as a ranking.
(Just to explain this a bit: in utf-8, 11111110 and 11111111 and some other byte values can never occur, and a 11xxxxxx byte has to be followed by a number of 10xxxxxx bytes (how many depends on the exact nature of the first byte's xxxxxx). So some of the 16M trigrams are impossible, but counting those is non-trivial, not least because a byte trigram can start in the middle of a utf-8 sequence or drop its end. A closer upper bound would be 243 ** 3 ~= 14M, because there are 13 byte values that should never occur. But it really doesn't matter that much.)
For modelling English it doesn't matter whether the trigrams are valid utf-8 or not.
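If you want to check the 13 and the 14M (this is a fact about UTF-8 itself, not about the code):

```python
# Byte values that can never occur in valid UTF-8 (RFC 3629):
# 0xC0 and 0xC1 could only start over-long encodings, and
# 0xF5-0xFF would lead past U+10FFFF or are never valid at all.
impossible = [0xC0, 0xC1] + list(range(0xF5, 0x100))
assert len(impossible) == 13
print((256 - len(impossible)) ** 3)   # 14348907 = 243 ** 3 ~= 14M
```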
The trigram representation is a legacy of the cosine technique. Though it works well enough, and efficiently, the better thing to do would be to make a series of predictions and compare the results. The trigram model doesn't quite do this (well, it sort of does). Here's an example:
The states "piz", "izz", "zza", and "za>" are rare in English, but after getting into the "piz" state, that is the correct recovery sequence. The current model punishes every step of the way, while a better model might say "bad start, but you're on the right track". A complicating factor is that "pizza" is a word in just about every language, so it might be a net predictor against English, even when broken down into its un-English trigrams. What I suppose it boils down to is a need for more/adaptive state.
The proper way to do things would be to have models for all the various languages, rather than one big non-English glob.
These models are based on text corpora, from which the frequency of trigrams is found.
| corpus | description |
|--------|-------------|
| carroll-alice.txt | Project Gutenberg book |
| dasher_training_english_GB.txt | from the Dasher project |
| bash-org.txt | puerile irc/IM quotes |
| english-web.txt | newish translation of Genesis |
| irc.txt.gz | IRC logs (mostly technical channels) |
| enron-sent.txt.gz | sent mail from Enron employees |
| lulz.txt | LULZ-Sec press releases and similar items |
| presidents.txt | recent-ish American presidential speeches |
| wikipedia.txt | Wikipedia articles and discussions |
| barely-english.txt | English tweets mistaken for non-English |
The Enron corpus (the entire email archive of Enron, which some academic managed to acquire at the end of the court cases) has gigabytes of 2000-ish email text, mixing dot-com business talk with email chain jokes and little notes about kids and weekends. But no "abt 2 eat sumthn n den sleep! LMAO".
The wikipedia corpus is based on articles with long talk pages, on the assumption that urgently discussed talk pages would better approximate the twitter register.
The irc logs are mostly programming channels like #gstreamer. The lulz texts were harvested from Pastebin.
| corpus | description |
|--------|-------------|
| anti-english.txt | *.wikipedia.org articles |
| near-english.txt | tweets mistaken for English |
The anti-english corpus is based on the Spanish, Portuguese, German, Dutch, French and Indonesian wikipedias, while near-english is based on non-English tweets that the models were not able to easily decide upon.
./filter_the_hose.py -T -d <somewhere.txt> [ -i <stash-file> ]
will give you a list of tweets ordered by score. If you grep for words common in only one language, the rest of each matching line is likely to belong to that language too. For example, to find Portuguese tweets, you could start with this:
for word in mundo das país; do
    grep -iw "$word" src.txt >> tmp.txt
done
sort -u tmp.txt > pt.txt
Then you need to look through pt.txt and possibly delete some lines. In practice you would probably use more terms, perhaps from phrases cut from pt.wikipedia.org.
The script is filter_the_hose.py in the git root directory. Without the -T switch it writes files with lines like this:
<score> <space> <screen name>
With -T (aka --trial), it writes the message instead of the screen name, which helps with diagnosis and choosing thresholds. -T uses just one drink-the-hose gzip file by default, while without it all files are processed.
With the -Q <filename> argument, where <filename> refers to previous filter_the_hose output, the users named therein are queued up for twextract-ion.
Try this:
./filter_the_hose.py --help
output the trial, sorted by score (useful for finding a threshold):
./filter_the_hose.py -T -d /tmp/ordered.txt
trial partition using default threshold:
./filter_the_hose.py -T -g /tmp/good.txt -b /tmp/bad.txt
trial filter using the score of 'raspberry' (c. 0.74) as threshold:
./filter_the_hose.py -T -g /tmp/good.txt -t 'raspberry'
partition with threshold of 0.5:
./filter_the_hose.py -T -g /tmp/good.txt -b /tmp/bad.txt -t 0.5
trial using a different source file:
./filter_the_hose.py -T -i stash/drink-the-hose-2011051111.txt.gz -d /tmp/ordered.txt
output the user names, not tweets, for one file, using threshold 0.5:
./filter_the_hose.py -i stash/drink-the-hose-2011051111.txt.gz -g /tmp/good.txt -b /tmp/bad.txt -t 0.5
output the user names for the entire stash:
./filter_the_hose.py -g /tmp/good.txt
as above, but also queue the users:
./filter_the_hose.py -g /tmp/good.txt -q
queue users from a previously output file listing users:
./filter_the_hose.py -Q /tmp/good.txt
There are basically 3 numbers to tweak -- only one of which I truly understand. And there are 4 trigrammising modes to choose from.
--offset-factor=FLOAT [0.5] --anti-offset-factor=FLOAT [1.0]
These determine the uniform offset added to all trigram counts, expressed as a multiple of the mean count (over all 16M "possible" trigrams, not the observed number of unique trigrams). The mean is about 5 in the English model, so the factor of 0.5 makes trigrams seen only once barely more probable than those never seen. Which is actually right.
The main thing here is that trigrams that fit into neither model (e.g. non-latin scripts) will float to the one that has the higher factor. If the factors are close together, the score range of the foreign tweets expands.
If the offset_factor is too low, strange trigrams are so surprising to the English model that one or two characters can ruin a tweet's score.
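To make the once-versus-never point concrete, using the rough figures above (mean count about 5, offset factor 0.5):

```python
offset = 0.5 * 5           # offset_factor * mean count, roughly
seen_once = 1 + offset     # smoothed numerator for a trigram seen once
never_seen = 0 + offset    # smoothed numerator for an unseen trigram
print(seen_once / never_seen)   # 1.4: "only barely more probable"
```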
--threshold=(STRING|FLOAT) ["LOL"]
Tweets with scores below this are probably not English, and with scores above are probably English. Its unit is bits per trigram. That might make you think that long messages are more certain, but in fact they're not. Long messages with borderline scores are often bilingual, and it would take tricky semantic analysis to decide which language was dominant.
The threshold can also be a string, in which case its score is evaluated, and that becomes the threshold. This could be useful for keeping the threshold steady while tweaking other things. For example, "LOL" has the score 0.435, so setting the threshold to "LOL" is the same as setting it to 0.435, but if the algorithm or parameters are changed in any way, the threshold adjusts accordingly.
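Presumably the string form just resolves like this, where score() is a tweet scorer like the one sketched earlier (a guess at the mechanism, not the actual code):

```python
def resolve_threshold(threshold, score):
    # A numeric string is used as-is; anything else is scored
    # like a tweet and that score becomes the threshold.
    try:
        return float(threshold)
    except ValueError:
        return score(threshold)

resolve_threshold('0.5', score)   # -> 0.5
resolve_threshold('LOL', score)   # -> ~0.435, and it tracks the model
```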
--trigram-mode=MODE [word_aware_lc]
This decides how the trigrams are calculated.
| mode | description |
|------|-------------|
| lowercase | converts to lowercase, normalises whitespace. |
| lowercase_depunctuated | lowercase, with most punctuation removed. |
| word_aware | calculates trigrams separately for each word, and includes upper case. |
| word_aware_lc | like word_aware, but converting to lowercase first. |
See the hashmapd.trigram documentation for more detail. The only real contenders are "lowercase_depunctuated" and "word_aware_lc".
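For a rough feel of how the two contenders differ, here is my reading of the table applied to the earlier example (illustrative only; not verified against hashmapd.trigram):

```python
# 'I am Pat!' under the two main modes:
#
# word_aware_lc:          '<i>', '<am', 'am>', '<pa', 'pat', 'at>'
#
# lowercase_depunctuated: 'i a', ' am', 'am ', 'm p', ' pa', 'pat'
#                         (trigrams run straight across the spaces
#                         instead of stopping at word boundaries)
```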