
Filtering the Hose

Measuring the probability that a tweet is in English

The text is encoded as utf-8, processed to normalise case, punctuation, and whitespace, then broken into byte trigrams. For example:

'I am  Pat!' => '<i>', '<am', 'am>', '<pa', 'pat', 'at>'

The < and > represent word boundaries (different treatments can be chosen via the --trigram-mode argument, but they don't make much difference). The trigrams will break up long utf-8 character sequences so, for example, most trigrams of Chinese will be utterly meaningless. That is OK.

Numerals are discarded, unless they form part of a word. Most punctuation goes, though notably not the apostrophe, which is a useful marker of English. Words beginning with '@', '#', or 'http' are ignored, and so is the token 'RT'.

Four or more repetitions of one or two character sequences (as in, respectively, "LOOOOOOOL!" and "hahahahahaha") are condensed to three repetitions ("LOOOL!", "hahaha") before the trigrams are calculated.
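A rough sketch of that pipeline in Python may make it concrete (the word handling and regexes here are illustrative guesses, not the actual hashmapd code):

import re

def clean(text):
    # crude approximation of the normalisation described above
    words = []
    for w in text.split():
        if w.startswith(('@', '#', 'http')) or w == 'RT':
            continue                           # mentions, hashtags, urls, RT go
        w = w.strip('.,!?"()[]:;').lower()     # most punctuation, but not '
        if not w or w.isdigit():               # standalone numerals go too
            continue
        words.append(w)
    text = ' '.join(words)
    text = re.sub(r'(.)\1{3,}', r'\1\1\1', text)    # "loooooool"    -> "loool"
    text = re.sub(r'(..)\1{3,}', r'\1\1\1', text)   # "hahahahahaha" -> "hahaha"
    return text

def trigrams(text):
    # word-aware byte trigrams, with < and > marking word boundaries
    for word in text.split():
        padded = ('<%s>' % word).encode('utf-8')
        for i in range(len(padded) - 2):
            yield padded[i:i + 3]

print(list(trigrams(clean('I am  Pat!'))))
# [b'<i>', b'<am', b'am>', b'<pa', b'pat', b'at>']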

The log odds for each trigram are:

log( P(trigram | English) / P(trigram | Non-English) )

where P(trigram | English) is:

(count + offset) / (total + offset * number of possible trigrams)

The offset gives unseen trigrams a chance. The result can be seen as the weight of evidence in favour of English, measured in bits, and the score for each tweet is the mean of these values over all its trigrams. It works fast because it ends up being just a dictionary look-up per trigram.

The non-English hypothesis uses the same formula. It is based on a blend of Spanish, Portuguese, German, Dutch, and Indonesian with a higher offset giving unseen trigrams a higher probability than the English model does. Thus trigrams that neither model knows (and which are probably non-ascii) will confirm non-English more than English.
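A minimal sketch of how that fits together, assuming the trigram counts for each model have already been loaded into dicts (names and structure are illustrative, not the actual hashmapd code; the real program precomputes the table so that scoring really is one look-up per trigram):

from math import log2

N_POSSIBLE = 256 ** 3          # "possible" trigrams; see the notes below

def model_probabilities(counts, offset):
    # P(trigram | model) = (count + offset) / (total + offset * N_POSSIBLE)
    total = sum(counts.values())
    denom = total + offset * N_POSSIBLE
    probs = {t: (c + offset) / denom for t, c in counts.items()}
    floor = offset / denom     # probability given to a never-seen trigram
    return probs, floor

def build_log_odds(en_counts, en_offset, anti_counts, anti_offset):
    p_en, en_floor = model_probabilities(en_counts, en_offset)
    p_anti, anti_floor = model_probabilities(anti_counts, anti_offset)
    log_odds = {t: log2(p_en.get(t, en_floor) / p_anti.get(t, anti_floor))
                for t in set(p_en) | set(p_anti)}
    # trigrams unknown to both models get this constant; with the anti
    # model's bigger offset it is negative, i.e. evidence against English
    return log_odds, log2(en_floor / anti_floor)

def score(tweet_trigrams, log_odds, default):
    # mean log odds over the tweet's trigrams, in bits per trigram
    grams = list(tweet_trigrams)
    return sum(log_odds.get(t, default) for t in grams) / len(grams)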

Notes on the model

The "number of possible trigrams" used is 256 ** 3, not the actual number of possible 3 byte extracts from a valid UTF-8 string, which is quite a bit lower. This affects the normalisation of the probabilities, and thus dents the claim that the scores represent "bits". But I don't think it affects the score as a ranking.

(Just to explain this a bit: in utf-8 the bytes 11111110 and 11111111, among some others, can never occur, and an 11xxxxxx byte has to be followed by some number of 10xxxxxx bytes, how many depending on the exact nature of the first byte's xxxxxx. So some of the 16M trigrams are impossible, but counting them is non-trivial, not least because a byte trigram can start in the middle of a utf-8 sequence or drop its end. A closer upper bound would be 243 ** 3 ~= 14M, because there are 13 byte values that should never occur. But it really doesn't matter that much.)

For modelling English it doesn't matter whether the trigrams are valid utf-8 or not.
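A quick back-of-envelope check of those numbers (this is only arithmetic, not anything the code does):

invalid = {0xC0, 0xC1} | set(range(0xF5, 0x100))   # bytes that never occur in utf-8
print(len(invalid))      # 13
print(243 ** 3)          # 14348907 -- the looser upper bound mentioned above
print(256 ** 3)          # 16777216 -- what the model actually uses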

The trigram representation is a legacy of the cosine technique. Though it works well and efficiently enough, the better thing to do would be to make a series of predictions and compare the results. The trigram model doesn't quite do this (well, it does sort of). Here's an example:

The states "piz", "izz", "zza", and "za>" are rare in English, but after getting into the "piz" state, that is the correct recovery sequence. The current model punishes every step of the way, while a better model might say "bad start, but you're on the right track". A complicating factor is that "pizza" is a word in just about every language, so it might be a net predictor against English, even when broken down into its un-English trigrams. What I suppose it boils down to is a need for more/adaptive state.

The proper way to do things would be to have models for all the various languages, rather than one big non-English glob.

Corpora

These models are based on text corpora, from which the frequency of trigrams is found.

English corpora

carroll-alice.txt          Project Gutenberg book
dasher_training_english_GB.txt  from Dasher project
bash-org.txt               puerile irc/IM quotes
english-web.txt            Newish translation of Genesis
irc.txt.gz                 IRC logs (mostly technical channels)
enron-sent.txt.gz          Sent mail from Enron employees.
lulz.txt                   LULZ-Sec press releases and similar items
presidents.txt             Recent-ish American presidential speeches
wikipedia.txt              Wikipedia articles and discussions
barely-english.txt         English tweets mistaken for non-English

The Enron corpus (the entire email archive of Enron, which some academic managed to acquire at the end of the court cases) has gigabytes of 2000-ish email text, mixing dot-com business talk with email chain jokes and little notes about kids and weekends. But no "abt 2 eat sumthn n den sleep! LMAO".

The wikipedia corpus is based on articles with long talk pages, on the assumption that urgently discussed talk pages would better approximate the twitter register.

The irc logs are mostly programming channels like #gstreamer. The lulz texts were harvested from Pastebin.

Non-English corpora

anti-english.txt        *.wikipedia.org articles
near-english.txt        tweets mistaken for English

The anti-english corpus is based on the Spanish, Portuguese, German, Dutch, French and Indonesian wikipedias, while near-english is based on non-English tweets that the models were not able to easily decide upon.

Extracting borderline tweets from a dump file

./filter_the_hose.py -T -d <somewhere.txt> [ -i <stash-file> ]

will give you a list of tweets ordered by score. If you grep for words distinctive of one language, the rest of the text on each matching line is likely to belong to that language too. For example, to find Portuguese tweets, you could start with this:

for word in mundo das país; do
    grep -iw $word src.txt >> tmp.txt
done
sort -u tmp.txt > pt.txt

Then you need to look through pt.txt and possibly delete some lines. In practice you would probably use more terms, perhaps from phrases cut from pt.wikipedia.org.

filter_the_hose.py usage

filter_the_hose.py is in the git root directory. Without the -T switch it writes files with lines like this:

<score> <space> <screen name>

With -T (aka --trial), it writes the message instead of the screen name, which helps with diagnosis and choosing thresholds. -T uses just one drink-the-hose gzip file by default, while without it all files are processed.

With the -Q <filename> argument, where <filename> refers to previous filter_the_hose output, the users named therein are queued up for twextract-ion.

Try this:

./filter_the_hose.py --help

output the trial, sorted by score (useful for finding a threshold):

./filter_the_hose.py -T -d /tmp/ordered.txt

trial partition using default threshold:

./filter_the_hose.py -T -g /tmp/good.txt -b /tmp/bad.txt

trial filter using the score of 'raspberry' (c. 0.74) as threshold:

./filter_the_hose.py -T -g /tmp/good.txt -t 'raspberry'

partition with threshold of 0.5:

./filter_the_hose.py -T -g /tmp/good.txt -b /tmp/bad.txt  -t 0.5

trial using a different source file:

./filter_the_hose.py -T -i stash/drink-the-hose-2011051111.txt.gz -d /tmp/ordered.txt

output the user names, not tweets, for one file, using threshold 0.5:

./filter_the_hose.py -i stash/drink-the-hose-2011051111.txt.gz -g /tmp/good.txt -b /tmp/bad.txt  -t 0.5

output the user names for the entire stash:

./filter_the_hose.py -g /tmp/good.txt

as above, but also queue the users:

./filter_the_hose.py -g /tmp/good.txt -q

queue users from a previously output file listing users:

./filter_the_hose.py -Q /tmp/good.txt

Tunables

There are basically 3 numbers to tweak -- only one of which I truly understand. And there are 4 trigrammising modes to choose from.

Offset factors

--offset-factor=FLOAT [0.5]
--anti-offset-factor=FLOAT [1.0]

These determine the uniform offset added to all trigram counts, expressed as a multiple of the mean count (over all 16M "possible" trigrams, not the observed number of unique trigrams). The mean is about 5 in the English model, so the factor of 0.5 makes trigrams seen only once barely more probable than those never seen. Which is actually right.

The main thing here is that trigrams that fit into neither model (i.e. non-latin scripts) will float to the model that has the higher factor. If the factors are close together, the range of scores for the foreign tweets expands.

If the offset factor is too low, then strange trigrams are so surprising to the English model that one or two stray characters can ruin a tweet's score.
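A tiny worked example of this, using the ballpark numbers mentioned above (nothing here is read from a real model):

total = 5 * 256 ** 3             # mean count of about 5 over the 16M trigrams
offset = 0.5 * 5                 # --offset-factor=0.5 times that mean
denom = total + offset * 256 ** 3
seen_once = (1 + offset) / denom
never_seen = (0 + offset) / denom
print(seen_once / never_seen)    # 1.4 -- seen-once is only barely more probable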

Threshold

--threshold=(STRING|FLOAT) ["LOL"]

Tweets with scores below this are probably not English; tweets with scores above it probably are. Its unit is bits per trigram. That might make you think that long messages are more certain, but in fact they're not. Long messages with borderline scores are often bilingual, and it would take tricky semantic analysis to decide which language was dominant.

The threshold can also be a string, in which case its score is evaluated, and that becomes the threshold. This could be useful for keeping the threshold steady while tweaking other things. For example, "LOL" has the score 0.435, so setting the threshold to "LOL" is the same as setting it to 0.435, but if the algorithm or parameters are changed in any way, the threshold adjusts accordingly.
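The string handling could look something like this (a sketch only; score_tweet here stands for whatever function scores a piece of text, and the real option parsing may differ):

def resolve_threshold(arg, score_tweet):
    try:
        return float(arg)            # a numeric threshold is used directly
    except ValueError:
        return score_tweet(arg)      # otherwise, score the string itself

So resolve_threshold('LOL', score_tweet) would come out at about 0.435 with the current model, and would track the model if the algorithm or parameters change.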

Trigram mode

--trigram-mode=MODE [word_aware_lc]

This decides how the trigrams are calculated.

lowercase                  converts to lowercase, normalises whitespace
lowercase_depunctuated     lowercase, with most punctuation removed
word_aware                 calculate trigrams separately for each word, and include upper case
word_aware_lc              like word_aware, but converting to lowercase first

See the hashmapd.trigram documentation for more detail. The only real contenders are "lowercase_depunctuated" and "word_aware_lc".
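For a rough idea of how the two contenders differ (the real logic is in hashmapd.trigram; this is only indicative):

text = 'no pizza'

# lowercase_depunctuated: trigrams run straight across the string, spaces and all
print([text[i:i + 3] for i in range(len(text) - 2)])
# ['no ', 'o p', ' pi', 'piz', 'izz', 'zza']

# word_aware_lc: each word is done separately, with <> as boundary markers
print([('<%s>' % w)[i:i + 3] for w in text.split() for i in range(len(w))])
# ['<no', 'no>', '<pi', 'piz', 'izz', 'zza', 'za>']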
