mikesuhan edited this page Jul 29, 2017 · 2 revisions

Keyness calculates a log-likelihood value for every token in a corpus and rank-orders the tokens by log likelihood. The input must consist of a corpus and a reference corpus; both should be iterables of tokenized or ngram-ized texts.
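Keyness here is the standard corpus-linguistics log-likelihood statistic computed from a 2×2 contingency table (the Rayson and Garside formulation). The sketch below is not the package's source code, but this standard calculation reproduces the log-likelihood values in the example output on this page (e.g. 4.452 for 'banana'):

```python
import math

def ll_keyness(freq, corpus_size, ref_freq, ref_size):
    """Standard 2x2 log-likelihood (Rayson & Garside) for one token.

    freq / ref_freq: the token's count in the corpus and reference corpus.
    corpus_size / ref_size: the total counts of each corpus.
    """
    total = corpus_size + ref_size
    # Expected counts under the null hypothesis that the token occurs
    # at the same rate in both corpora.
    expected = corpus_size * (freq + ref_freq) / total
    ref_expected = ref_size * (freq + ref_freq) / total
    ll = 0.0
    if freq:
        ll += freq * math.log(freq / expected)
    if ref_freq:
        ll += ref_freq * math.log(ref_freq / ref_expected)
    return 2 * ll

# 'banana': 3 counts against a corpus size of 20, none in a reference of size 22.
print(round(ll_keyness(3, 20, 0, 22), 3))  # → 4.452
```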

Input

Example input of tokenized texts:

corpus = [['the', 'banana', 'is', 'such', 'a', 'great', 'fruit'],
          ['i', 'really', 'love', 'banana', 'splits', 'and', 'banana', 'pudding'],
          ['anna', 'banana', 'had', 'a', 'banana', 'for', 'breakfast']]

reference_corpus = [['any', 'fruit', 'is', 'such', 'a', 'great', 'breakfast'],
                    ['i', 'really', 'love', 'apple', 'splits', 'and', 'bread', 'pudding'],
                    ['anna', 'peterson', 'had', 'a', 'pear', 'for', 'lunch']]

Example input of ngram-ized texts:

corpus = [[('the', 'banana', 'is', 'such'), ('banana', 'is', 'such', 'a'),
           ('is', 'such', 'a', 'great'), ('such', 'a', 'great', 'fruit')],
          [('i', 'really', 'love', 'banana'), ('really', 'love', 'banana', 'splits'),
           ('love', 'banana', 'splits', 'and'), ('banana', 'splits', 'and', 'banana'),
           ('splits', 'and', 'banana', 'pudding')],
          [('anna', 'banana', 'had', 'a'), ('banana', 'had', 'a', 'banana'),
           ('had', 'a', 'banana', 'for'), ('a', 'banana', 'for', 'breakfast')]]

reference_corpus = [[('any', 'fruit', 'is', 'such'), ('fruit', 'is', 'such', 'a'),
                     ('is', 'such', 'a', 'great'), ('such', 'a', 'great', 'breakfast')],
                    [('i', 'really', 'love', 'apple'), ('really', 'love', 'apple', 'splits'),
                     ('love', 'apple', 'splits', 'and'), ('apple', 'splits', 'and', 'bread'),
                     ('splits', 'and', 'bread', 'pudding')],
                    [('anna', 'peterson', 'had', 'a'), ('peterson', 'had', 'a', 'pear'),
                     ('had', 'a', 'pear', 'for'), ('a', 'pear', 'for', 'lunch')]]
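Ngram-ized texts like those above can be produced from tokenized texts with a small helper such as the one below (a plain-Python sketch; the keyness package may or may not ship a utility like this):

```python
def ngramize(tokens, n):
    """Return the list of n-grams (as tuples) for one tokenized text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokenized = [['the', 'banana', 'is', 'such', 'a', 'great', 'fruit']]

# Turn each tokenized text into its list of 4-grams.
ngram_corpus = [ngramize(text, 4) for text in tokenized]
print(ngram_corpus[0][0])  # → ('the', 'banana', 'is', 'such')
```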

Calculating Log Likelihood

Log likelihood based on a frequency distribution

from keyness import log_likelihood

output = log_likelihood(corpus, reference_corpus)

Log likelihood based on a type distribution

To calculate log likelihood based on the number of types (rather than tokens) in a text, use the dist_func keyword argument.

import keyness
from keyness import log_likelihood

output = log_likelihood(corpus, reference_corpus, dist_func=keyness.type_dist)
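The difference between the two distributions can be illustrated in plain Python: a frequency distribution counts every occurrence of a token, while a type distribution counts a token at most once per text. This is a sketch of the concept, not the package's internal code:

```python
from collections import Counter

corpus = [['i', 'really', 'love', 'banana', 'splits', 'and', 'banana', 'pudding']]

# Frequency distribution: every token occurrence counts.
freq_counts = Counter(token for text in corpus for token in text)

# Type distribution: each token counts at most once per text.
type_counts = Counter(token for text in corpus for token in set(text))

print(freq_counts['banana'], type_counts['banana'])  # → 2 1
```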

Setting the normalization rate

By default, frequencies are normalized to the rate per 1,000 words, but this can be changed using the norm_to keyword argument. The example below normalizes the frequencies to a rate per 100 words.

from keyness import log_likelihood

output = log_likelihood(corpus, reference_corpus, norm_to=100)
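Normalization itself is simple arithmetic: a raw frequency is divided by the corpus size and multiplied by norm_to. A sketch of the calculation (not the package's code), using counts from the example output on this page:

```python
def normalize(freq, corpus_size, norm_to=1000):
    """Rate of occurrence per norm_to words."""
    return freq * norm_to / corpus_size

print(normalize(3, 20))       # → 150.0  (per 1,000 words, the default)
print(normalize(3, 20, 100))  # → 15.0   (per 100 words)
```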

Output

A list of tuples

By default, output is a list of tuples. Each tuple contains the following data:

  1. a token or ngram
  2. its log likelihood value
  3. its frequency in the corpus
  4. its rate in the corpus
  5. its frequency in the reference corpus
  6. its rate in the reference corpus

Using the default keyword arguments, the output based on frequency distributions from the tokenized texts example above would look like this:

[('banana', 4.452, 3, 150.0, 0, 0.0),
 ('the', 1.484, 1, 50.0, 0, 0.0),
 ('bread', 1.293, 0, 0.0, 1, 45.455),
 ('peterson', 1.293, 0, 0.0, 1, 45.455),
 ('apple', 1.293, 0, 0.0, 1, 45.455),
 ('lunch', 1.293, 0, 0.0, 1, 45.455),
 ('pear', 1.293, 0, 0.0, 1, 45.455),
 ('any', 1.293, 0, 0.0, 1, 45.455),
 ('a', 0.009, 2, 100.0, 2, 90.909),
 ('love', 0.005, 1, 50.0, 1, 45.455),
 ('such', 0.005, 1, 50.0, 1, 45.455),
 ('and', 0.005, 1, 50.0, 1, 45.455),
 ('fruit', 0.005, 1, 50.0, 1, 45.455),
 ('anna', 0.005, 1, 50.0, 1, 45.455),
 ('breakfast', 0.005, 1, 50.0, 1, 45.455),
 ('pudding', 0.005, 1, 50.0, 1, 45.455),
 ('really', 0.005, 1, 50.0, 1, 45.455),
 ('i', 0.005, 1, 50.0, 1, 45.455),
 ('had', 0.005, 1, 50.0, 1, 45.455),
 ('for', 0.005, 1, 50.0, 1, 45.455),
 ('is', 0.005, 1, 50.0, 1, 45.455),
 ('splits', 0.005, 1, 50.0, 1, 45.455),
 ('great', 0.005, 1, 50.0, 1, 45.455)]
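Because the output is a plain Python list of tuples, it can be sliced and unpacked directly. For example, to print the strongest keywords (output here stands in for the list returned by log_likelihood, using the first two rows of the example above):

```python
# Stand-in for the list returned by log_likelihood.
output = [('banana', 4.452, 3, 150.0, 0, 0.0),
          ('the', 1.484, 1, 50.0, 0, 0.0)]

# The list is already sorted by log likelihood, so slicing gives the top n.
for token, ll, freq, rate, ref_freq, ref_rate in output[:2]:
    print(f'{token}\tLL={ll}\t{freq} ({rate}/1,000 words) vs. '
          f'{ref_freq} ({ref_rate}/1,000 words)')
```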

Saving the output

The output can be saved as a .tsv file using the save_as keyword argument. The value of save_as is used as the file path and name.

log_likelihood(corpus, reference_corpus, save_as="example.tsv")

The delimiter keyword argument sets the string used to separate the items in each row. The following will create a .csv file:

log_likelihood(corpus, reference_corpus, save_as="example.csv", delimiter=",")
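Because the saved file is plain delimited text, it can be read back with the standard csv module. This is a generic sketch, with the rows following the tuple layout described under Output:

```python
import csv

# Write a small file in the same shape for demonstration.
rows = [['banana', 4.452, 3, 150.0, 0, 0.0],
        ['the', 1.484, 1, 50.0, 0, 0.0]]
with open('example.tsv', 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(rows)

# Read it back; every field comes back as a string.
with open('example.tsv', newline='') as f:
    for token, ll, *rest in csv.reader(f, delimiter='\t'):
        print(token, float(ll))
```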