Keyness calculates a log-likelihood value for every token in a corpus and rank-orders the tokens by that value. The input must consist of a corpus and a reference corpus, both iterables of tokenized or ngram-ized texts.
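For background, keyness-style log likelihood is conventionally the Dunning/Rayson statistic, which compares a token's observed frequency in each corpus with its expected frequency under a pooled model. The sketch below is an assumption that keyness follows this standard formulation, not a copy of its code; the sample values are taken from the example output further down this page.

```python
import math

# A minimal sketch, assuming keyness implements the standard
# Dunning/Rayson log-likelihood statistic. a and b are a token's
# frequencies in the corpus and the reference corpus; c and d are
# the total sizes of the two frequency distributions.
def ll(a, b, c, d):
    e1 = c * (a + b) / (c + d)  # expected frequency in the corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference
    value = 0.0
    if a:
        value += a * math.log(a / e1)
    if b:
        value += b * math.log(b / e2)
    return 2 * value

print(round(ll(3, 0, 20, 22), 3))  # 4.452, the 'banana' row in the sample output
```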
For example, tokenized texts:

```python
corpus = [
    ['the', 'banana', 'is', 'such', 'a', 'great', 'fruit'],
    ['i', 'really', 'love', 'banana', 'splits', 'and', 'banana', 'pudding'],
    ['anna', 'banana', 'had', 'a', 'banana', 'for', 'breakfast'],
]
reference_corpus = [
    ['any', 'fruit', 'is', 'such', 'a', 'great', 'breakfast'],
    ['i', 'really', 'love', 'apple', 'splits', 'and', 'bread', 'pudding'],
    ['anna', 'peterson', 'had', 'a', 'pear', 'for', 'lunch'],
]
```

The same corpora ngram-ized as 4-grams:

```python
corpus = [
    [('the', 'banana', 'is', 'such'), ('banana', 'is', 'such', 'a'),
     ('is', 'such', 'a', 'great'), ('such', 'a', 'great', 'fruit')],
    [('i', 'really', 'love', 'banana'), ('really', 'love', 'banana', 'splits'),
     ('love', 'banana', 'splits', 'and'), ('banana', 'splits', 'and', 'banana'),
     ('splits', 'and', 'banana', 'pudding')],
    [('anna', 'banana', 'had', 'a'), ('banana', 'had', 'a', 'banana'),
     ('had', 'a', 'banana', 'for'), ('a', 'banana', 'for', 'breakfast')],
]
reference_corpus = [
    [('any', 'fruit', 'is', 'such'), ('fruit', 'is', 'such', 'a'),
     ('is', 'such', 'a', 'great'), ('such', 'a', 'great', 'breakfast')],
    [('i', 'really', 'love', 'apple'), ('really', 'love', 'apple', 'splits'),
     ('love', 'apple', 'splits', 'and'), ('apple', 'splits', 'and', 'bread'),
     ('splits', 'and', 'bread', 'pudding')],
    [('anna', 'peterson', 'had', 'a'), ('peterson', 'had', 'a', 'pear'),
     ('had', 'a', 'pear', 'for'), ('a', 'pear', 'for', 'lunch')],
]
```
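The ngram-ized form can be produced from the tokenized form in a few lines of Python. `ngramize` below is a hypothetical helper shown for illustration, not part of keyness:

```python
# A hypothetical helper (not part of keyness) that turns tokenized
# texts into the 4-gram form shown above.
def ngramize(texts, n=4):
    return [[tuple(text[i:i + n]) for i in range(len(text) - n + 1)]
            for text in texts]

ngram_corpus = ngramize(corpus, n=4)  # corpus = the tokenized texts above
```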
```python
from keyness import log_likelihood

output = log_likelihood(corpus, reference_corpus)
```
To calculate log likelihood based on the number of types (rather than tokens) in a text, use the dist_func keyword argument.
```python
import keyness
from keyness import log_likelihood

output = log_likelihood(corpus, reference_corpus, dist_func=keyness.type_dist)
```
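The difference between the two distributions can be sketched with `collections.Counter`. The per-text counting here is an assumption about what a type-based distribution means, not keyness's actual code:

```python
from collections import Counter

# Assumption: the type-based distribution counts each token at most
# once per text, while the default token-based distribution counts
# every occurrence. Illustrated on the tokenized corpus above:
token_counts = Counter(t for text in corpus for t in text)
type_counts = Counter(t for text in corpus for t in set(text))

print(token_counts['banana'])  # 5 occurrences in total
print(type_counts['banana'])   # present in 3 of the 3 texts
```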
By default, frequencies are normalized to the rate per 1,000 words, but this can be changed using the norm_to keyword argument. The example below normalizes the frequencies to a rate per 100 words.
```python
from keyness import log_likelihood

output = log_likelihood(corpus, reference_corpus, norm_to=100)
```
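Presumably normalization just scales a raw frequency by the size of its distribution; a minimal sketch under that assumption:

```python
# Assumption: a rate is the raw frequency scaled to norm_to words:
#   rate = frequency / distribution_size * norm_to
# e.g. a frequency of 1 against the 22-item reference distribution,
# normalized per 1,000 words, gives 1 / 22 * 1000 ≈ 45.455, matching
# the sample output below.
def rate(frequency, size, norm_to=1000):
    return frequency / size * norm_to
```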
By default, the output is a list of tuples. Each tuple contains the following data:
- a token or ngram
- its log likelihood value
- its frequency in the corpus
- its rate in the corpus
- its frequency in the reference corpus
- its rate in the reference corpus
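For example, each row can be unpacked in that order (the variable names here are just illustrative):

```python
for token, ll_value, freq, rate, ref_freq, ref_rate in output:
    print(f"{token}\t{ll_value}\t{rate}\t{ref_rate}")
```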
Using the default keyword arguments, the output based on frequency distributions from the tokenized texts example above would look like this:
```python
[('banana', 4.452, 3, 150.0, 0, 0.0),
 ('the', 1.484, 1, 50.0, 0, 0.0),
 ('bread', 1.293, 0, 0.0, 1, 45.455),
 ('peterson', 1.293, 0, 0.0, 1, 45.455),
 ('apple', 1.293, 0, 0.0, 1, 45.455),
 ('lunch', 1.293, 0, 0.0, 1, 45.455),
 ('pear', 1.293, 0, 0.0, 1, 45.455),
 ('any', 1.293, 0, 0.0, 1, 45.455),
 ('a', 0.009, 2, 100.0, 2, 90.909),
 ('love', 0.005, 1, 50.0, 1, 45.455),
 ('such', 0.005, 1, 50.0, 1, 45.455),
 ('and', 0.005, 1, 50.0, 1, 45.455),
 ('fruit', 0.005, 1, 50.0, 1, 45.455),
 ('anna', 0.005, 1, 50.0, 1, 45.455),
 ('breakfast', 0.005, 1, 50.0, 1, 45.455),
 ('pudding', 0.005, 1, 50.0, 1, 45.455),
 ('really', 0.005, 1, 50.0, 1, 45.455),
 ('i', 0.005, 1, 50.0, 1, 45.455),
 ('had', 0.005, 1, 50.0, 1, 45.455),
 ('for', 0.005, 1, 50.0, 1, 45.455),
 ('is', 0.005, 1, 50.0, 1, 45.455),
 ('splits', 0.005, 1, 50.0, 1, 45.455),
 ('great', 0.005, 1, 50.0, 1, 45.455)]
```
The output can be saved as a .tsv file using the save_as keyword argument. The value of save_as is the filepath (including the file name) to write to.
```python
log_likelihood(corpus, reference_corpus, save_as="example.tsv")
```
The delimiter keyword argument sets the character used to separate the items in each row. The following will create a .csv file:
```python
log_likelihood(corpus, reference_corpus, save_as="example.csv", delimiter=",")
```
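Since the saved output is plain delimited text, it can be read back with Python's standard csv module, for example:

```python
import csv

# Reading the saved .csv back with the standard library (not part of
# keyness); each row holds the six fields described above.
with open("example.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
```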