Here I collected and implemented most of the known topic diversity measures used to quantify how different the discovered topics are. The more diverse the resulting topics, the better the coverage of the various aspects of the analyzed corpus. It is therefore important to also obtain topics that are different from each other, rather than only considering how coherent they are.
This repository is no longer maintained. You can find the same measures, and even more, in this cool repo: OCTIS!
- Proportion of unique words (PUW) [Dieng et al., 2020]
- Average Pairwise Jaccard Distance (JD) [Tran et al., 2013]
- Inverted Rank-Biased Overlap (IRBO) [Bianchi et al., 2021a]
- Word Embedding-based Centroid Distance (WE-CD) [Bianchi et al., 2021b]
- Word Embedding-based Pairwise Distance (WE-PD) [Terragni et al., 2021]
- Word Embedding-based Inverted Rank-Biased Overlap (WE-IRBO) [Terragni et al., 2021]
All these metrics are described in [Terragni et al., 2021].
The necessary input for all the metrics is a list of topics, i.e., a list of lists of strings. For example:
topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
You can also specify the parameter topk, which represents the number of words considered for each list. Note that topk must be less than or equal to the length of a topic list.
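For illustration, assuming each metric simply keeps the first topk words of a (ranked) topic list (an assumption about the implementation, not something stated here), topk=2 on the example above would reduce the first topic to:

topics[0][:2]
Out[1]: ['cat', 'animal']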
Here you can find a notebook with some examples: https://github.com/silviatti/topic-model-diversity/blob/master/topic_diversity_experiments.ipynb
Proportion of unique words (PUW):

topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
proportion_unique_words(topics, topk=3)
Out[1]: 1.0
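For intuition, here is a minimal sketch of how PUW can be computed (a sketch under my assumptions; puw_sketch is a hypothetical helper, not the repo's code): the number of unique words across all topics, divided by the total number of word slots, i.e., topk times the number of topics.

def puw_sketch(topics, topk):
    # fraction of unique words among the topk * len(topics) word slots
    unique_words = set(word for topic in topics for word in topic[:topk])
    return len(unique_words) / (topk * len(topics))

In the example above all nine words are distinct, so the score is 9 / 9 = 1.0.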
Average Pairwise Jaccard Distance (JD):

topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
pairwise_jaccard_diversity(topics, topk=3)
Out[1]: 1.0
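For intuition, a hedged sketch of the idea behind JD (jd_sketch is a hypothetical helper, not the repo's code): average the Jaccard distance, one minus intersection over union, across all pairs of topics.

from itertools import combinations

def jd_sketch(topics, topk):
    # average Jaccard distance over all pairs of topics
    dists = []
    for t1, t2 in combinations(topics, 2):
        a, b = set(t1[:topk]), set(t2[:topk])
        dists.append(1 - len(a & b) / len(a | b))
    return sum(dists) / len(dists)

The three example topics share no words, so every pairwise distance is 1 and the average is 1.0.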
Word Embedding-based Pairwise Distance (WE-PD). This metric requires a word embedding space as input to compute distances (parameter word_embedding_model). Please use gensim to load the word embedding space.
import gensim
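# cc.en.300.bin.gz: pre-trained English fastText vectors (Common Crawl, 300 dimensions), downloadable from https://fasttext.cc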
wv = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')
topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
pairwise_word_embedding_distance(topics, wv, topk=3)
Out[1]: 0.6379696850836505
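For intuition, a hedged sketch of the idea behind WE-PD (we_pd_sketch is a hypothetical helper, not the repo's code): for every pair of topics, average the cosine distances between the embeddings of their top-topk words, then average over all pairs. The sketch assumes the gensim FastText model loaded above, whose vectors are exposed as model.wv, so it would be called as we_pd_sketch(topics, wv, topk=3).

from itertools import combinations

def we_pd_sketch(topics, model, topk):
    # average, over all topic pairs, of the mean cosine distance between their words
    dists = []
    for t1, t2 in combinations(topics, 2):
        pair = [model.wv.distance(w1, w2) for w1 in t1[:topk] for w2 in t2[:topk]]
        dists.append(sum(pair) / len(pair))
    return sum(dists) / len(dists)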
Word Embedding-based Centroid Distance (WE-CD). This metric also requires a word embedding space as input to compute distances (parameter word_embedding_model). Please use gensim to load the word embedding space.
import gensim
wv = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')
topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
centroid_distance(topics, wv, topk=3)
Out[1]: 0.8380562411966147
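For intuition, a hedged sketch of the idea behind WE-CD (we_cd_sketch is a hypothetical helper, not the repo's code): build one centroid vector per topic by averaging its top-topk word embeddings, then average the pairwise cosine distances between centroids.

import numpy as np
from itertools import combinations

def we_cd_sketch(topics, model, topk):
    # one centroid per topic: the mean of its top-k word embeddings
    centroids = [np.mean([model.wv[w] for w in t[:topk]], axis=0) for t in topics]
    dists = []
    for c1, c2 in combinations(centroids, 2):
        cos_sim = np.dot(c1, c2) / (np.linalg.norm(c1) * np.linalg.norm(c2))
        dists.append(1 - cos_sim)
    return sum(dists) / len(dists)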
Inverted Rank-Biased Overlap (IRBO). The parameter weight controls how top-weighted the metric is: the smaller the weight, the more top-weighted the metric. When weight = 0, only the top-ranked word is considered.
topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
print("irbo p=0.5:",irbo(topics, weight=0.5, topk=3))
Out[1]: 1.0
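For intuition, a hedged sketch of the idea behind IRBO (irbo_sketch is a hypothetical helper, not the repo's code): compute the rank-biased overlap between every pair of topic lists and average one minus that score. The sketch assumes the bundled dlukes/rbo module exposes rbo(list1, list2, p) returning a result whose .ext field is the extrapolated RBO score.

import rbo  # the bundled dlukes/rbo module (assumed import path)
from itertools import combinations

def irbo_sketch(topics, weight, topk):
    # average of (1 - RBO) over all topic pairs; p controls top-weightedness
    scores = [1 - rbo.rbo(t1[:topk], t2[:topk], p=weight).ext
              for t1, t2 in combinations(topics, 2)]
    return sum(scores) / len(scores)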
Word Embedding-based Inverted Rank-Biased Overlap (WE-IRBO). This metric requires a word embedding space as input to compute distances (parameter word_embedding_model). Please use gensim to load the word embedding space. The parameter weight controls how top-weighted the metric is: the smaller the weight, the more top-weighted the metric. When weight = 0, only the top-ranked word is considered.
import gensim
wv = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')
topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
word_embedding_irbo(topics, wv, weight=0.9, topk=3)
Out[1]: 0.8225350005800525
For the implementation of Inverted Rank-Biased Overlap, I included the https://github.com/dlukes/rbo package; all rights reserved to the author of that package.