Backend: TF-IDF

The tfidf backend implements a baseline algorithm for automated subject indexing. The idea is to count the frequencies of terms (words) used in documents about each subject, weight those frequencies with the TF-IDF algorithm so that rare words count for more than frequently occurring ones, and build an index for matching the term frequencies of new documents against those of documents about specific subjects. The implementation is based on the topic modelling library Gensim.
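The idea described above can be illustrated with a minimal, standard-library-only sketch. This is not the backend's actual Gensim-based implementation: the function names (`tokenize`, `build_index`, `suggest`), the toy subjects, the `+ 1.0` IDF smoothing, and the use of cosine similarity for matching are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic words only, no stemming.
    return re.findall(r"[a-z]+", text.lower())


def normalize(vec):
    # Scale a sparse vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(subject_texts):
    # subject_texts maps each subject to the combined text of its training documents.
    term_counts = {s: Counter(tokenize(t)) for s, t in subject_texts.items()}
    n_subjects = len(term_counts)
    # Number of subjects in which each term occurs.
    subject_freq = Counter()
    for counts in term_counts.values():
        subject_freq.update(counts.keys())
    # Rare terms get a higher weight; the + 1.0 smoothing is an assumption.
    idf = {t: math.log(n_subjects / df) + 1.0 for t, df in subject_freq.items()}
    index = {
        s: normalize({t: c * idf[t] for t, c in counts.items()})
        for s, counts in term_counts.items()
    }
    return index, idf


def suggest(text, index, idf, limit=100):
    # Weight the new document's term frequencies the same way, then rank subjects.
    counts = Counter(tokenize(text))
    query = normalize({t: c * idf[t] for t, c in counts.items() if t in idf})
    scores = {
        s: sum(w * vec.get(t, 0.0) for t, w in query.items())
        for s, vec in index.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(s, score) for s, score in ranked[:limit] if score > 0]


# Toy training data, invented for this example.
subject_texts = {
    "cats": "cats purr and cats meow",
    "dogs": "dogs bark and dogs fetch",
}
index, idf = build_index(subject_texts)
suggestions = suggest("cats purr loudly", index, idf)
```

In this toy data the word "and" occurs under both subjects, so it receives a lower weight than "cats" or "bark", which each occur under only one.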

Getting started with the TF-IDF backend is easy because it requires no algorithm-specific configuration.

Example configuration

[tfidf-en]
name=TF-IDF English
language=en
backend=tfidf
analyzer=snowball(english)
limit=100
vocab=yso-en
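The configuration above uses INI-style syntax, so it can be inspected with Python's standard `configparser` module. The sketch below is only a sanity check of the example. The helper `load_projects` and the `EXPECTED_KEYS` tuple are hypothetical, and which settings are actually mandatory is an assumption based solely on the keys shown in the example.

```python
import configparser

# The example project definition from above, embedded as a string.
CONFIG_TEXT = """
[tfidf-en]
name=TF-IDF English
language=en
backend=tfidf
analyzer=snowball(english)
limit=100
vocab=yso-en
"""

# Assumption: these are simply the keys that appear in the example.
EXPECTED_KEYS = ("name", "language", "backend", "analyzer", "limit", "vocab")


def load_projects(text):
    # Parse the INI text into {project_id: {setting: value}}.
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


projects = load_projects(CONFIG_TEXT)
project = projects["tfidf-en"]
missing = [key for key in EXPECTED_KEYS if key not in project]
limit = int(project["limit"])  # all values are read as strings, so convert
```

The section name (`tfidf-en`) is the project identifier passed to the `annif` commands.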

Usage

Load a vocabulary:

annif loadvoc tfidf-en /path/to/Annif-corpora/vocab/yso-en.tsv

Train the model:

annif train tfidf-en /path/to/Annif-corpora/training/yso-finna-en.tsv.gz

Test the model with a single document:

cat document.txt | annif suggest tfidf-en

Evaluate a directory full of files in fulltext document corpus format:

annif eval tfidf-en /path/to/documents/
