Juho Inkinen edited this page Sep 23, 2022 · 12 revisions

The pav backend implements a trainable dynamic ensemble that combines results from multiple projects. Subject suggestion requests sent to the ensemble backend are forwarded to the source projects. The scores returned by the source projects are then re-weighted using isotonic regression, which attempts to convert raw scores into probabilities. The regression uses the PAV (pool adjacent violators) algorithm as implemented in the scikit-learn library. A separate regression is fitted for each concept, and the results are combined by calculating the mean of the regressed scores (i.e. estimated probabilities) for each concept.
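The idea of the regression step can be illustrated with a minimal pure-Python sketch. It is not the backend's actual code, which relies on scikit-learn. The function names `pav_fit`, `pav_predict` and `combine`, the source names and the training values are all hypothetical, and the sketch makes two simplifications: it assumes distinct raw scores, and it predicts with a step function rather than interpolating.

```python
def pav_fit(scores, labels):
    """Fit a non-decreasing step function with pool adjacent violators.

    scores: raw scores one source gave to a single concept (assumed distinct)
    labels: 1 if the concept was a correct subject for that document, else 0
    Returns a list of (upper_raw_score, estimated_probability) steps.
    """
    blocks = []  # each block is [sum of labels, count, highest raw score]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Pool neighbouring blocks while their means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def pav_predict(model, score):
    """Map a raw score to an estimated probability (step function)."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]


def combine(models, raw_scores):
    """Mean of the regressed scores of one concept over all sources."""
    regressed = [pav_predict(models[name], raw_scores[name]) for name in models]
    return sum(regressed) / len(regressed)


# Hypothetical training data for one concept from two source projects.
models = {
    "source-a": pav_fit([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 0, 1, 0, 1, 1]),
    "source-b": pav_fit([0.2, 0.4, 0.6, 0.8], [0, 1, 1, 1]),
}
combined = combine(models, {"source-a": 0.35, "source-b": 0.3})
```

In this toy data the out-of-order pair at raw scores 0.3 and 0.4 for `source-a` is pooled into a single step, which is how the regression smooths noisy raw scores into a monotonic mapping.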

Note: See nn_ensemble for an alternative dynamic ensemble backend that can also be further trained during use, unlike PAV.

Example configuration

[pav-en]
name=PAV ensemble English
language=en
backend=pav
sources=tfidf-en,mllm-en
min-docs=3
limit=100
vocab=yso

The sources setting is a comma-separated list of projects whose results will be combined. Optional weights may be given like this:

sources=tfidf-en:1,mllm-en:2

This setting would give twice as much weight to results from mllm-en as to results from tfidf-en.
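As a rough illustration of what such weights mean, the sketch below parses a sources value and computes a weighted mean of two scores. It assumes that a source without an explicit weight counts as 1 and that weights are normalized by their sum. Both are assumptions made for this example rather than statements about the backend's internals, and the helper names and score values are hypothetical.

```python
def parse_sources(value):
    """Split 'name:weight,name:weight' into (name, weight) pairs."""
    sources = []
    for item in value.split(","):
        name, _, weight = item.strip().partition(":")
        # Assumption: a missing weight is treated as 1.
        sources.append((name, float(weight) if weight else 1.0))
    return sources


def weighted_mean(scores, sources):
    """Weighted mean of per-source scores for a single concept."""
    total_weight = sum(weight for _, weight in sources)
    return sum(scores[name] * weight for name, weight in sources) / total_weight


sources = parse_sources("tfidf-en:1,mllm-en:2")
# Hypothetical scores for one concept from each source project.
score = weighted_mean({"tfidf-en": 0.25, "mllm-en": 1.0}, sources)
```

With these numbers the score from mllm-en counts twice, so the result is (0.25 + 2 × 1.0) / 3 = 0.75 instead of the unweighted mean of 0.625.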

The min-docs setting specifies how many positive examples of a concept are required in the training data in order to create a regression model for that concept. Recommended values are between 3 and 10. When not enough positive examples are available, raw scores are used instead, similar to the basic ensemble backend.
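The effect of min-docs can be sketched as follows. The helper names, concept labels and training documents are hypothetical, and the `regress` argument stands in for whatever fitted regression model the concept has.

```python
from collections import Counter


def concepts_with_model(gold_subjects_per_doc, min_docs):
    """Return the concepts that have at least min_docs positive examples."""
    counts = Counter(
        concept for subjects in gold_subjects_per_doc for concept in set(subjects)
    )
    return {concept for concept, n in counts.items() if n >= min_docs}


def final_score(concept, raw_score, regress, trained):
    """Use the regressed score when a model exists, otherwise the raw score."""
    return regress(raw_score) if concept in trained else raw_score


# Hypothetical gold-standard subjects of four training documents.
training_subjects = [["cats", "dogs"], ["cats"], ["cats", "birds"], ["dogs"]]
trained = concepts_with_model(training_subjects, min_docs=3)

# Stand-in regression that always estimates a probability of 0.5.
cats_score = final_score("cats", 0.9, lambda s: 0.5, trained)
dogs_score = final_score("dogs", 0.9, lambda s: 0.5, trained)
```

Here only "cats" reaches three positive examples, so its score is regressed, while "dogs" keeps its raw score of 0.9. A lower min-docs value fits models from fewer examples, which covers more concepts but makes each model noisier.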

Usage

Load a vocabulary:

annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl

Train the ensemble:

annif train pav-en /path/to/Annif-corpora/training/yso-finna-en.tsv.gz

Test the model with a single document:

cat document.txt | annif suggest pav-en

Evaluate a directory full of files in fulltext document corpus format:

annif eval pav-en /path/to/documents/