This repository contains experiments comparing the accuracy of open source Finnish part-of-speech taggers and lemmatization algorithms. The evaluated tools are:
- Experimental Finnish model for spaCy
- FinnPos
- Simplemma
- Stanza
- Trankit
- Turku neural parser pipeline
- UDPipe (through spacy-udpipe; see the usage sketch after this list)
- UralicNLP
- Voikko
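As an example of how one of these tools is invoked, here is a minimal sketch using spacy-udpipe (the example sentence is arbitrary and error handling is omitted; this is not the repository's prediction code):

```python
import spacy_udpipe

spacy_udpipe.download("fi")    # fetch the Finnish UD model on first use
nlp = spacy_udpipe.load("fi")

doc = nlp("Minä rakastan sinua.")
for token in doc:
    print(token.text, token.lemma_, token.pos_)
```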
The models are evaluated on the following datasets:
- FinnTreeBank 1: a randomly sampled subset of about 1000 sentences
- FinnTreeBank 2: the news, Sofie and Wikipedia subsets
- Turku Dependency Treebank: the test set
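For reference, a hypothetical sketch of reading gold lemmas and POS tags from a CoNLL-U file, the format used by e.g. the Turku Dependency Treebank (this is illustrative, not the repository's preprocess_data.py):

```python
def read_conllu(path):
    """Return sentences as lists of {form, lemma, upos} dicts."""
    sentences, tokens = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if tokens:
                    sentences.append(tokens)
                    tokens = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():  # skip multiword-token ranges like "1-2"
                    tokens.append({"form": cols[1], "lemma": cols[2], "upos": cols[3]})
    if tokens:
        sentences.append(tokens)
    return sentences
```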
Install dependencies:
- Python 3.9
- libvoikko with Finnish morphology data files
- clang (or another C++ compiler)
- Dependencies needed to compile FinnPos and cg3
Set up the git submodules, create a Python virtual environment, and download the test data and models by running the following commands:
git submodule init
git submodule update
python3.9 -m venv venv
source venv/bin/activate
pip install wheel
pip install -r requirements.txt
./download_data.sh
./download_models.sh
python preprocess_data.py
export PATH=$(pwd)/models/cg3/src:$PATH
# Predict lemmas and POS tags using all models.
# Writes results under results/predictions/*/
python predict.py
# Evaluate by comparing the predictions with the gold standard data.
# Writes results to results/evaluation.csv
python evaluate.py
# Plot the evaluations.
# Saves the plots under results/images/
python plot_results.py
The numerical results are saved in results/evaluation.csv, the POS and lemma errors made by each model in results/errorcases, and the plots in results/images.
Lemmatization error rates (proportion of tokens where the predicted lemma differs from the ground truth lemma) for the tested algorithms on the test datasets.
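As a concrete definition of the metric, here is a minimal sketch (the function name and inputs are illustrative, not the repository's evaluate.py):

```python
def lemma_error_rate(predicted_lemmas, gold_lemmas):
    """Proportion of tokens whose predicted lemma differs from the gold lemma."""
    assert len(predicted_lemmas) == len(gold_lemmas)
    errors = sum(p != g for p, g in zip(predicted_lemmas, gold_lemmas))
    return errors / len(gold_lemmas)

# One mismatch out of four tokens -> error rate 0.25
print(lemma_error_rate(["minä", "olla", "iloinen", "."],
                       ["minä", "olla", "iloissaan", "."]))
```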
Execution duration as a function of the average (over datasets) error rate. Lower values are better on both axes. Note that the Y-axis uses a log scale.
The execution duration is measured as a batched evaluation (a batch contains all sentences from one dataset) on a 4-core CPU. The Turku neural parser and Stanza can be run on a GPU, which would most likely improve their speed, but I haven't tested that.
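For illustration, a minimal sketch of such a batched timing measurement, assuming a spaCy-style pipeline object (the `nlp` and `sentences` names are placeholders, not the repository's actual code):

```python
import time

def timed_batch(nlp, sentences):
    """Tag a whole dataset in one batch and measure the wall-clock duration."""
    start = time.perf_counter()
    docs = list(nlp.pipe(sentences))  # spaCy-style batch processing
    duration = time.perf_counter() - start
    return docs, duration
```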
Part-of-speech error rates for the tested algorithms.
Note that FinnPos and Voikko do not distinguish between auxiliary and main verbs, so their performance suffers by 4-5 percentage points in this evaluation because they predict all AUX tags as VERB.
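To illustrate the effect, here is a sketch (not the repository's evaluate.py) of a POS error rate with an optional AUX-to-VERB merge, which removes the penalty for models that never predict AUX:

```python
def pos_error_rate(predicted, gold, merge_aux=False):
    """Token-level POS error rate; optionally collapse AUX into VERB on both sides."""
    if merge_aux:
        predicted = ["VERB" if tag == "AUX" else tag for tag in predicted]
        gold = ["VERB" if tag == "AUX" else tag for tag in gold]
    errors = sum(p != g for p, g in zip(predicted, gold))
    return errors / len(gold)
```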
Execution duration as a function of the average error rate.
Comparing the spacy-fi and Stanza results, increasing the computational effort roughly 100-fold seems to improve the accuracy only by a small amount.