This repository contains experiments comparing the accuracy of open-source Finnish part-of-speech taggers and lemmatization algorithms. The following tools are evaluated:
- spaCy 3.3.0
- Experimental Finnish model for spaCy 0.10.0b1
- FinnPos git commit 81c1f735 (Oct 2019)
- Simplemma 0.6.0
- Stanza 1.4.0
- Trankit 1.1.1
- Turku neural parser pipeline git commit 8c9425dd (Jan 2022)
- UDPipe (through spacy-udpipe 1.0.0)
- UralicNLP 1.3.0
- libvoikko 4.3.1 and the Python voikko module 0.5
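As an illustration of the kind of interface being compared, here is a minimal lemmatization call with Simplemma (a sketch assuming the 0.6-era API, where the Finnish language data is loaded explicitly; later Simplemma releases changed this signature):

```python
import simplemma

# Assumption: Simplemma 0.6.x API. Newer versions use
# simplemma.lemmatize(token, lang="fi") without a separate data object.
langdata = simplemma.load_data("fi")
for token in ["Kissat", "juoksivat", "pihalla"]:
    print(simplemma.lemmatize(token, langdata))  # prints the predicted lemma
```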
The evaluations are run on the following datasets:
- FinnTreeBank 1 v1: a randomly sampled subset of about 1000 sentences
- FinnTreeBank 2: the news, Sofie and Wikipedia subsets
- UD_Finnish-TDT r2.9: the test set
Install dependencies:
- Python 3.9
- libvoikko with Finnish morphology data files
- clang (or another C++ compiler)
- Dependencies needed to compile FinnPos and cg3
Set up the git submodules, create a Python 3.9 virtual environment (3.9 is required because the Turku parser is incompatible with more recent Python versions), and download the test data and models by running the following commands:
```sh
git submodule init
git submodule update
python3.9 -m venv venv
source venv/bin/activate
pip install wheel
pip install -r requirements.txt
./download_data.sh
./download_models.sh
```

Run the evaluation:

```sh
./run.sh
```
The numerical results will be saved in `results/evaluation.csv`, the POS and lemma errors made by each model in `results/errorcases`, and the plots in `results/images`.
Lemmatization error rates (proportion of tokens where the predicted lemma differs from the ground truth lemma) for the tested algorithms on the test datasets.
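The error rate itself is a simple per-token comparison. A minimal sketch (the repository's own evaluation code may differ, for example in tokenization alignment or case handling):

```python
def lemma_error_rate(predicted, gold):
    """Proportion of tokens whose predicted lemma differs from the gold lemma."""
    assert len(predicted) == len(gold), "token sequences must be aligned"
    errors = sum(1 for p, g in zip(predicted, gold) if p != g)
    return errors / len(gold)
```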
Execution duration as a function of the error rate averaged over the datasets. Lower values are better on both axes. Notice that the Y-axis is on a log scale.
The execution duration is measured as a batched evaluation (a batch contains all sentences from one dataset) on a 4-core CPU. The Turku neural parser and Stanza (formerly StanfordNLP) can be run on a GPU, which most likely improves their speed, but I haven't tested that.
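A batched measurement of this kind amounts to timing one call over a whole dataset. A hedged sketch, where `tag_batch` is a hypothetical stand-in for whichever model API is being benchmarked:

```python
import time

def timed_batch_run(tag_batch, sentences):
    # One batch contains all sentences of a dataset, so model startup and
    # per-call overhead are amortized over the whole dataset.
    start = time.perf_counter()
    predictions = tag_batch(sentences)
    return predictions, time.perf_counter() - start
```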
Part-of-speech error rates for the tested algorithms.
Note that FinnPos and Voikko do not distinguish between auxiliary and main verbs, so their performance suffers by 4-5% in this evaluation because they mispredict all AUX tags as VERB.
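One way to quantify this effect is to score the same predictions twice, once as-is and once with AUX collapsed into VERB; the gap is the penalty attributable to the missing distinction. A sketch (not the repository's scoring code):

```python
def pos_error_rate(predicted, gold, collapse_aux=False):
    """POS error rate; optionally treat AUX and VERB as the same tag."""
    def norm(tag):
        return "VERB" if collapse_aux and tag == "AUX" else tag
    errors = sum(1 for p, g in zip(predicted, gold) if norm(p) != norm(g))
    return errors / len(gold)
```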
Execution duration as a function of the average error rate.
Comparing the spacy-fi and Stanza results, a roughly 100-fold increase in computational effort seems to improve the accuracy only slightly.