Lexupdater is a tool to extend and update the NST pronunciation lexicon with new words and dialect variation in the pronunciation transcriptions.
The dialectal variation is updated through string transformation rules (search-and-replace with regex patterns) developed by trained linguists in the Language Bank at the National Library of Norway.
Since NST was first published before 2000, words that came into use after 2000 have been added from two corpora: Norwegian Newspaper Corpus Bokmål and Målfrid 2021 – Freely Available Documents from Norwegian State Institutions.
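To illustrate what a search-and-replace rule does to a transcription, here is a minimal sketch in Python. The rule format, the pattern and the sample transcription are made up for illustration; they are not taken from the actual rule files or from the NST transcription standard.

```python
import re

# Hypothetical rule list of (pattern, replacement) pairs.
# These are NOT actual lexupdater rules.
EXAMPLE_RULES = [
    (r"r$", "R"),  # replace an "r" at the very end of the transcription
]


def apply_rules(transcription, rules):
    """Apply each (pattern, replacement) pair in order and return the result."""
    for pattern, replacement in rules:
        transcription = re.sub(pattern, replacement, transcription)
    return transcription


print(apply_rules("k a r", EXAMPLE_RULES))
```

Rules are applied in list order, so a later rule sees the output of the earlier ones.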
Ensure you have Python version 3.8 or higher.
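If you are unsure which interpreter your shell picks up, a quick check like the following can help. It is only a convenience sketch; the function name is made up here and is not part of lexupdater.

```python
import sys

MIN_VERSION = (3, 8)  # minimum version stated above


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION


if python_is_supported():
    print("Python version is OK")
else:
    print("Python 3.8 or higher is required")
```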
Create a virtual environment and activate it, e.g.
python -m venv .venv
source .venv/bin/activate
Install lexupdater:
pip install git+https://github.com/Sprakbanken/lexupdater.git@v0.7.6
The NST pronunciation lexicon is available in an SQLite database, nst_lexicon_bm.db. It has a table with words, and one with pronunciations (base).
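To see which tables the database actually contains before running any updates, you can query SQLite's built-in sqlite_master catalogue. This is a small sketch using only the standard library; the helper name list_tables is made up here, and nothing is assumed about the table columns.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    # Each row is a one-element tuple holding the table name.
    return [name for (name,) in rows]
```

Call it as list_tables("nst_lexicon_bm.db") once the file is downloaded. Note that sqlite3.connect creates an empty database file if the path does not exist, so an empty list usually means the path is wrong.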
Lexupdater uses external Python files with dicts of regex patterns to update the database, and a CSV file to add new words. These files are available from the nb_uttale repo.
./fetch_data.sh
- Download the pronunciation database by clicking this link: https://www.nb.no/sbfil/uttaleleksikon/nst_lexicon_bm.db
- Use git commands to fetch the rules and new words from nb_uttale:
git remote add nb_uttale git@github.com:Sprakbanken/nb_uttale.git
git fetch nb_uttale
git show nb_uttale/main:data/input/rules_v1.py > rules.py
git show nb_uttale/main:data/input/exemptions_v1.py > exemptions.py
git show nb_uttale/main:data/input/newwords_2022.csv > newwords.csv
git remote remove nb_uttale
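The database download in the first step can also be scripted instead of clicking the link. The sketch below uses only the standard library and the URL given above; the helper name download_file is made up here.

```python
import shutil
import urllib.request
from pathlib import Path

# URL of the pronunciation database, as given in the download step.
DATABASE_URL = "https://www.nb.no/sbfil/uttaleleksikon/nst_lexicon_bm.db"


def download_file(url, destination):
    """Stream the file at url to destination and return the destination path."""
    destination = Path(destination)
    with urllib.request.urlopen(url) as response, destination.open("wb") as out_file:
        # Copy in chunks so the whole file is never held in memory.
        shutil.copyfileobj(response, out_file)
    return destination
```

Calling download_file(DATABASE_URL, "nst_lexicon_bm.db") saves the database under the file name used in the rest of this README.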
Run lexupdater newwords from your command line.
Run lexupdater update from the command line.
The update command and the default settings correspond to the following:
lexupdater -v \
--database "nst_lexicon_bm.db" \
--newwords-path "newwords.csv" \
--dialects e_spoken \
-d e_written \
-d sw_spoken \
-d sw_written \
-d w_spoken \
-d w_written \
-d t_spoken \
-d t_written \
-d n_spoken \
-d n_written \
update \
--rules-file "rules.py" \
--exemptions-file "exemptions.py" \
--output-dir "data/output"
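If you want to run the update for only some dialects, or call lexupdater from a script, the command above can be assembled programmatically. The sketch below uses only the flags and default values shown in the command; the function name build_update_command is made up here.

```python
import shlex

# Dialect names as listed in the command above.
DIALECTS = (
    "e_spoken", "e_written", "sw_spoken", "sw_written", "w_spoken",
    "w_written", "t_spoken", "t_written", "n_spoken", "n_written",
)


def build_update_command(
    dialects=DIALECTS,
    database="nst_lexicon_bm.db",
    newwords_path="newwords.csv",
    rules_file="rules.py",
    exemptions_file="exemptions.py",
    output_dir="data/output",
):
    """Return the lexupdater update command as a list of arguments."""
    command = ["lexupdater", "-v", "--database", database,
               "--newwords-path", newwords_path]
    # The dialect flag is repeated once per dialect.
    for dialect in dialects:
        command += ["-d", dialect]
    command += ["update", "--rules-file", rules_file,
                "--exemptions-file", exemptions_file,
                "--output-dir", output_dir]
    return command


# Print a shell-ready command for two dialects only.
print(shlex.join(build_update_command(dialects=["e_spoken", "e_written"])))
```

The returned list can be passed directly to subprocess.run(command, check=True) if you prefer to launch lexupdater from Python.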
The parameters database, output_dir, newwords_path, dialects and the update parameters rules_file and exemptions_file can be changed in your local config.py.
You can also set the parameters directly from the command line. See the help flag for more info:
lexupdater -h
We use pyproject.toml to configure the package.
python -m build .
The Python distribution wheel is located in the dist folder.
It can be installed with pip:
pip install dist/lexupdater-*.whl # OS-independent