This repo contains the code used to create the Stortinget Speech Corpus, a large corpus of speech from Stortinget (the Norwegian parliament) paired with transcriptions extracted from Stortinget's proceedings. See our paper for more information about this dataset.
data/ contains ASR transcriptions and proceedings.
matching/ contains a modified version of the matching code from CLARINSI.
The ASR transcriptions need to be inverse-normalized. Clone and install the normalization code:
git clone https://github.com/Sprakbanken/sprakbanken_normalizer.git
cd sprakbanken_normalizer
python -m pip install .
Example code
from sprakbanken_normalizer.inverse_text_normalizer import inv_normalize
print(inv_normalize("dette tallet er tre hundre tusen fire hundre og tjueto"))
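To apply the normalizer to a whole transcription file rather than a single string, a small helper along these lines may be convenient. This is only a sketch: `normalize_lines` is a hypothetical name that is not part of this repo or of sprakbanken_normalizer, and it assumes one utterance per line. The normalizing function is passed in as an argument, so the helper itself does not depend on the package.

```python
from typing import Callable, Iterable, List


def normalize_lines(lines: Iterable[str], normalize: Callable[[str], str]) -> List[str]:
    """Apply `normalize` to every non-empty line and return the results.

    `normalize` is any str -> str function; the intended choice here is
    inv_normalize from sprakbanken_normalizer (assumed to be installed).
    """
    results = []
    for line in lines:
        text = line.strip()  # drop the trailing newline and surrounding spaces
        if text:  # skip blank lines
            results.append(normalize(text))
    return results
```

With the package installed, passing an open file object and `inv_normalize` to `normalize_lines` should return the inverse-normalized utterances in file order.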
An SQL database needs to be built first to save the references to the proceedings and the transcriptions.
python3 run.py results.db
python3 make_match_csv.py /path/to/outfile.csv
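As a quick sanity check after running the commands above, the database can be inspected from Python. The sketch below assumes that results.db is an SQLite file, which this README does not state explicitly. It makes no assumption about table names and simply lists whatever tables exist with their row counts. `table_row_counts` is a hypothetical helper, not part of this repo.

```python
import sqlite3
from typing import Dict


def table_row_counts(db_path: str) -> Dict[str, int]:
    """Return {table_name: row_count} for every table in an SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalogue of schema objects.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Table names come from the catalogue itself, so quoting is enough here.
            counts[name] = conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        return counts
    finally:
        conn.close()
```

Calling `table_row_counts("results.db")` should show whether the tables were populated. A missing file or an empty dictionary suggests the build step did not run as expected.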