Work repo for Stortinget Speech Corpus

Short introduction

This repo contains the code used to create the Stortinget Speech Corpus, a large speech corpus of recordings from Stortinget (the Norwegian parliament) with transcriptions extracted from the Stortinget proceedings. See our paper for more information about this dataset.

Content

data/ contains ASR transcriptions and proceedings.
matching/ contains a modified version of the matching code from CLARINSI.

Matching

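The ASR transcriptions need to be inverse-normalized (for example, spelled-out numbers converted to digits) before matching. Clone the normalization code, change into the cloned directory, and install it:

git clone https://github.com/Sprakbanken/sprakbanken_normalizer.git

cd sprakbanken_normalizer

python -m pip install .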

Example code

from sprakbanken_normalizer.inverse_text_normalizer import inv_normalize

print(inv_normalize("dette tallet er tre hundre tusen fire hundre og tjueto"))
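To normalize many ASR segments at once, the call above can be wrapped in a small helper. The following is a minimal sketch and not part of this repo: normalize_segments and the stand-in toy_normalizer are hypothetical names, and in practice you would pass inv_normalize as the normalizer argument.

```python
from typing import Callable, Iterable, List


def normalize_segments(
    segments: Iterable[str],
    normalizer: Callable[[str], str],
) -> List[str]:
    """Apply `normalizer` to every non-empty segment, preserving order."""
    results = []
    for segment in segments:
        text = segment.strip()
        if not text:
            # Skip blank segments rather than passing them to the normalizer.
            continue
        results.append(normalizer(text))
    return results


def toy_normalizer(text: str) -> str:
    """Hypothetical stand-in so the sketch runs without the package installed.

    It only rewrites the single word "tre" to "3"; it is NOT how
    sprakbanken_normalizer works.
    """
    return " ".join("3" if word == "tre" else word for word in text.split())


# With the package installed, pass inv_normalize instead of toy_normalizer.
print(normalize_segments(["vi har tre saker", "  ", "møtet er satt"], toy_normalizer))
```

Passing the normalizer as an argument keeps the helper independent of the package, so it can be tried out before sprakbanken_normalizer is installed.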

Saving results to an SQLite database

An SQLite database must be built first to store the references to the proceedings and the transcriptions:

python3 run.py results.db
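Once the database exists, it can be inspected from Python with the standard sqlite3 module. The sketch below lists the tables in a database file without assuming anything about the schema, which is not documented here. list_tables is a hypothetical helper name, and results.db is the filename from the command above.

```python
import sqlite3
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in the SQLite database at `db_path`."""
    connection = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Usage, assuming results.db was created by the command above:
# print(list_tables("results.db"))
```

Note that sqlite3.connect creates an empty database file if the path does not exist, so an empty list may simply mean the path is wrong.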

Make a CSV file with extracted transcriptions

python3 make_match_csv.py /path/to/outfile.csv
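The resulting file can be checked with Python's standard csv module. This is a minimal sketch that reports the header and the number of data rows. It assumes the file has a header row and is UTF-8 encoded, neither of which is confirmed here, and summarize_csv is a hypothetical helper name.

```python
import csv
from typing import List, Tuple


def summarize_csv(path: str) -> Tuple[List[str], int]:
    """Return (column names, number of data rows) for the CSV file at `path`."""
    # newline="" is the setting the csv module recommends when opening files.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # Empty list if the file has no rows at all.
        row_count = sum(1 for _ in reader)
    return header, row_count


# Usage, with the output path given to make_match_csv.py:
# columns, rows = summarize_csv("/path/to/outfile.csv")
# print(columns, rows)
```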
