The Pansori TEDxKR Corpus is a Korean automatic speech recognition (ASR) corpus generated from Korean-language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of speech as audio-transcript pairs from 41 speakers.
This corpus was generated by using a new corpus data ingestion and processing system called Pansori. Please refer to this code repository and the following paper for further information on the Pansori ASR corpus generation system:
- Yoona Choi and Bowon Lee, "Pansori: ASR Corpus Generation from Open Online Video Contents," Proceedings of IEEE Seoul Section Student Paper Contest 2018, Hongik University, pp. 117--121, Nov. 2018.
Extra care was taken to maintain the quality of the generated corpus:
- Only TEDx talks hand transcribed by community translators were included.
- Corpus fragments were segmented at subtitle boundaries.
- Segmentation was fine-tuned by manual (tool-assisted) speech-text alignment.
- Final validation was performed with a state-of-the-art speech recognizer (Google Cloud Speech-To-Text).
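
The README does not say how the recognizer output was compared against the transcripts during final validation. As an illustration only, one common way to do such a check is to compute the character error rate (CER) between the reference transcript and the recognizer's hypothesis, and to flag fragments above a threshold. The function names and the `max_cer` threshold below are hypothetical and are not taken from the Pansori system.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER with whitespace removed, so spacing differences are not counted."""
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


def passes_validation(reference, hypothesis, max_cer=0.2):
    """Hypothetical acceptance rule: keep a fragment if its CER is low enough."""
    return character_error_rate(reference, hypothesis) <= max_cer
```

Here `hypothesis` would be the text returned by whichever recognizer is used; obtaining it is outside the scope of this sketch.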
The speech audio in the corpus consists of 16-bit FLAC files with a sampling rate of 16 kHz. Further information on the included speech content is summarized in the following table:
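
To check that a downloaded file matches the stated audio format, the sample rate and bit depth can be read from the FLAC header. The following is a minimal sketch that uses only the Python standard library and assumes the STREAMINFO block is the first metadata block, as the FLAC format requires. The function names are illustrative and are not part of the corpus tooling.

```python
def read_flac_streaminfo(path):
    """Read sample rate, channels, bit depth, and sample count from a FLAC header."""
    with open(path, "rb") as f:
        header = f.read(42)  # 4-byte marker + 4-byte block header + 34-byte STREAMINFO
    if len(header) < 42 or header[:4] != b"fLaC":
        raise ValueError("not a FLAC file: " + str(path))
    if header[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 18..25 pack: 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1), 36-bit total sample count.
    packed = int.from_bytes(header[18:26], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_corpus_format(path):
    """True if the file is 16-bit audio sampled at 16 kHz, as stated for this corpus."""
    info = read_flac_streaminfo(path)
    return info["sample_rate"] == 16000 and info["bits_per_sample"] == 16
```

Dividing `total_samples` by `sample_rate` gives the duration of a fragment in seconds, which can be summed over all files to confirm the approximate total length of the corpus.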
The corpus files can be downloaded individually or as a whole from the GitHub repository. Alternatively, the entire corpus is available as a single archive file at the following link: https://storage.googleapis.com/pansori/corpus/pansori-tedxkr-corpus-1.0.tar.gz [170MB].
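
For scripted setups, the archive can be fetched and unpacked with the Python standard library. This is a sketch under a few assumptions: the directory layout inside the archive is not described here, so the code simply searches for files ending in `.flac`, and the helper names and default paths are illustrative.

```python
import tarfile
import urllib.request
from pathlib import Path

CORPUS_URL = "https://storage.googleapis.com/pansori/corpus/pansori-tedxkr-corpus-1.0.tar.gz"


def download_archive(url=CORPUS_URL, dest="pansori-tedxkr-corpus-1.0.tar.gz"):
    """Download the archive unless a file with that name already exists."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_archive(archive_path, out_dir="."):
    """Extract a .tar.gz archive and return the FLAC files found under out_dir."""
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Basic safety check: refuse member names that resolve outside out_dir.
        for member in tar.getmembers():
            target = (out_dir / member.name).resolve()
            if target != out_dir and out_dir not in target.parents:
                raise ValueError("unsafe path in archive: " + member.name)
        tar.extractall(out_dir)
    return sorted(out_dir.rglob("*.flac"))


# Example usage (downloads about 170 MB):
# archive = download_archive()
# flac_files = extract_archive(archive, "pansori-tedxkr")
# print(len(flac_files), "FLAC files extracted")
```

The usage lines are left commented out so that importing or pasting the snippet does not trigger the download by accident.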
We are currently preparing a large-scale Korean-language ASR corpus by further automating the data processing pipeline used to generate this TEDxKR corpus. The new Korean ASR corpus will also be released under a permissive license once we confirm the license type with the license holder.