kosp2e

Korean Speech to English Translation Corpus

Dataset

Freely available

  • Speech files
  • Lists of Train/Dev/Test filenames and their English translations

Provided upon request (in this link)

  • Korean scripts
  • Other metadata (for StyleKQC and Covid-ED)

Howto

git clone https://github.com/warnikchow/kosp2e
cd kosp2e
cd data
wget https://www.dropbox.com/s/y74ew1c1evdoxs1/data.zip
unzip data.zip

This gives you the folder with the speech files (data and its subfolders) and the split lists (split and the .xlsx files).
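
As a quick sanity check after unzipping, a minimal Python sketch like the one below could count the speech files under each top-level subfolder. The helper name `count_wavs` is hypothetical, and the sketch assumes only that the speech files end in `.wav` and sit somewhere below the folder you pass in; adjust the path to wherever the archive was extracted.

```python
from collections import Counter
from pathlib import Path


def count_wavs(root):
    """Count .wav files under each top-level subfolder of `root`."""
    root = Path(root)
    counts = Counter()
    for wav in root.rglob("*.wav"):
        # Group by the first path component below root (e.g. one per subcorpus).
        rel = wav.relative_to(root)
        top = rel.parts[0] if len(rel.parts) > 1 else "."
        counts[top] += 1
    return dict(counts)


if __name__ == "__main__":
    # "data" is an assumed location; point this at the extracted folder.
    for folder, n in sorted(count_wavs("data").items()):
        print(f"{folder}: {n} wav files")
```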

Specification

| Dataset | License | Domain | Characteristics | Volume (Train / Dev / Test) | Tokens (ko / en) | Speakers (Total) |
|---|---|---|---|---|---|---|
| Zeroth | CC-BY 4.0 | News / newspaper | DB originally for speech recognition | 22,263 utterances (3,004 unique scripts) (21,589 / 197 / 461) | 72K / 120K | 115 |
| KSS | CC-BY-NC-SA 4.0 | Textbook (colloquial descriptions) | Originally recorded by a single speaker (multi-speaker recording augmented) | 25,708 utterances = 12,854 * 2 (recording augmented) (24,940 / 256 / 512) | 64K / 95K | 17 |
| StyleKQC | CC-BY-SA 4.0 | AI agent (commands) | Speech act (4) and topic (6) labels are included | 30,000 utterances (28,800 / 400 / 800) | 237K / 391K | 60 |
| Covid-ED | CC-BY-NC-SA 4.0 | Diary (monologue) | Sentences are in document level; emotion tags included | 32,284 utterances (31,324 / 333 / 627) | 358K / 571K | 71 |

  • The total number of .wav files in the Zeroth dataset does not match the number of translation pairs provided, because some examples were excluded during corpus construction to guarantee data quality. To keep the original Zeroth dataset intact, we did not delete those files from the .wav folder. Preprocessing and data loading are not affected by this difference in the file lists.
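
The note above implies that only files named in the split lists are used, and any extra .wav files are ignored. A minimal sketch of that idea follows, assuming the listed filenames have already been read into a Python collection. The function name `select_listed_wavs` and the match-by-basename rule are illustrative assumptions, not the logic of preprocessing.py.

```python
from pathlib import Path


def select_listed_wavs(wav_dir, listed_names):
    """Return paths of .wav files whose basename appears in `listed_names`.

    Extra .wav files that have no translation pair are simply skipped.
    """
    wanted = {Path(name).name for name in listed_names}
    return sorted(p for p in Path(wav_dir).rglob("*.wav") if p.name in wanted)
```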

Baseline

| Model | BLEU | WER (ASR) | BLEU (MT/ST) |
|---|---|---|---|
| ASR-MT (Pororo) | 16.6 | 34.0 | 18.5 (MT) |
| ASR-MT (PAPAGO) | 21.3 | 34.0 | 25.0 (MT) |
| Transformer (Vanilla) | 2.6 | - | - |
| ASR pretraining | 5.9 | 24.0* | - |
| Transformer + Warm-up | 8.7 | - | 35.7 (ST)* |
| + Fine-tuning | 18.3 | - | - |

  • Some of the numbers differ from those in the paper (some errors were fixed afterward), but the differences are unlikely to change the results much.
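
For readers unfamiliar with the WER column, the sketch below computes word error rate in its textbook form: word-level edit distance divided by the reference length, as a percentage. The README does not state how the reported scores were tokenized or computed, so this is only an illustration of the metric. The function name `word_error_rate` and whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance over reference length, in percent."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```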

Recipe

wget https://github.com/pytorch/fairseq/archive/148327d8c1e3a5f9d17a11bbb1973a7cf3f955d3.zip
unzip 148327d8c1e3a5f9d17a11bbb1973a7cf3f955d3.zip
pip install -e ./fairseq-148327d8c1e3a5f9d17a11bbb1973a7cf3f955d3/

pip install -r requirements.txt
  • First, preprocess the data, then prepare it in a format that fits the Transformer. The rest follows the fairseq S2T translation recipe for MuST-C.
  • This recipe yields the Vanilla model (the most basic end-to-end version). For the advanced training schemes, refer to the paper below.
python preprocessing.py

python prep_data.py --data-root dataset/ --task st --vocab-type unigram --vocab-size 8000

fairseq-train dataset/kr-en  --config-yaml config_st.yaml \
--train-subset train_st --valid-subset dev_st --save-dir result --num-workers 4 \
--max-tokens 40000 --max-update 50000 --task speech_to_text \
--criterion label_smoothed_cross_entropy --report-accuracy \
--arch s2t_transformer_s --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 --fp16 
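
To make the learning-rate flags in the command above concrete, here is a rough Python sketch of an inverse square root schedule with linear warm-up, using `--lr 2e-3` and `--warmup-updates 10000` from the command. The function name `inverse_sqrt_lr` is hypothetical, and the warm-up start value of 0 is an assumption about the scheduler's default; consult the fairseq source for the exact behavior.

```python
def inverse_sqrt_lr(step, peak_lr=2e-3, warmup_updates=10000, warmup_init_lr=0.0):
    """Learning rate at a given update under linear warm-up + inverse sqrt decay."""
    if step < warmup_updates:
        # Linear ramp from warmup_init_lr (assumed 0) up to peak_lr.
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    # After warm-up, decay proportionally to 1 / sqrt(step).
    return peak_lr * (warmup_updates / step) ** 0.5


if __name__ == "__main__":
    for step in (1000, 5000, 10000, 20000, 50000):
        print(step, inverse_sqrt_lr(step))
```

Under these assumptions the rate peaks at update 10,000 and has fallen back to half the peak by update 40,000, within the 50,000 updates set by `--max-update`.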

Acknowledgement

This work was supported by PAPAGO, NAVER Corp. The authors appreciate Hyoung-Gyu Lee, Eunjeong Lucy Park, Jihyung Moon, and Doosun Yoo for discussions and support. Also, the authors thank Taeyoung Jo, Kyubyong Park, and Yoon Kyung Lee for sharing the resources.

Copyright

Copyright 2021-present NAVER Corp.

License

The license of each subcorpus (including metadata and Korean scripts) follows the original license of the raw corpus. For KSS and Covid-ED, only academic use is permitted.

Citation

@inproceedings{cho21b_interspeech,
  author={Won Ik Cho and Seok Min Kim and Hyunchang Cho and Nam Soo Kim},
  title={{kosp2e: Korean Speech to English Translation Corpus}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3705--3709},
  doi={10.21437/Interspeech.2021-1040}
}

arXiv version is here.

Contact

For further questions, contact Won Ik Cho (tsatsuki@snu.ac.kr).
