To train the speech-recognition engine DeepSpeech, audio files should not be longer than 10 seconds. This repo therefore makes it easy to split audio files based on the subtitle information in srt files and to prepare the corresponding transcript files.
This section explains what the modules and the script do, in order to provide a deeper understanding of the individual steps and to make modification easier.
FYI: the script "convert_srt_to_csv.py" is meant to be used on srt files encoded in "cp1252" (a.k.a. Windows-1252). The reason for this is that, in order to keep characters such as "ä", the files have to be re-encoded to "utf-8". If your files are already in "utf-8", deactivate the module "change_encoding".
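For illustration, here is a minimal sketch of such a re-encoding step; it assumes the srt files live in "./srt_files" and are overwritten in place, which may differ from what the change_encoding module actually does.

```python
# Minimal sketch of a cp1252 -> utf-8 re-encoding step
# (illustrative only; the change_encoding module may differ in detail).
from pathlib import Path


def reencode(path: Path) -> None:
    """Read a cp1252-encoded subtitle file and write it back as utf-8."""
    text = path.read_text(encoding="cp1252")
    path.write_text(text, encoding="utf-8")


if __name__ == "__main__":
    for srt_file in Path("./srt_files").glob("*.srt"):
        reencode(srt_file)  # assumption: overwrite in place
```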
First: create a folder called "srt_files" where you store your srt files and a folder called "audio" where you store your audio files (wmv or mp4).
Check the folder Example Files to see how the information is extracted from an srt file into a csv file.
- change_encoding: The encoding of the srt files is changed from cp1252 to utf-8.
- convert_srt_to_csv: Start time, end time, and subtitle are extracted from the srt files and stored in a csv file. In preparation for audio splitting, an id column is generated from the filename plus a unique number (see the first sketch after this list).
- wmv_to_wav & mp4_to_wav: Extraction of audio from wmv or mp4 files.
- pre_process_audio: Audio is processed to meet DeepSpeech's requirements: a sample rate of 16 kHz and a bit depth of 16.
- split_files: The audio files are split according to the start and end times in the csv files. The resulting audio clips are named after the id given in the transcripts (see the second sketch after this list).
- create_DS_csv: This module creates a csv with filepaths and filesizes of all processed audio files.
- merge_csv: Joins all separate csv files.
- merge_transcripts_and_wav_files: This module matches the transcripts to the available audio files.
- clean_unwanted_characters: Unwanted characters are removed. After cleaning the transcripts, the text is extracted and saved in a txt file which can be used for training the language model.
- split_dataset: The final transcripts are split into train, test, and dev files and stored in "./final_csv" (train: 75%, test: 15%, dev: 10%).
- audio_metrics: After the converter has run successfully, audio metrics and metrics on the subsets of the dataset are provided.
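For orientation, here is a minimal sketch of the srt-to-csv idea referenced above. The column names, the exact id format, and the output folder "./csv" are illustrative assumptions, not necessarily the repo's choices; the id does follow the described scheme of filename plus a unique running number.

```python
# Illustrative sketch (not the repo's exact code): extract start time,
# end time and subtitle text from an srt file and write them to a csv.
import csv
import re
from pathlib import Path

# matches srt timestamp lines such as "00:00:01,600 --> 00:00:04,200"
TIME_LINE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def srt_to_rows(srt_path):
    rows, counter = [], 0
    for block in srt_path.read_text(encoding="utf-8").split("\n\n"):
        lines = [line for line in block.strip().splitlines() if line.strip()]
        if len(lines) < 3:
            continue
        match = TIME_LINE.match(lines[1])
        if not match:
            continue
        g = match.groups()
        start, end = to_seconds(*g[:4]), to_seconds(*g[4:])
        counter += 1
        rows.append({
            # id = filename plus a unique running number (assumed format)
            "id": f"{srt_path.stem}_{counter:04d}",
            "start": start,
            "end": end,
            "transcript": " ".join(lines[2:]),
        })
    return rows


if __name__ == "__main__":
    Path("./csv").mkdir(exist_ok=True)
    for srt_file in Path("./srt_files").glob("*.srt"):
        with open(f"./csv/{srt_file.stem}.csv", "w", newline="",
                  encoding="utf-8") as f:
            writer = csv.DictWriter(
                f, fieldnames=["id", "start", "end", "transcript"])
            writer.writeheader()
            writer.writerows(srt_to_rows(srt_file))
```

And a minimal sketch of the splitting step, assuming pydub as the audio backend and the csv layout from the sketch above; the actual pre_process_audio and split_files modules may use different tooling.

```python
# Illustrative sketch (not the repo's exact code): resample a wav file to
# 16 kHz / 16 bit mono and cut it into clips named after the csv ids.
import csv
from pathlib import Path

from pydub import AudioSegment  # assumption: pydub is used for splitting


def split_wav(wav_path, csv_path, out_dir):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    audio = AudioSegment.from_wav(wav_path)
    # enforce DeepSpeech's expected format: 16 kHz sample rate, 16 bit, mono
    audio = audio.set_frame_rate(16000).set_sample_width(2).set_channels(1)

    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            start_ms = int(float(row["start"]) * 1000)
            end_ms = int(float(row["end"]) * 1000)
            clip = audio[start_ms:end_ms]
            # the clip is named after the id so it can be matched to its
            # transcript later on
            clip.export(str(out_dir / f"{row['id']}.wav"), format="wav")


if __name__ == "__main__":
    split_wav("./audio/example.wav", "./csv/example.csv", "./split_audio")
```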
Estimated execution time: full processing with all of the above modules took 1h 11m for a 12 GB audio dataset.
In order to compile the language models required by DeepSpeech, alphabet.txt has to be adapted to the generated dataset.
The script check_characters.py (provided by DeepSpeech) generates a list of the characters that appear in the csv files. It can be run like this: "$ python3 ./Check_Characters/check_characters.py -csv './final_csv/dev.csv' -alpha"
Use this script to check that the text has been cleaned of unwanted characters. If unwanted characters appear, add them to the module "clean_unwanted_characters".
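The same idea can be reproduced in a few lines as a quick sanity check. This is a simplified illustration, not the DeepSpeech script itself, and it assumes the transcript column in the csv is named "transcript".

```python
# Simplified illustration of the idea behind check_characters.py:
# list every distinct character occurring in a csv's transcript column.
import csv
import sys


def distinct_characters(csv_path, column="transcript"):
    chars = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            chars.update(row[column])  # assumption: column is "transcript"
    return chars


if __name__ == "__main__":
    print(sorted(distinct_characters(sys.argv[1])))
```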
This importer was created as part of the Master's thesis "Automatic Speech Recognition for Swiss German using Deep Neural Networks" for the degree Master of Business Innovation at the University of St. Gallen by Tobias Rordorf. If you have any questions, feel free to contact me through LinkedIn.