Here are the steps I followed to implement an automatic speech recognition system for an air traffic control communication dataset during my internship at the Asr Gooyesh Company:
- First phase: searching for relevant articles and presenting them in PowerPoint format
- Second phase: fine-tuning the wav2vec2-large-xlsr-53 model on the ShEMO (Persian Speech Emotion Detection) database
- Third phase: fine-tuning the wav2vec2-base model on the English TIMIT dataset
- Fourth phase: fine-tuning the wav2vec2-large-robust model on an air traffic dataset
First phase: Searching for relevant articles and presenting them in PowerPoint format
After searching various journals on this subject with the help of my group mate, I collected the candidate articles in this google sheet, and from that sheet I selected the reference article titled "How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications". I then presented a summary of what I learned in this PowerPoint.
Second phase: Fine-tuning the wav2vec2-large-xlsr-53 model on the ShEMO (Persian Speech Emotion Detection) database
I ran the code available on HuggingFace to understand what fine-tuning a model is and how it is done. I started with a dataset in my mother tongue so that the preprocessing steps and the results would be easier to understand.
Third phase: Fine-tuning the wav2vec2-base model on the English TIMIT dataset
I ran the code available on HuggingFace on an English dataset to get closer to the main project, which was in English.
Fourth phase: Fine-tuning the wav2vec2-large-robust model on the air traffic dataset
The dataset used was ATCOSIM. It consists of ten hours of speech data recorded during ATC real-time simulations, automatically segmented, and orthographically transcribed. The utterances are in English and are spoken by ten non-native operational controllers.
The most important stages:
Prepare Data, Tokenizer, Feature Extractor:
- Generate a new CSV file with a column that holds the audio file paths
- Load the train and test datasets
- Split the dataset into train and test sets
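The data preparation steps can be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not the exact code used in the project. The column names `path` and `sentence`, the `transcripts` dictionary (file stem mapped to transcript text), and the 80/20 split are assumptions made for illustration.

```python
import csv
import random
from pathlib import Path


def build_manifest(audio_dir, transcripts, csv_path):
    """Write a CSV with one row per WAV file: its path and its transcript.

    `transcripts` is a hypothetical dict mapping a file stem to its text.
    """
    rows = []
    for wav in sorted(Path(audio_dir).rglob("*.wav")):
        text = transcripts.get(wav.stem)
        if text is None:
            continue  # skip audio files that have no transcript
        rows.append({"path": str(wav), "sentence": text})

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "sentence"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed keeps the split identical between runs, so WER values from different training runs stay comparable.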
Create Wav2Vec2 Feature Extractor:
- Downsample the data, because the ATCOSIM dataset is sampled at 32 kHz, while the pre-trained wav2vec2 model expects 16 kHz input
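To illustrate what downsampling from 32 kHz to 16 kHz involves, here is a dependency-free sketch: low-pass filter the signal to remove content above the new Nyquist frequency (8 kHz), then keep every second sample. The function names, the 63-tap Hamming-windowed sinc filter, and the cutoff are assumptions chosen for illustration. In practice a library resampler (for example `scipy.signal.resample_poly`) would be the usual, much faster choice.

```python
import math


def lowpass_kernel(num_taps=63, cutoff=0.25):
    """Hamming-windowed sinc low-pass filter.

    `cutoff` is in cycles per input sample: 0.25 corresponds to 8 kHz at 32 kHz.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at DC


def downsample_by_two(samples, kernel=None):
    """Filter `samples` and keep every second output (32 kHz -> 16 kHz)."""
    if kernel is None:
        kernel = lowpass_kernel()
    half = len(kernel) // 2
    out = []
    # Evaluate the filter only at the even positions that are kept.
    for center in range(0, len(samples), 2):
        acc = 0.0
        for k, tap in enumerate(kernel):
            idx = center + k - half
            if 0 <= idx < len(samples):  # treat samples outside the signal as zero
                acc += tap * samples[idx]
        out.append(acc)
    return out
```

Skipping the low-pass step and simply dropping every second sample would fold any content above 8 kHz back into the speech band as aliasing.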
Preprocess Data:
- Add a "speech" column to the dataset that holds the audio samples read from each file
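A minimal sketch of a function that fills the "speech" column is shown below. It assumes 16-bit PCM WAV files and a `path` field on each row; the `sampling_rate` field and the function name are additions made for illustration, not taken from the project code.

```python
import array
import sys
import wave


def add_speech_column(row):
    """Read the 16-bit PCM WAV at row["path"] and return the row with a "speech" field."""
    with wave.open(row["path"], "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Keep the first channel only and scale to the range [-1.0, 1.0).
    speech = [s / 32768.0 for s in samples[::channels]]
    return {**row, "speech": speech, "sampling_rate": rate}
```

With the Hugging Face `datasets` library, a per-row function of this shape is typically applied to the whole dataset through `Dataset.map`.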
Training and Evaluation:
- Prepare the training arguments for the pre-trained model
- After training, the model reaches a WER of around 0.3, which is reasonable.
- In the final step, we evaluate the model and inspect ten random examples of its predictions; the WER is 35%.
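For reference, the word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The sketch below computes it with a word-level edit distance. It is a plain-Python illustration with hypothetical function names, not the evaluation code used in the project.

```python
def edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions and insertions."""
    # previous[j] = distance between the ref words seen so far and the first j hyp words
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def corpus_wer(references, hypotheses):
    """Total edits over all utterances divided by the total number of reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    edits = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        edits += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return edits / words
```

For example, if the reference is "climb flight level three two zero" and the model outputs "climb flight level three zero", one of six reference words is missing, so the WER for that utterance is 1/6.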