This repository contains the Lab 2 activities of the course ID2223 Scalable Machine Learning and Deep Learning at KTH. Tasks 1 and 2 are described here.
In this repository you can find:
- Fine-tuning of the Whisper small model on Italian (Task 1 of Lab 2)
- Fine-tuning of the Whisper small model on Italian, divided into two different pipelines (Task 2 of Lab 2)
Task 1 consists of adapting the "Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers" blog post by sanchit-gandhi, applying his approach to another language, the authors' mother tongue: Italian.
Both projects consist of Jupyter notebooks designed to be run in Google Colab. The final application is hosted on HuggingFace Spaces.
The project can be divided into three main pipelines:
- Data preparation and feature engineering: the features are taken from the Common Voice 11.0 dataset and prepared for training. The dataset was saved on Google Drive, which was used as a feature store.
- Training and evaluation: the training is carried out on Google Colab using a free account. Checkpoints were used to overcome the limited availability of GPUs on Colab. Finally, the model was uploaded to HuggingFace, which was used as a model registry.
- Inference program: an application was created and hosted on HuggingFace Spaces using Gradio. This application lets users try out the model's functionality.
The data is taken from the Common Voice 11.0 dataset. When fetching the train and test sets, only 10% of the original dataset is used, due to the limited space, time, and computing capabilities of our setup.
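A minimal sketch of how such a 10% subset can be loaded with 🤗 Datasets (the exact split strings are assumptions; the notebook may combine the splits differently):

```python
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

# Take only 10% of each split; slicing the split string keeps the
# download and processing requirements manageable.
common_voice["train"] = load_dataset(
    "mozilla-foundation/common_voice_11_0", "it",
    split="train[:10%]+validation[:10%]", use_auth_token=True,
)
common_voice["test"] = load_dataset(
    "mozilla-foundation/common_voice_11_0", "it",
    split="test[:10%]", use_auth_token=True,
)
```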
The next step is to remove the attributes that are not useful for our training, namely accent, age, client_id, down_votes, gender, locale, path, segment and up_votes.
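This maps directly onto the `remove_columns` method of 🤗 Datasets:

```python
# Drop metadata columns that are not needed for training;
# only "audio" and "sentence" are kept.
common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes",
     "gender", "locale", "path", "segment", "up_votes"]
)
```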
The following step is to down-sample the audio input from 48 kHz to 16 kHz, since 16 kHz is the sampling rate expected by the Whisper model. To do this we use the Whisper Feature Extractor.
After this data preparation, we save the dataset to Google Drive, which we use as a feature store.
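Assuming the preparation follows the original blog post, the resampling, feature extraction, and save step look roughly like this (the Drive path is illustrative):

```python
from datasets import Audio
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="Italian", task="transcribe"
)

# Re-sample on the fly from 48 kHz to the 16 kHz expected by Whisper.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

def prepare_dataset(batch):
    audio = batch["audio"]
    # Compute log-Mel input features from the raw waveform.
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Encode the reference transcription to label ids.
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(
    prepare_dataset, remove_columns=common_voice.column_names["train"]
)

# Persist the prepared features to the mounted Google Drive "feature store"
# (the path below is illustrative).
common_voice.save_to_disk("/content/drive/MyDrive/whisper_it_features")
```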
Note: this pipeline can be run on a CPU to save GPU computing resources.
Now that we've prepared our data, we're ready to dive into the training pipeline. The HuggingFace Trainer will do much of the heavy lifting for us. All we have to do is:
- Define a data collator: the data collator takes our pre-processed data and prepares PyTorch tensors ready for the model.
- Evaluation metrics: during evaluation, we want to evaluate the model using the word error rate (WER) metric. We need to define a compute_metrics function that handles this computation (a sketch follows this list).
- Load a pre-trained checkpoint: we need to load a pre-trained checkpoint and configure it correctly for training.
- Define the training configuration: this will be used by the HuggingFace Trainer to define the training schedule. Once we've fine-tuned the model, we will evaluate it on the test data to verify that we have correctly trained it to transcribe speech in Italian.
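As an illustration of the metrics step, a compute_metrics function based on the 🤗 Evaluate WER metric (reusing the WhisperTokenizer loaded during data preparation) could look like this:

```python
import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace -100 (ignored positions) with the pad token so they can be decoded.
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # Word error rate, reported as a percentage.
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
```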
For training we set a series of hyper-parameters, summarized below:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- training_steps: 4000
- gradient_accumulation_steps: 2
- save_steps: 100
- eval_steps: 100
The save_steps value was sized to let Google Colab run for a while, while avoiding the situation where the virtual machine's available execution time runs out without anything having been saved. The gradient_accumulation_steps value was increased to 2 to fit within the memory of the GPU (NVIDIA T4 Tensor Core with 15360 MiB of VRAM). Other values were left as in the original blog post.
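Translated into a Seq2SeqTrainingArguments configuration, the values above look roughly as follows. The output directory is illustrative, whether train_batch_size refers to the per-device or the effective batch size is an assumption, and fp16 and predict_with_generate follow the original blog post:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-it",   # illustrative; checkpoints land here
    learning_rate=1e-5,
    per_device_train_batch_size=16,    # listed as train_batch_size above
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    max_steps=4000,                    # training_steps above
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=100,                    # frequent checkpoints for Colab time limits
    fp16=True,
    predict_with_generate=True,        # generate tokens at eval time for WER
    push_to_hub=True,                  # mirror checkpoints to the Hub
)
```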
At the end of training, the model is pushed to HuggingFace, which acts as a model registry.
Note: the notebook was run 15 times, at approximately 40 min for each 100 steps of training, for a total of 26.5 h of training. Keep in mind that Google Colab was available to us for no more than 4 h a day, so around 7 days were necessary for training alone.
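In practice, resuming across sessions and the final upload reduce to two Trainer calls:

```python
# After the first session, each new Colab run picks up from the most
# recent checkpoint saved in output_dir rather than restarting at step 0.
trainer.train(resume_from_checkpoint=True)

# Once all 4000 steps are done, publish the fine-tuned model to the Hub,
# which acts as our model registry.
trainer.push_to_hub()
```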
The Gradio application consists of two interfaces:
- Audio or Microphone: takes a recording or an audio file as input and outputs a written transcription
- YouTube: takes a YouTube link as input and outputs a written transcription.
In addition to the transcription, the application is connected to the text2tags model, which generates tags for a given text. These tags are output after the transcription.
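A condensed sketch of the first interface, assuming the fine-tuned model has been pushed to the Hub (the model id below is a placeholder):

```python
import gradio as gr
from transformers import pipeline

# Placeholder model id; substitute the fine-tuned checkpoint on the Hub.
asr = pipeline("automatic-speech-recognition", model="your-username/whisper-small-it")

def transcribe(audio_path):
    # The pipeline accepts a path to an audio file and returns the transcription.
    return asr(audio_path)["text"]

# Single-tab sketch; the actual app combines an audio/microphone tab and a
# YouTube tab (e.g. via gr.TabbedInterface) and appends the text2tags output.
demo = gr.Interface(fn=transcribe, inputs=gr.Audio(type="filepath"), outputs="text")
demo.launch()
```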
The model achieves the following results:
- Loss: 0.4549
- WER: 200.40
The WER is very high; a WER above 100% means the output contains more errors than the reference has words. This is likely caused mainly by the limited size of the dataset we used, which is only 10% of the original. See the Improving model performance section for the full discussion.
The intermediate values are the following:
Run Number | Step | Training Loss | Validation Loss | WER |
---|---|---|---|---|
1 | 100 | 1.2396 | 1.2330 | 176.40 |
2 | 200 | 0.7389 | 0.8331 | 80.49 |
2 | 300 | 0.2951 | 0.4261 | 70.20 |
2 | 400 | 0.2703 | 0.4051 | 101.60 |
3 | 500 | 0.2491 | 0.3923 | 112.20 |
3 | 600 | 0.1700 | 0.3860 | 107.10 |
3 | 700 | 0.1603 | 0.3836 | 90.36 |
4 | 800 | 0.1607 | 0.3786 | 135.00 |
4 | 900 | 0.1540 | 0.3783 | 99.05 |
4 | 1000 | 0.1562 | 0.3667 | 98.32 |
4 | 1100 | 0.0723 | 0.3757 | 158.90 |
5 | 1200 | 0.0769 | 0.3789 | 215.20 |
5 | 1300 | 0.0814 | 0.3779 | 170.50 |
5 | 1400 | 0.0786 | 0.3770 | 140.60 |
5 | 1500 | 0.0673 | 0.3777 | 137.10 |
6 | 1600 | 0.0339 | 0.3892 | 166.50 |
7 | 1700 | 0.0324 | 0.3963 | 170.90 |
7 | 1800 | 0.0348 | 0.4004 | 163.40 |
8 | 1900 | 0.0345 | 0.4016 | 158.60 |
8 | 2000 | 0.0346 | 0.4020 | 176.10 |
8 | 2100 | 0.0317 | 0.4001 | 134.70 |
9 | 2200 | 0.0173 | 0.4141 | 189.30 |
9 | 2300 | 0.0174 | 0.4106 | 175.00 |
9 | 2400 | 0.0165 | 0.4204 | 179.60 |
10 | 2500 | 0.0172 | 0.4185 | 186.10 |
10 | 2600 | 0.0142 | 0.4175 | 181.10 |
11 | 2700 | 0.0090 | 0.4325 | 161.70 |
11 | 2800 | 0.0069 | 0.4362 | 161.20 |
11 | 2900 | 0.0093 | 0.4342 | 157.50 |
12 | 3000 | 0.0076 | 0.4352 | 154.50 |
12 | 3100 | 0.0089 | 0.4394 | 184.30 |
13 | 3200 | 0.0063 | 0.4454 | 166.00 |
13 | 3300 | 0.0059 | 0.4476 | 179.20 |
13 | 3400 | 0.0058 | 0.4490 | 189.60 |
14 | 3500 | 0.0051 | 0.4502 | 194.20 |
14 | 3600 | 0.0064 | 0.4512 | 187.40 |
14 | 3700 | 0.0053 | 0.4520 | 190.20 |
14 | 3800 | 0.0049 | 0.4545 | 194.90 |
15 | 3900 | 0.0052 | 0.4546 | 199.60 |
15 | 4000 | 0.0054 | 0.4549 | 200.40 |
Several approaches are possible; let's go over some of the options:
- Regularization: implementing regularization techniques such as early stopping can prevent over-fitting (see the sketch after this list).
- Hyper-parameter tuning: tuning hyper-parameters such as learning rate and batch size can improve the performance of the model.
- Optimization algorithms: trying a different optimizer, such as SGD, can improve overall performance.
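Indeed, the table above shows the validation WER bottoming out early (around steps 300 to 1000) while training ran to step 4000, so early stopping is a natural first fix. A minimal sketch on top of the existing Trainer, with an assumed patience value:

```python
from transformers import EarlyStoppingCallback

# Stop once the validation WER has not improved for 5 consecutive evaluations
# (the patience value is an assumption). This also requires setting, in
# Seq2SeqTrainingArguments: load_best_model_at_end=True,
# metric_for_best_model="wer", greater_is_better=False.
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=5))
```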
The simplest approach to improving the model's performance would be to fine-tune the Whisper model on the whole common_voice_11_0 dataset instead of 10% of it. This would be possible only with more computing resources (or time) and storage space. For example, we could use Colab Pro to access more powerful GPUs for longer. This would allow the training to run smoothly even on more samples, instead of having to break the computation into a large number of separate runs. Keep in mind that 10% of the dataset occupied 16 GB of storage on Google Drive, so the full dataset might also require additional Google Drive storage.
Another approach to improving the model's performance would be to train the model on a larger and more recent dataset, for example the more recent version of the common_voice dataset by the Mozilla Foundation, i.e. common_voice_15_0. Note that even with more recent data, if the size of the dataset remains small, the results won't improve by much.
Besides improving the overall performance of the model, other techniques let the computation use less time or fewer resources, such as computing power or storage space. Note that the checkpointing requirement of this project increases training time by about 20%. We also used gradient accumulation, which allowed us to take advantage of the full GPU without running out of memory. These are the techniques that were taken into consideration in this project. More techniques are available here
- Visual Studio Code - main IDE
- GitKraken - git versioning
- Google Colab - running environment
- HuggingFace - dataset, model registry, GUI
- Gradio - GUI