In this repo, I focus on building an end-to-end speech recognition pipeline using QuartzNet and wav2vec 2.0, with a CTC decoder backed by a beam search algorithm and a language model.
Here I used the public 100-hour speech dataset from VinBigData, which is a small, clean subset of the VLSP 2020 ASR competition data. Some information about this dataset can be found in `data/Data_Workspace.ipynb`. The data format I use for training and evaluation follows LJSpeech, so I created `data/custom.py` to convert the given dataset into that layout.
```bash
mkdir data/LJSpeech-1.1
python data/custom.py   # create data format for training QuartzNet & wav2vec 2.0
```
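For orientation, here is a minimal sketch of what such a conversion script can look like. The VLSP folder layout, the per-utterance `.txt` transcript convention, and all paths below are assumptions for illustration, not the actual contents of `data/custom.py`:

```python
# Hypothetical sketch: convert a VLSP-style folder into the LJSpeech layout
# the training code expects. Paths and the "transcript .txt next to each wav"
# convention are assumptions, not the real logic of data/custom.py.
import csv
import shutil
from pathlib import Path

SRC = Path("data/vlsp2020_train_set_02")   # original dataset (assumed layout)
DST = Path("data/LJSpeech-1.1")
WAVS = DST / "wavs"
WAVS.mkdir(parents=True, exist_ok=True)

rows = []
for wav in sorted(SRC.glob("**/*.wav")):
    txt = wav.with_suffix(".txt")          # assumed: transcript next to each wav
    if not txt.exists():
        continue
    transcript = txt.read_text(encoding="utf-8").strip()
    shutil.copy(wav, WAVS / wav.name)
    rows.append((wav.stem, transcript))

# LJSpeech-style metadata: pipe-separated "file name|transcript" lines
with open(DST / "metadata.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerows(rows)
```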
Below is the folder structure I used. Note that `metadata.csv` has 2 columns, `file name` and `transcript`:
```
├───data
│   ├───LJSpeech-1.1
│   │   ├───wavs
│   │   └───metadata.csv
│   └───vlsp2020_train_set_02
├───datasets
├───demo
├───models
│   └───quartznet
│       └───base
├───tools
└───utils
```
You can create your environment and install the requirements file; note that `torch` should be installed to match your CUDA version. With conda:
```bash
cd Vietnamese-Speech-Recognition
conda create -n asr
conda activate asr
conda install --file requirements.txt
```
Also, you need to install ctcdecode:
```bash
git clone --recursive https://github.com/parlance/ctcdecode.git
cd ctcdecode && pip install . && cd ..
```
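Once installed, `ctcdecode` exposes a `CTCBeamDecoder` that can rescore beams with a KenLM model, which is how the beam search plus language model decoding fits together. A minimal sketch of the decoding call (the label set, the KenLM path, and the alpha/beta weights below are placeholders, not this repo's actual config):

```python
import torch
from ctcdecode import CTCBeamDecoder

# Placeholder label set; index 0 is the CTC blank. The real vocabulary
# comes from the youtokentome tokenizer used in this repo.
labels = ["_", " ", "a", "b", "c"]

decoder = CTCBeamDecoder(
    labels,
    model_path="lm.arpa",   # assumed KenLM language model file
    alpha=0.5,              # LM weight (placeholder value)
    beta=1.0,               # word insertion bonus (placeholder value)
    beam_width=100,
    blank_id=0,
    log_probs_input=True,
)

# Fake acoustic-model output: (batch, time, vocab) log-probabilities
log_probs = torch.randn(1, 50, len(labels)).log_softmax(dim=-1)
beam_results, beam_scores, timesteps, out_lens = decoder.decode(log_probs)

# Best beam of the first utterance, mapped back to characters
best = beam_results[0][0][: out_lens[0][0]]
print("".join(labels[i] for i in best))
```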
To train the QuartzNet model, you can run:
```bash
python3 tools/train.py --config configs/config.yaml
```
And to evaluate QuartzNet:
```bash
python3 tools/evaluate.py --config configs/config.yaml
```
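ASR evaluation is usually reported as word error rate (WER) and character error rate (CER). Whether `tools/evaluate.py` uses the `jiwer` package is an assumption on my part, but it is a quick way to sanity-check a few decoded samples yourself:

```python
# Quick WER/CER check on a few decoded samples; jiwer is an assumed
# extra dependency, not necessarily what tools/evaluate.py uses.
import jiwer

references = ["xin chào việt nam", "nhận dạng tiếng nói"]
hypotheses = ["xin chào việt nam", "nhận dạng tiếng nối"]

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```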
Or, if you want to fine-tune the wav2vec 2.0 model from a Vietnamese pretrained checkpoint:
```bash
python3 tools/fintune_w2v.py
```
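The fine-tuning entry point builds on the usual HuggingFace workflow. As a rough sketch of the starting point, note that `nguyenvulebinh/wav2vec2-base-vietnamese-250h` is one publicly available Vietnamese wav2vec 2.0 checkpoint; whether this repo uses that exact model is an assumption:

```python
# Sketch of loading a Vietnamese wav2vec 2.0 checkpoint for CTC fine-tuning.
# The checkpoint name is an assumption; substitute whatever you actually use.
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

name = "nguyenvulebinh/wav2vec2-base-vietnamese-250h"
processor = Wav2Vec2Processor.from_pretrained(name)
model = Wav2Vec2ForCTC.from_pretrained(name)

# Common practice: keep the convolutional feature extractor frozen and
# fine-tune only the transformer layers and the CTC head.
model.freeze_feature_encoder()
```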
This time, I also provide a small Streamlit app for an ASR demo, which you can run with:
```bash
streamlit run demo/app.py
```
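If you want to adapt the demo, the core of a Streamlit ASR page is only a few lines. A minimal sketch, where `transcribe` is a hypothetical stand-in for the repo's actual inference code in `demo/app.py`:

```python
# Minimal Streamlit ASR demo sketch; `transcribe` is a hypothetical
# placeholder for the repo's real model inference.
import streamlit as st

def transcribe(wav_bytes: bytes) -> str:
    return "(model output goes here)"  # plug in QuartzNet / wav2vec 2.0 here

st.title("Vietnamese Speech Recognition Demo")
uploaded = st.file_uploader("Upload a .wav file", type=["wav"])
if uploaded is not None:
    st.audio(uploaded)              # play back the uploaded audio
    st.write(transcribe(uploaded.read()))
```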
I used wandb and TensorBoard for logging results and artifacts during training; here are some visualizations after several epochs:
| QuartzNet | wav2vec 2.0 |
|---|---|
- Mainly based on this implementation
- The paper
- Vietnamese ASR - VietAI
- Lightning-Flash repo
- Tokenizer from youtokentome
- Language model: KenLM