E2E method for converting spoken numbers to text numbers
P.S.
This project is a test task. Its goal is to show how the problem can be approached; quality and results have lower priority.
The method takes a Russian-speech WAV audio file (mono, 16000 Hz) as input to a model based on QuartzNet, a deep convolutional neural network.
The implementation of the model is taken from the NeMo framework.
The method uses an ASR model trained on a specific vocabulary that covers all possible verbal transcriptions of numbers up to 1,000,000.
It also uses the open-source libraries text2num and num2words.
The training dataset contains file paths and the numbers spoken in each audio file. The numbers were translated into text with the open-source library num2words, and the ASR model was then trained to recognize spoken numbers. At inference time, the transcription of the audio file is converted back to a number with the text2num library.
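For illustration, here is a minimal sketch of this pipeline. The checkpoint name quartznet_numbers.nemo and the audio path are placeholders (not files from this repo), and it assumes num2words, text2num, and NeMo are installed and that the installed text2num version supports Russian:

```python
import nemo.collections.asr as nemo_asr
from num2words import num2words
from text_to_num import text2num  # the text2num package installs the text_to_num module

# Training labels: numbers are rendered as Russian words with num2words
print(num2words(125, lang="ru"))  # "сто двадцать пять"

# Inference: restore a QuartzNet (CTC) model and transcribe a mono 16 kHz WAV.
# Checkpoint name and audio path below are placeholders.
model = nemo_asr.models.EncDecCTCModel.restore_from("quartznet_numbers.nemo")
transcription = model.transcribe(["examples/sample.wav"])[0]

# Post-processing: convert the verbal transcription back to a number
print(text2num(transcription, lang="ru"))
```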
To run inference, run the script with the path to the input CSV file and the path to the output file. The input file must contain one column: path. The output file will contain two columns: path and number.
Example:
sh inference.sh examples/example_input.csv results.csv
Examples of input and output files can be found in the folder examples/
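As a sketch, the snippet below builds a minimal input CSV and reads the results in Python. The audio path is hypothetical, and the exact header spelling should be checked against examples/example_input.csv:

```python
import csv

# Build a minimal input CSV with the single required column "path"
with open("my_input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["path"])
    writer.writerow(["examples/sample.wav"])  # hypothetical mono 16 kHz WAV

# Run: sh inference.sh my_input.csv my_results.csv

# Read back the two output columns: path and number
with open("my_results.csv") as f:
    for row in csv.DictReader(f):
        print(row["path"], row["number"])
```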
Below are the validation graphs from model training.
The model was also evaluated on test data:
- WER: 0.1313
- CER: 0.0532
However, when the model was tested on in-the-wild data, the results were poor. This can be addressed by increasing the number of unique speakers and the amount of training data.
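For reference, WER and CER such as those reported above can be computed with NeMo's metric helper; the hypothesis and reference strings below are illustrative only, not taken from the test set:

```python
from nemo.collections.asr.metrics.wer import word_error_rate

references = ["сто двадцать пять"]   # ground-truth transcription (illustrative)
hypotheses = ["сто двадцать шесть"]  # model output (illustrative)

wer = word_error_rate(hypotheses=hypotheses, references=references)
cer = word_error_rate(hypotheses=hypotheses, references=references, use_cer=True)
print(f"WER={wer:.4f}, CER={cer:.4f}")
```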
- Python 3.8 or above
- PyTorch 1.10.0 or above
- NVIDIA GPU for fast inference
Creating an environment
conda create --name russian_numbers python=3.8
conda activate russian_numbers
conda install pytorch torchaudio cudatoolkit=11.3 -c pytorch
apt-get update && apt-get install -y libsndfile1 ffmpeg
# or pip3
pip install Cython
pip install nemo_toolkit['asr']
pip install text_unidecode
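After installation, a quick sanity check (this snippet is just a convenience, not part of the repo) is to confirm that the core dependencies import and that a GPU is visible:

```python
# Verify that PyTorch and the NeMo ASR collection import and that CUDA is available
import torch
import nemo.collections.asr as nemo_asr

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("NeMo ASR collection loaded:", nemo_asr.__name__)
```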
- Create end-to-end inference without using the bash console
- Create a training guide
- Improve data labeling (in this case, automatic labeling was used)
- Use more training data (in this case, 5,500 samples were used)
- Increase the model size (up to 2 MB; in this case ~500 KB, or ~94k parameters)