lexical-stress-detection is a deep learning model that identifies whether a vowel phoneme in an isolated word is stressed or unstressed.
The image below summarizes the project.
To retrain the model, follow the steps below:
The first step of feature extraction is forced phoneme alignment of audio files. Refer to the alignment readme.
Phoneme alignment needs the audio files in .wav format. If you have .flac files, use this script to convert them to .wav files.
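A minimal conversion sketch, assuming the soundfile package and a simple folder layout (the paths below are placeholders, not the linked script):

```python
# Hypothetical .flac -> .wav batch converter; adjust the paths to your dataset layout.
from pathlib import Path
import soundfile as sf

def flac_to_wav(src_dir: str, dst_dir: str) -> None:
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for flac_path in Path(src_dir).rglob("*.flac"):
        audio, sample_rate = sf.read(flac_path)                        # decode the FLAC file
        sf.write(dst / (flac_path.stem + ".wav"), audio, sample_rate)  # write it out as WAV

if __name__ == "__main__":
    flac_to_wav("data/flac", "data/wav")
```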
This process extracts spectral and non-spectral features from the audio of each phoneme, stores them as NumPy arrays and writes them to disk as .npy files.
Since stress on a particular vowel phoneme is related to the other vowel phonemes within that word, the features of each phoneme are sandwiched between the features of the preceding and succeeding phonemes.
For each phoneme, two files are generated:
- `*_mfcc.npy`: Spectral features - 13 MFCCs for 10 frames, their derivatives and double derivatives, represented as a matrix of shape `13 x 30`. Preceding and succeeding phoneme features are added as channels, so the final shape of the matrix is `3 x 13 x 30`. Refer to `mfcc_extraction.py` (a layout sketch follows this list).
- `*_other.npy`: Non-spectral features - 6 non-spectral features for the phoneme, represented as a vector of length 6, which becomes a vector of length 18 after including the preceding and succeeding phoneme features. Refer to `non_mfcc_extraction.py`.
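For illustration only, here is a sketch of how the `3 x 13 x 30` spectral block could be assembled with librosa; the fixed 10-frame resampling and the function names below are assumptions, not the exact logic of `mfcc_extraction.py`.

```python
# Illustrative layout of the spectral features; not the project's mfcc_extraction.py.
import numpy as np
import librosa

def phoneme_mfcc_block(audio: np.ndarray, sr: int, n_frames: int = 10) -> np.ndarray:
    """Return a 13 x 30 matrix: 13 MFCCs over 10 frames plus deltas and double deltas."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)            # 13 x T
    idx = np.linspace(0, mfcc.shape[1] - 1, n_frames).astype(int)     # assumed: resample to 10 frames
    mfcc = mfcc[:, idx]                                               # 13 x 10
    delta = librosa.feature.delta(mfcc)                               # 13 x 10
    delta2 = librosa.feature.delta(mfcc, order=2)                     # 13 x 10
    return np.hstack([mfcc, delta, delta2])                           # 13 x 30

def sandwich(prev_blk: np.ndarray, cur_blk: np.ndarray, next_blk: np.ndarray) -> np.ndarray:
    """Stack preceding, current and succeeding phoneme features as channels (3 x 13 x 30)."""
    return np.stack([prev_blk, cur_blk, next_blk])
```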
Sample generation is done by running the `sample_generation.py` script, which takes three inputs as command line arguments:
- Root path of the folder with wav files split into phonemes. This is the output of phoneme alignment.
- Tab-separated file with phoneme info. This is the csv generated by phoneme alignment.
- Output path where npy files will be generated. Output will be split into three folders, 0, 1 and 2, containing unstressed, primary stress and secondary stress phoneme features respectively.
The sample generation script is parallelized; a CPU with 16 or more cores is recommended for running it.
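As a rough illustration of how that parallelization could look (the worker function and pool size below are assumptions, not the actual internals of `sample_generation.py`):

```python
# Hypothetical sketch of parallel per-phoneme feature extraction.
from multiprocessing import Pool

def extract_features_for_phoneme(phoneme_row):
    """Placeholder worker: compute and save the *_mfcc.npy / *_other.npy pair for one phoneme."""
    ...

if __name__ == "__main__":
    phoneme_rows = []  # rows parsed from the tab-separated phoneme-alignment file
    with Pool(processes=16) as pool:  # one worker per core, matching the 16-core recommendation
        pool.map(extract_features_for_phoneme, phoneme_rows)
```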
After sample generation, there were roughly twice as many primary stress phonemes as unstressed ones, while secondary stress phonemes were a very small percentage of the total. For this project we completely ignored secondary stress and randomly downsampled the primary stress features to approximately the number of unstressed features.
We removed the features of 80 stop words. Since the npy file names contain the word, a simple script can be written for this, as sketched below.
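A minimal sketch of such a script, assuming the file names follow the `<corpus>_<utterance>_<word>_<phoneme>_<suffix>.npy` pattern visible in the sample csv further below, and that `STOP_WORDS` holds the 80-word list:

```python
# Hypothetical stop-word filter; the file-name pattern and word position are assumptions.
from pathlib import Path

STOP_WORDS = {"the", "a", "an", "is", "but"}  # replace with the full 80-word list

def remove_stop_word_samples(root: str) -> None:
    for npy_path in Path(root).rglob("*.npy"):
        parts = npy_path.stem.split("_")       # e.g. libri_5808-54425-0000_is_ih0_mfcc
        word = parts[-3] if len(parts) >= 3 else ""
        if word.lower() in STOP_WORDS:
            npy_path.unlink()                  # removes both *_mfcc.npy and *_other.npy files as they are visited
```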
Use `train_test_split.py` to split the data into train and val sets. The script takes four input parameters as command line arguments (a rough sketch of such a split follows the list):
- Root path of the folder where npy sample files are stored.
- Train path
- Test path
- Test percentage - a floating point number in the range (0, 1). We used 0.15.
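For reference, such a split could look roughly like this; this is not the repo's `train_test_split.py`, and the per-label folder layout is assumed from the sample-generation output:

```python
# Illustrative stratified split of npy samples into train/test folders; not the repo script.
import random
import shutil
from pathlib import Path

def split_samples(root: str, train_dir: str, test_dir: str, test_pct: float = 0.15) -> None:
    for class_dir in sorted(Path(root).iterdir()):            # per-label folders such as 0 and 1
        if not class_dir.is_dir():
            continue
        files = sorted(class_dir.glob("*.npy"))
        random.shuffle(files)                                  # note: keep *_mfcc/*_other pairs together in practice
        n_test = int(len(files) * test_pct)
        for i, f in enumerate(files):
            dest = Path(test_dir if i < n_test else train_dir) / class_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            shutil.copy(f, dest / f.name)
```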
The model is a combination of a CNN and a DNN. Spectral features are fed into the CNN and the non-spectral features into the DNN. The outputs of these networks are concatenated and fed into another DNN, and finally a softmax loss layer is used.
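A minimal PyTorch-style sketch of this two-branch architecture; the layer sizes, channel counts and activations below are assumptions, not the repo's exact model definition:

```python
# Hypothetical two-branch model: CNN over the 3 x 13 x 30 spectral input, DNN over the 18 non-spectral features.
import torch
import torch.nn as nn

class StressModel(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.cnn = nn.Sequential(                        # spectral branch
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16 x 6 x 15
            nn.Flatten(),
        )
        self.dnn = nn.Sequential(                        # non-spectral branch
            nn.Linear(18, 32),
            nn.ReLU(),
        )
        self.head = nn.Sequential(                       # joint classifier
            nn.Linear(16 * 6 * 15 + 32, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),                  # logits; pair with nn.CrossEntropyLoss
        )

    def forward(self, spectral: torch.Tensor, non_spectral: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([self.cnn(spectral), self.dnn(non_spectral)], dim=1)
        return self.head(joint)
```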
The `training.py` script takes five command line arguments:
- Root path of train data
- Root path of val data
- Path where model checkpoints are saved. If an existing checkpoint is present in this folder, training will resume from that checkpoint (sketched below).
- Learning rate
- Number of epochs
Hyperparameters like batch size can be changed in this script.
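For illustration, the resume-from-checkpoint behaviour could look roughly like this (the checkpoint file pattern and saved keys are assumptions):

```python
# Hypothetical checkpoint-resume logic; the file pattern and dict keys are assumptions.
from pathlib import Path
import torch

def load_latest_checkpoint(model, optimizer, checkpoint_dir: str) -> int:
    """Restore model/optimizer state if a checkpoint exists; return the epoch to resume from."""
    checkpoints = sorted(Path(checkpoint_dir).glob("*.pt"))
    if not checkpoints:
        return 0                                    # no checkpoint: start training from scratch
    state = torch.load(checkpoints[-1], map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1                       # resume after the last completed epoch
```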
It also generates a file `data_check_test.csv` with information about predictions on the val set. This is useful for debugging which samples are incorrectly classified (see the pandas snippet after the sample csv below). The five columns in the file are:
- path: full path of the npy file
- label: true label of the sample
- pred: prediction by the model
- prob_0: probability of predicting unstressed
- prob_1: probability of predicting primary stress
Sample csv file:
```
path,label,pred,prob_0,prob_1
test/0/libri_5808-54425-0000_is_ih0_mfcc.npy,0,0,0.9996665716171265,0.00033342366805300117
test/1/libri_5808-54425-0000_years_ih1_mfcc.npy,1,1,6.26739677045407e-07,0.9999994039535522
test/1/libri_5808-54425-0000_five_ay1_mfcc.npy,1,1,2.3276076888123498e-07,0.9999997615814209
test/1/libri_5808-54425-0000_but_ah1_mfcc.npy,1,1,4.2122044874304265e-07,0.9999995231628418
```
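Misclassified samples can be pulled out of this file with a few lines of pandas, for example (this helper is not part of the repo):

```python
# Hypothetical helper for inspecting misclassified val samples in data_check_test.csv.
import pandas as pd

df = pd.read_csv("data_check_test.csv")
misclassified = df[df["label"] != df["pred"]]
# Sort by the confidence of the wrong prediction to surface the worst offenders first.
misclassified = misclassified.assign(
    wrong_conf=misclassified[["prob_0", "prob_1"]].max(axis=1)
).sort_values("wrong_conf", ascending=False)
print(misclassified[["path", "label", "pred", "wrong_conf"]].head(20))
```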