Reproducible code for DLfM 2018 paper: An extended jingju solo singing voice dataset and its application on automatic assessment of singing pronunciation and overall quality at phoneme-level
Download the three datasets: nacta, nacta_2017, and primary school. At minimum, download the wav files and the annotations in TextGrid format.
Change the root paths in src.filePath.py to your local paths of these three datasets.
Change data_path_phone_embedding_model in src.filePath.py to where you want to store the extracted features.
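For orientation, the relevant part of src.filePath.py might look like the sketch below; the dataset root variable names are hypothetical (only data_path_phone_embedding_model is named in this README).

```python
# src/filePath.py (sketch; the dataset root variable names are hypothetical)
nacta_path = '/your/local/path/nacta'
nacta2017_path = '/your/local/path/nacta_2017'
primary_school_path = '/your/local/path/primary_school'

# where the extracted features will be stored (variable name from this README)
data_path_phone_embedding_model = '/your/local/path/phone_embedding_features'
```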
Run
python ./dataCollection/trainingSampleCollectionPhoneEmbedding.py
to extract the log-mel features for training the embedding models.
You can also skip this step by directly downloading the extracted log-mel features log-mel-scaler-keys-label-encoder.zip from the Zenodo page, then unzipping it into your local data_path_phone_embedding_model.
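For reference, log-mel features of this kind are usually computed as in the sketch below; the sample rate, number of mel bands, and file name are assumptions, not the exact parameters used by the extraction script.

```python
# Illustrative log-mel computation (parameters are assumptions)
import numpy as np
import librosa

y, sr = librosa.load('phoneme_sample.wav', sr=44100)         # hypothetical input file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)  # mel-band spectrogram
log_mel = np.log(mel + 1e-10)                                # log compression
```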
When the train data is ready, run one of the scripts below to train the pronunciation or overall quality embedding models (an example invocation follows the option list):
- train pronunciation embedding classification model
python ./training_scripts/embedding_model_train_pronunciation.py -d <string: train_data_path> -o <string: model_output_path> -e <string: experiment>
- train overall quality embedding classification model
python ./training_scripts/embedding_model_train_overall_quality.py -d <string: train_data_path> -o <string: model_output_path> -e <string: experiment>
- -d <string: train_data_path>: path containing the features, scaler, feature dictionary keys, and the train/validation split file
- -o <string: model_output_path>: where to save the output model
- -e <string: experiment>: 'baseline', 'attention', 'dense', 'cnn', '32_embedding', 'dropout', 'best_combination'
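For example, to train the pronunciation model with the best_combination experiment (paths are placeholders):
python ./training_scripts/embedding_model_train_pronunciation.py -d ./data/phone_embedding -o ./models -e best_combination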
When a classification embedding model has been trained, run the scripts below to get the evaluation results on the validation or test set (an example invocation follows the option list). Alternatively, you can download the pretrained embedding models pretrained_embedding_models.zip from the Zenodo page, then unzip them into your embedding model path.
- evaluate pronunciation embeddings
python ./evaluation/eval_embedding_pronunciation.py -d <string: val_test_data_path> -v <string: val_or_test> -e <string: experiment> -o <string: result_output_path> -m <string: model_path>
- evaluate overall quality embeddings
python ./evaluation/eval_embedding_overall_quality.py -d <string: val_test_data_path> -v <string: val_or_test> -e <string: experiment> -o <string: result_output_path> -m <string: model_path>
- -d <string: val_test_data_path>: path containing the features, scaler, and feature dictionary keys
- -v <string: val_or_test>: "val" for the validation set or "test" for the test set
- -e <string: experiment>: 'baseline', 'attention', 'dense', 'cnn', '32_embedding', 'dropout', 'best_combination'
- -o <string: result_output_path>: where to save the result csv
- -m <string: model_path>: embedding model path
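For example, to evaluate the pronunciation embeddings on the test set (paths are placeholders):
python ./evaluation/eval_embedding_pronunciation.py -d ./data/phone_embedding -v test -e best_combination -o ./results -m ./models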
The goal of the ANOVA feature analysis is to find the most discriminative individual features for separating the professional, amateur train and validation, and amateur test phoneme samples.
As in the previous section, download the nacta, nacta_2017, and primary school datasets (at minimum the wav files and the TextGrid annotations), and change the root paths in src.filePath.py to your local paths of these three datasets.
Change phn_wav_path in src.filePath.py to where you want to store the phoneme-level wav files locally; they will be used for the ANOVA feature analysis.
Run
python ./dataCollection/phoneme_wav_sample_collection.py
to extract phoneme-level wav files for the ANOVA feature analysis.
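Conceptually, this step cuts each recording into phoneme-level wav files using the TextGrid annotations. A minimal sketch, assuming the textgrid and soundfile packages and a phoneme tier name (the repo's script may differ):

```python
# Cut phoneme-level wav segments out of a recording using its TextGrid annotation.
# File names, the tier name, and the packages used here are assumptions.
import textgrid
import soundfile as sf

audio, sr = sf.read('recording.wav')
tg = textgrid.TextGrid.fromFile('recording.TextGrid')

for tier in tg.tiers:
    if tier.name != 'phoneme':            # assumed annotation tier name
        continue
    for i, interval in enumerate(tier.intervals):
        if not interval.mark:             # skip unlabeled (silent) intervals
            continue
        start = int(interval.minTime * sr)
        end = int(interval.maxTime * sr)
        sf.write('phn_%04d_%s.wav' % (i, interval.mark), audio[start:end], sr)
```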
Then run
python ./ANOVA_exp/freesound_feature_extraction.py
to extract acoustic features using the Freesound extractor. This step requires Essentia to be installed; please check this link for Essentia installation details.
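For reference, Essentia exposes this extractor as FreesoundExtractor; a minimal sketch (the repo's script and extractor options may differ):

```python
# Illustrative use of Essentia's FreesoundExtractor (input file name is a placeholder)
from essentia.standard import FreesoundExtractor

# returns aggregated statistics and frame-wise values for many acoustic descriptors
features, features_frames = FreesoundExtractor()('phoneme_sample.wav')
print(features.descriptorNames())
```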
You can also skip the previous steps by directly downloading the acoustic features anova_analysis_essentia_feature.zip from the Zenodo page, then unzipping it into your local phn_wav_path.
Finally, run
python ./ANOVA_exp/anova_calculation.py
to calculate the ANOVA F-values, sort the features by F-value, and plot the feature distributions.
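The core computation is a one-way ANOVA per feature across the three groups; a minimal sketch using scipy, with random placeholder data instead of the actual feature values:

```python
# One-way ANOVA F-value for a single acoustic feature across the three groups.
# The arrays below are random placeholders, not the datasets' actual values.
import numpy as np
from scipy.stats import f_oneway

professional = np.random.randn(200)
amateur_train_val = np.random.randn(200) + 0.3
amateur_test = np.random.randn(200) + 0.6

f_value, p_value = f_oneway(professional, amateur_train_val, amateur_test)
print(f_value, p_value)  # a higher F-value means the feature separates the groups better
```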
First, follow the step Phoneme embedding training feature extraction above to extract log-mel features for the professional and amateur recordings.
Run
python ./tsne_embedding_extractor.py -d <string: train_data_path> -m <string: embedding_model_path> -o <string: embedding_output_path> --dense <bool>
to calculate the classification model embeddings for the overall quality aspect (an example invocation follows the option list).
- -d <string: train_data_path>: path containing the features, scaler, and feature dictionary keys
- -m <string: embedding_model_path>: embedding model path
- -o <string: embedding_output_path>: output calculated embeddings path
- --dense <bool>: whether to use the 32-dimensional embedding
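For example (paths are placeholders, and the boolean value format for --dense is an assumption):
python ./tsne_embedding_extractor.py -d ./data/phone_embedding -m ./models -o ./eval/phone_embedding_classifier --dense True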
We provide the precomputed embeddings in the ./eval/phone_embedding_classifier path, so you can skip the previous step.
Run
python ./tsne_plot.py -e <string: input_embedding_path> --dense <bool>
to plot the t-SNE visualization for each phoneme.
- -e <string: input_embedding_path>: embedding path
- --dense <bool>: whether to use the 32-dimensional embedding
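For reference, a t-SNE projection of such embeddings can be produced with scikit-learn; a minimal sketch (the input file name and t-SNE parameters are assumptions, not the repo's plotting code):

```python
# Illustrative t-SNE projection of phoneme embeddings (input file name is hypothetical)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.load('phoneme_embeddings.npy')  # shape: (n_samples, embedding_dim)
points = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1], s=5)
plt.title('t-SNE of phoneme embeddings')
plt.show()
```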