Train your own speaker embedding model here, or use my pre-trained LstmDV model.
The pre-trained model was trained on 913 speakers with 53 utterances each. Download the dataset from OpenSLR (train-clean-360.tar.gz) and ignore speakers with fewer than 50 utterances; model performance was tested with 40 speakers from the VCTK dataset (a filtering sketch follows the results table below).
| Model | LstmDV | MetaDV |
|---|---|---|
| EER | 3% | 2% |
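If you reproduce the training set, the speaker filtering described above amounts to a short script. Below is a minimal sketch, assuming the standard LibriSpeech layout (`speaker/chapter/*.flac`); `filter_speakers` is a hypothetical helper, not part of this repo:

```python
from pathlib import Path

def filter_speakers(root: str, min_utterances: int = 50) -> list[str]:
    """Return the ids of speakers with at least `min_utterances` utterances."""
    kept = []
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue
        # LibriSpeech nests utterances under chapter subdirectories.
        n_utts = sum(1 for _ in speaker_dir.glob("*/*.flac"))
        if n_utts >= min_utterances:
            kept.append(speaker_dir.name)
    return kept

speakers = filter_speakers("LibriSpeech/train-clean-360")
print(f"{len(speakers)} speakers kept")
```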
- Put your speaker embedding model in `./model/static/model.pt`.
- Run make_spec.ipynb and make_metadata.ipynb with your data arranged in the following layout (a sketch of the spectrogram step follows this list):
  ```
  .
  ├── model
  │   └── static
  │       └── model.pt
  └── make_data
      ├── factory
      │   └── wavs
      │       ├── 225   (contains many audio files)
      │       ├── 226
      │       └── ...
      ├── make_metadata.ipynb
      └── make_spec.ipynb
  ```
- After that you will get a ./spmel folder (default name) and a train.pkl; copy ./spmel to the root directory.
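For orientation, the spectrogram step converts each wav under `make_data/factory/wavs/` into a saved mel-spectrogram array. Here is a minimal sketch with librosa, assuming 16 kHz audio and 80 mel bins (typical AutoVC-style settings, not necessarily the notebook's exact parameters); `wav_to_mel` and the example file path are hypothetical:

```python
import os
import numpy as np
import librosa

def wav_to_mel(path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    """Load an audio file and return a (frames, n_mels) log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels)
    # Log-scale and roughly normalize to [0, 1]; the notebook's exact
    # normalization may differ.
    log_mel = np.log10(np.maximum(1e-5, mel)).T
    return np.clip((log_mel + 5.0) / 5.0, 0.0, 1.0)

os.makedirs("spmel", exist_ok=True)
mel = wav_to_mel("make_data/factory/wavs/225/example.wav")  # hypothetical file
np.save("spmel/225_example.npy", mel)
```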
Three model variants are available for `--model_name`:
- AutoVC (original implementation)
- MetaPool
- MetaConv
Train without or with a discriminator:

```bash
python train.py --model_name=AutoVC --data_dir=spmel --save_model_name=model_name
python train_with_discriminator.py --model_name=AutoVC --data_dir=spmel --save_model_name=model_name
```
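Before launching training, it can help to sanity-check train.pkl. A minimal sketch, assuming an AutoVC-style metadata layout of one list per speaker (`[speaker_id, speaker_embedding, mel_path, ...]`); adjust the path if your train.pkl sits inside `./spmel`:

```python
import pickle

with open("train.pkl", "rb") as f:
    metadata = pickle.load(f)

# Assumed layout per entry: [speaker_id, speaker_embedding, mel_path, ...]
for entry in metadata[:3]:
    speaker_id, embedding = entry[0], entry[1]
    print(speaker_id, getattr(embedding, "shape", None), len(entry) - 2, "utterances")
```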
## Work in Progress