
SSR-Speech

Paper | Mandarin Models | English Models | Demo page

Official PyTorch implementation of the paper: SSR-Speech: Towards Stable, Safe and Robust Zero-shot Speech Editing and Synthesis.

⭐ Work done during an internship at Tencent AI Lab

TODO

  • Release English model weights
  • Release Mandarin model weights
  • HuggingFace Spaces demo
  • Fix gradio app
  • arXiv paper
  • WhisperX forced alignment
  • ASR to automatically transcribe the prompt for TTS
  • Simplify the inference stage

Environment setup

conda create -n ssr python=3.9.16
conda activate ssr

pip install git+https://github.com/WangHelin1997/SSR-Speech.git#subdirectory=audiocraft
pip install xformers==0.0.22
pip install torchaudio torch
apt-get install ffmpeg
apt-get install espeak-ng
pip install tensorboard==2.16.2
pip install phonemizer==3.2.1
pip install datasets==2.16.0
pip install torchmetrics==0.11.1
pip install huggingface_hub==0.22.2

# only needed for inference
pip install gradio==3.50.2
pip install "nltk>=3.8.1"
pip install "openai-whisper>=20231117"
pip install whisperx==3.1.5
pip install faster-whisper==1.0.1
pip install num2words==0.5.13
pip install opencc-python-reimplemented
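
Optionally, sanity-check the environment with a quick import test (a convenience snippet, not part of the repo):

# Optional check; every package here was installed above.
import torch
import torchaudio
import phonemizer  # its backend also needs the espeak-ng system package

print(torch.__version__, torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())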

Pretrained Models

Download our pretrained models from Hugging Face. We provide a Watermark Encodec model, an English model pretrained on the GigaSpeech XL set, and a Mandarin model pretrained on 25,000 hours of internal data.

After downloading the files, place them under this repo, like:

SSR-Speech/
    -data/
    -demo/
    -pretrained_models/
    ....
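
If you prefer to script the download, here is a minimal sketch using the pinned huggingface_hub version; the repo_id below is a hypothetical placeholder, so substitute the actual repo from the model pages linked above (the filenames match the inference commands below):

# Minimal sketch with huggingface_hub (pinned above). repo_id is a
# hypothetical placeholder; use the real repo from the links at the top.
from huggingface_hub import hf_hub_download

for filename in ["English.pth", "wmencodec.th"]:
    hf_hub_download(
        repo_id="WangHelin1997/SSR-Speech-English",  # placeholder
        filename=filename,
        local_dir="./pretrained_models",
    )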

Inference examples

To test English speech editing, run:

python inference_v2.py  \
    --seed 2024 \
    --sub_amount 0.12 \
    --aug_text \
    --use_watermark \
    --language 'en' \
    --model_path "./pretrained_models/English.pth" \
    --codec_path "./pretrained_models/wmencodec.th" \
    --orig_audio "./demo/84_121550_000074_000000.wav" \
    --target_transcript "But when I saw the mirage of the lake in the distance, which the sense deceives, Lost not by distance any marks," \
    --temp_folder "./demo/temp" \
    --output_dir "./demo/generated_se" \
    --savename "84_121550_000074_00000" \
    --whisper_model_name "base.en"

To test English zero-shot TTS, run:

python inference_v2.py  \
    --seed 2024 \
    --tts \
    --aug_text \
    --use_watermark \
    --language 'en' \
    --model_path "./pretrained_models/English.pth" \
    --codec_path "./pretrained_models/wmencodec.th" \
    --orig_audio "./demo/5895_34622_000026_000002.wav" \
    --prompt_length 3 \
    --target_transcript "I cannot believe that the same model can also do text to speech synthesis too!" \
    --temp_folder "./demo/temp" \
    --output_dir "./demo/generated_tts" \
    --savename "5895_34622_000026_000002" \
    --whisper_model_name "base.en"

To test Mandarin speech editing, run:

python inference_v2.py  \
    --seed 2024 \
    --sub_amount 0.12 \
    --aug_text \
    --use_watermark \
    --language 'zh' \
    --model_path "./pretrained_models/Mandarin.pth" \
    --codec_path "./pretrained_models/wmencodec.th" \
    --orig_audio "./demo/aishell3_test.wav" \
    --target_transcript "食品价格以基本都在一万到两万之间" \
    --temp_folder "./demo/temp" \
    --output_dir "./demo/generated_se" \
    --savename "aishell3_test" \
    --whisper_model_name "base"

To test Mandarin zero-shot TTS, run:

python inference_v2.py  \
    --seed 2024 \
    --tts \
    --aug_text \
    --use_watermark \
    --language 'zh' \
    --model_path "./pretrained_models/Mandarin.pth" \
    --codec_path "./pretrained_models/wmencodec.th" \
    --orig_audio "./demo/aishell3_test.wav" \
    --prompt_length 3 \
    --target_transcript "我简直不敢相信同一个模型也可以进行文本到语音的生成" \
    --temp_folder "./demo/temp" \
    --output_dir "./demo/generated_tts" \
    --savename "aishell3_test" \
    --whisper_model_name "base"

Training

To train an SSR-Speech model, you need to prepare the following parts:

  1. Prepare a JSON file that stores your data in the following format (one entry per utterance, with its transcript); a sketch for generating such entries follows the example:
{
    "segment_id": "YOU1000000012_S0000106",
    "wav": "/data/gigaspeech/wavs/xl/YOU1000000012/YOU1000000012_S0000106.wav",
    "trans": "then you can look at o b s or wirecast as a professional solution then. if you're on a mac and you're looking for a really cheap and easy way to create a professional live stream.",
    "duration": 9.446044921875
}
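
A minimal sketch of how such entries could be produced (not part of the repo; wav_paths and transcripts are placeholders for your own data, torchaudio is used only to read durations, and one JSON object is written per line):

# Minimal sketch: build one manifest entry per utterance.
import json
import torchaudio

wav_paths = ["/data/gigaspeech/wavs/xl/YOU1000000012/YOU1000000012_S0000106.wav"]
transcripts = {"YOU1000000012_S0000106": "then you can look at o b s or wirecast ..."}

with open("manifest.json", "w", encoding="utf-8") as f:
    for wav in wav_paths:
        segment_id = wav.split("/")[-1].removesuffix(".wav")
        info = torchaudio.info(wav)  # reads the header, not the samples
        entry = {
            "segment_id": segment_id,
            "wav": wav,
            "trans": transcripts[segment_id],
            "duration": info.num_frames / info.sample_rate,
        }
        f.write(json.dumps(entry) + "\n")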
  2. Encode the utterances into codes using a neural codec, e.g. Encodec. Run:
export CUDA_VISIBLE_DEVICES=0
cd ./data
AUDIO_PATH=''
SAVE_DIR=''
ENCODEC_PATH=''
DATA_NAME=''
python encode.py \
--dataset_name ${DATA_NAME} \
--audiopath ${AUDIO_PATH} \
--save_dir ${SAVE_DIR} \
--encodec_model_path ${ENCODEC_PATH} \
--batch_size 32 \
--start 0 \
--end 10000000

Here, AUDIO_PATH is the directory where the JSON file was saved, SAVE_DIR is where the processed data will be written, ENCODEC_PATH is the path to a pretrained Encodec model, and DATA_NAME is the name under which the dataset is saved. The --start and --end indices shard the utterance list so that several GPUs can process disjoint slices in parallel, as sketched below.
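
For example, a two-GPU run could look like the following sketch (not part of the repo; shard bounds, dataset name, and paths are placeholders):

# Minimal sketch: one encode.py process per GPU, each encoding a disjoint
# [--start, --end) slice of the utterance list.
import os
import subprocess

shards = [(0, 500_000), (500_000, 1_000_000)]  # one (start, end) pair per GPU
procs = []
for gpu_id, (start, end) in enumerate(shards):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(
        ["python", "encode.py",
         "--dataset_name", "mydata",
         "--audiopath", "/path/to/jsons",
         "--save_dir", "/path/to/save",
         "--encodec_model_path", "/path/to/encodec",
         "--batch_size", "32",
         "--start", str(start),
         "--end", str(end)],
        cwd="./data", env=env))
for p in procs:
    p.wait()  # block until every shard finishes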

  3. Convert the transcripts into phoneme sequences. Run:
AUDIO_PATH=''
SAVE_DIR=''
DATA_NAME=''
python phonemize.py \
--dataset_name ${DATA_NAME} \
--dataset_dir ${AUDIO_PATH} \
--save_dir ${SAVE_DIR}

Set language='cmn' at line 47 of phonemize.py when processing Mandarin.

  4. Prepare the manifest (i.e. metadata). Run:
AUDIO_PATH=''
SAVE_DIR=''
DATA_NAME=''
python filemaker.py \
--dataset_name ${DATA_NAME} \
--dataset_dir ${AUDIO_PATH} \
--save_dir ${SAVE_DIR}
  5. Prepare the phoneme set (we name it vocab.txt). Run:
SAVE_DIR=''
DATA_NAME=''
python vocab.py \
--dataset_name ${DATA_NAME} \
--save_dir ${SAVE_DIR}

Now, you are good to start training!

cd ./z_scripts
bash e830M.sh

If your dataset introduces new phonemes (which is very likely) that do not exist in the giga checkpoint, make sure you combine the original phonemes with the phonemes from your data when constructing the vocab, as sketched below. You also need to adjust --text_vocab_size and --text_pad_token so that the former is greater than or equal to your vocab size, and the latter has the same value as --text_vocab_size (i.e. --text_pad_token is always the last token). In our experience, you can set --text_vocab_size to 100 for an English model and 200 for a Mandarin model.
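
A minimal merge sketch, assuming each vocab file stores one phoneme per line (giga_vocab.txt and my_vocab.txt are hypothetical names for the checkpoint vocab and the vocab.py output):

# Minimal sketch: union of checkpoint phonemes and new-dataset phonemes,
# preserving the checkpoint order. File names and format are assumptions.
seen, merged = set(), []
for path in ["giga_vocab.txt", "my_vocab.txt"]:
    with open(path, encoding="utf-8") as f:
        for line in f:
            phone = line.strip()
            if phone and phone not in seen:
                seen.add(phone)
                merged.append(phone)
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(merged))
print(len(merged))  # choose --text_vocab_size >= this count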

Training the Watermarking Encodec

To train the Watermarking Encodec, you need to:

  1. Install our audiocraft package:
cd ./audiocraft
pip install -e .
  2. Prepare the data (for training, validation, and test), e.g.:
python makefile.py
  3. Change the settings in ./audiocraft/config/ to your own and start training:
dora run -d solver='compression/encodec_audiogen_16khz' dset='internal/sounds_16khz'

License

The codebase is under the MIT license. Note that we use some code from other repositories under different licenses: ./models/modules, ./steps/optim.py, and data/tokenizer.py are under the Apache License, Version 2.0; the phonemizer we use is under the GNU GPL v3.0 license.

Acknowledgement

We thank Puyuan Peng for his VoiceCraft.

Citation

@article{wang2024ssrspeech,
  author    = {Wang, Helin and Yu, Meng and Hai, Jiarui and Chen, Chen and Hu, Yuchen and Chen, Rilin and Dehak, Najim and Yu, Dong},
  title     = {SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis},
  journal   = {arXiv},
  year      = {2024},
}
