In our paper, we propose a lightweight speech synthesis alternative that performs real-time speaker anonymization. This repository provides our implementation and pretrained models.
Abstract: Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine-learning-based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that re-synthesizes the speech signal. We present evaluation results from two implementations of our system: a full model that achieves a latency of 230 ms, and a lite version (0.1x in size) that further reduces latency to 66 ms while maintaining state-of-the-art performance in naturalness, intelligibility, and privacy preservation.
Visit our demo website for audio samples.
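The high-level data flow can be pictured roughly as follows. This is a minimal PyTorch sketch for orientation only; the layer choices, dimensions, and module internals are placeholders, not the actual architecture.

```python
import torch
import torch.nn as nn

class AnonymizationAutoencoder(nn.Module):
    """Sketch of the three-branch encoder + decoder layout (placeholder layers only)."""

    def __init__(self, content_dim=128, speaker_dim=192, variance_dim=32):
        super().__init__()
        # Placeholders for the lightweight HuBERT-like content encoder,
        # the pretrained speaker encoder, and the pitch/energy variance encoder.
        self.content_encoder = nn.Conv1d(1, content_dim, kernel_size=400, stride=320)
        self.speaker_encoder = nn.Sequential(
            nn.Conv1d(1, speaker_dim, kernel_size=400, stride=320),
            nn.AdaptiveAvgPool1d(1),
        )
        self.variance_encoder = nn.Conv1d(2, variance_dim, kernel_size=1)
        self.decoder = nn.ConvTranspose1d(
            content_dim + speaker_dim + variance_dim, 1, kernel_size=400, stride=320
        )

    def forward(self, wav, pitch, energy):
        # wav: (B, 1, samples); pitch/energy: (B, frames), aligned with the content frames
        content = self.content_encoder(wav)                                    # linguistic content
        speaker = self.speaker_encoder(wav).expand(-1, -1, content.size(-1))   # utterance-level identity
        variance = self.variance_encoder(torch.stack([pitch, energy], dim=1))  # prosody (pitch, energy)
        return self.decoder(torch.cat([content, speaker, variance], dim=1))    # re-synthesized waveform
```

At synthesis time, anonymization then amounts to conditioning the decoder on a speaker representation other than the source speaker's; see the paper for the exact mechanism.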
- Python >= 3.10
- Clone this repository.
- Install the Python requirements listed in `requirements.txt`, e.g. `pip install -r requirements.txt`.
- Download and extract the LibriTTS dataset, and move all wav files into the `data` folder.
- Convert all files to waveform and rearrange the `data` folder to look like the tree below (see the sketch after this list for one way to do this):
  ```
  data
  ├── metadata
  └── speakers                 # Folder containing all speakers
      ├── spkr1
      ├── spkr2
      │   └── wav              # Folder containing all wav files
      │       ├── file1.wav
      │       ├── file2.wav
      │       └── file3.wav
      ├── spkr3
      └── spkr4
  ```
- Build a similar structure separately for the validation or test data.
- Download the HuBERT Base checkpoint from here and place it under the `pretrained_models/hubert` folder.
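If you build the folder structure with a script, something along the following lines could work. This is a sketch only; the LibriTTS subset path, the `spkr<ID>` folder naming, and the 16 kHz target sample rate are assumptions, not requirements taken from this repository.

```python
# Rearrange LibriTTS into the data/speakers/<speaker>/wav/ layout shown above,
# resampling to 16 kHz (the rate HuBERT-based feature extraction typically expects).
from pathlib import Path
import torchaudio

LIBRITTS_ROOT = Path("LibriTTS/train-clean-100")  # assumed source subset
DATA_ROOT = Path("data/speakers")                 # target layout from this README
TARGET_SR = 16_000                                # assumed sample rate

for wav_path in LIBRITTS_ROOT.rglob("*.wav"):
    # LibriTTS file names look like <speaker>_<chapter>_<utt>_<seg>.wav
    speaker_id = wav_path.stem.split("_")[0]
    out_dir = DATA_ROOT / f"spkr{speaker_id}" / "wav"
    out_dir.mkdir(parents=True, exist_ok=True)

    waveform, sr = torchaudio.load(wav_path)
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    torchaudio.save(str(out_dir / wav_path.name), waveform, TARGET_SR)
```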
To preprocess the data, run:

```
./data_preprocess.sh
```

To preprocess data at a different path, edit `data_preprocess.sh` accordingly; specifically, change the `data` argument to point to your data folder. The script needs to be run separately for the train and validation folders.
After running the data preprocessing, your `data` directory should look like:
```
data
├── metadata
└── speakers
    ├── spkr1
    ├── spkr2
    │   ├── code
    │   │   ├── file1.km
    │   │   └── file2.km
    │   ├── energy
    │   │   ├── file1.eng.npy
    │   │   └── file2.eng.npy
    │   ├── pitch
    │   │   ├── file1.pit.npy
    │   │   └── file2.pit.npy
    │   ├── spkr
    │   │   ├── file1.spk.npy
    │   │   └── file2.spk.npy
    │   └── wav
    │       ├── file1.wav
    │       └── file2.wav
    ├── spkr3
    └── spkr4
```
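To spot-check the preprocessing output, you can load a few of the generated files. The snippet below is a sketch that assumes the `.npy` files hold NumPy arrays (frame-level pitch and energy, plus an utterance-level speaker embedding) and that the `.km` file holds space-separated discrete HuBERT unit indices, as in fairseq-style unit extraction; the exact formats may differ.

```python
import numpy as np

base = "data/speakers/spkr2"

pitch = np.load(f"{base}/pitch/file1.pit.npy")    # assumed: frame-level F0 contour
energy = np.load(f"{base}/energy/file1.eng.npy")  # assumed: frame-level energy
spk = np.load(f"{base}/spkr/file1.spk.npy")       # assumed: speaker embedding vector

with open(f"{base}/code/file1.km") as f:
    units = [int(u) for u in f.read().split()]    # assumed: discrete content units

print(pitch.shape, energy.shape, spk.shape, len(units))
```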
- Pretrain only the `Encoder`:

  ```
  python train_encoder.py -p experiments/base/encoder -c experiments/base/config.json
  ```

- Train the `Encoder` and `Decoder` together:

  ```
  python train.py -p experiments/base -c experiments/base/config.json
  ```
You can also use the pretrained models we provide: download the pretrained models and place the `base` and `lite` checkpoints under `experiments/base` and `experiments/lite`, respectively.
See `inference.ipynb`.
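For orientation, anonymizing a single file roughly follows the pattern below. Note that `load_model` and `anonymize` are hypothetical placeholders, not the repository's real API; refer to `inference.ipynb` for the actual entry points.

```python
import torch
import torchaudio

# Hypothetical sketch: load_model and model.anonymize are placeholders,
# not the repository's real API -- see inference.ipynb for the actual calls.
waveform, sr = torchaudio.load("input.wav")     # source utterance
model = load_model("experiments/lite")          # hypothetical loader for the lite checkpoint

with torch.no_grad():
    anonymized = model.anonymize(waveform, sr)  # hypothetical: re-synthesize with concealed identity

torchaudio.save("anonymized.wav", anonymized, sr)
```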
We referred to HiFi-GAN, fairseq, and SpeechBrain when implementing this.