Implementation of our paper Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments. WASE is the first to explicitly model the start/end time of speech (onset/offset cues) in the speaker extraction problem.
WASE is adapted from our previously proposed framework, which includes five modules: a voiceprint encoder, an onset/offset detector, a speech encoder, a speech decoder, and a speaker extraction module.
In this work, we focus on the onset/offset cues of speech and verify their effectiveness in the speaker extraction task. We also combine the onset/offset cues with the voiceprint cue: onset/offset cues model the start/end time of speech, while the voiceprint cue models the speaker's voice characteristics. Combining the two perceptual cues brings a significant performance improvement, while the number of extra parameters required is negligible. Please see the figure below for the detailed model structure.
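For orientation, here is a very rough PyTorch skeleton of how the five modules could be wired together. It is purely illustrative: the module names, layer choices, and shapes are assumptions, not the definitions in ./models.

```python
import torch
import torch.nn as nn

class WASESkeleton(nn.Module):
    """Illustrative layout of the five modules; not the repo's actual code."""

    def __init__(self, n_filters=256):
        super().__init__()
        # Hypothetical stand-ins for the real modules defined in ./models
        self.speech_encoder = nn.Conv1d(1, n_filters, kernel_size=20, stride=10)
        self.voiceprint_encoder = nn.GRU(n_filters, n_filters, batch_first=True)
        self.onset_offset_detector = nn.Conv1d(n_filters, 2, kernel_size=3, padding=1)
        self.extractor = nn.Conv1d(2 * n_filters, n_filters, kernel_size=1)
        self.speech_decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size=20, stride=10)

    def forward(self, mixture, reference):
        # mixture, reference: (batch, 1, samples)
        mix_feat = self.speech_encoder(mixture)              # (B, F, T)
        ref_feat = self.speech_encoder(reference)            # (B, F, T_ref)
        _, vp = self.voiceprint_encoder(ref_feat.transpose(1, 2))
        vp = vp[-1].unsqueeze(-1).expand(-1, -1, mix_feat.size(-1))  # voiceprint cue, broadcast over time
        onset_offset = self.onset_offset_detector(mix_feat)  # (B, 2, T): when-to-attend cues
        gated = self.extractor(torch.cat([mix_feat, vp], dim=1)) * torch.sigmoid(onset_offset[:, :1])
        return self.speech_decoder(gated), onset_offset
```

This skeleton only shows the data flow between the five modules; the real extractor combines the cues as described in the paper.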
- Python 3.7
- PyTorch 1.0.1
- pysoundfile 0.10.2
- librosa 0.7.2
- Please refer to environment.yml for details.
The training samples are generated by randomly selecting utterances of different speakers from the si_tr_s set of WSJ0 and mixing them at various signal-to-noise ratios (SNRs); a minimal sketch of this mixing step is given at the end of this section. The evaluation samples are generated from the fixed list ./data/wsj/mix_2_spk_voiceP_tt_WSJ.txt. Please modify the dataset paths in ./data/preparedata.py according to your actual paths:
data_config['speechWavFolderList'] = ['/home/aa/WSJ/wsj0/si_tr_s/']
data_config['spk_test_voiceP_path'] = './data/wsj/mix_2_spk_voiceP_tt_WSJ.txt'
You may need the command below to modify the evaluation data path:
sed -i 's/home\/aa/YOUR PATH/g' data/wsj/mix_2_spk_voiceP_tt_WSJ.txt
We advise you to use the pickle files, which can speed up experiments by saving the time spent on resampling. You can change the pickle paths to any location you like:
data_config['train_sample_pickle'] = '/data1/haoyunzhe/interspeech/dataset/wsj0_pickle/train_sample.pickle'
data_config['test_sample_pickle'] = '/data1/haoyunzhe/interspeech/dataset/wsj0_pickle/test_sample.pickle'
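For reference, the random-mixing step described above can be sketched roughly as follows. This is an illustrative snippet, not the code in ./data/preparedata.py: the file paths, target sample rate, SNR range, and helper names are all assumptions.

```python
import pickle
import random

import numpy as np
import soundfile as sf
import librosa

def mix_at_snr(target, interferer, snr_db):
    """Scale the interferer so the pair is mixed at the requested SNR."""
    length = min(len(target), len(interferer))
    target, interferer = target[:length], interferer[:length]
    power_t = np.mean(target ** 2)
    power_i = np.mean(interferer ** 2) + 1e-8
    scale = np.sqrt(power_t / (power_i * 10 ** (snr_db / 10)))
    return target + scale * interferer

# Hypothetical WSJ0 utterance paths; replace with files from si_tr_s.
wav_a, sr_a = sf.read('/path/to/wsj0/si_tr_s/spk1/utt1.wav')
wav_b, sr_b = sf.read('/path/to/wsj0/si_tr_s/spk2/utt2.wav')

# Resample once up front (this is the step the pickle cache avoids repeating).
sample_rate = 8000  # assumed target rate
wav_a = librosa.resample(wav_a, orig_sr=sr_a, target_sr=sample_rate)
wav_b = librosa.resample(wav_b, orig_sr=sr_b, target_sr=sample_rate)

mixture = mix_at_snr(wav_a, wav_b, snr_db=random.uniform(-5, 5))

# Cache the prepared sample so later runs skip the resampling step.
with open('train_sample.pickle', 'wb') as f:
    pickle.dump({'mixture': mixture, 'target': wav_a[:len(mixture)]}, f)
```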
Simply run this command:
python eval.py
This will load the model onset_offset_voiceprint.pt and verify its performance. It takes about an hour.
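If you want to inspect the released checkpoint before running the evaluation, a quick PyTorch check looks like this (a generic sketch; the checkpoint layout and the 'model' wrapper key are assumptions):

```python
import torch

# Quick sanity check of the released checkpoint before running the evaluation.
state = torch.load('onset_offset_voiceprint.pt', map_location='cpu')

# A checkpoint is usually either a bare state_dict or a dict wrapping one;
# the wrapper key name 'model' is an assumption.
state_dict = state.get('model', state) if isinstance(state, dict) else state

total = sum(t.numel() for t in state_dict.values() if torch.is_tensor(t))
print(f'{len(state_dict)} tensors, {total / 1e6:.2f}M parameters')
```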
P.S. The default setting uses both the onset/offset and voiceprint cues. If you want to train a model with only the onset/offset cues or only the voiceprint cue, please modify the parameters in config.yaml:
ONSET: 1
OFFSET: 1
VOICEPRINT: 1
To train a model, run:
python train.py
To evaluate it, run:
python eval.py
To monitor training with TensorBoard, run:
tensorboard --logdir ./log
- Listen to audio samples at ./assets/demo.
- Spectrogram samples (clean/mixture/prediction).
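To plot spectrogram comparisons like these yourself, a minimal librosa/matplotlib sketch is shown below; the file names under ./assets/demo are assumptions.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical demo file names; point these at files under ./assets/demo.
files = {'clean': 'clean.wav', 'mixture': 'mixture.wav', 'prediction': 'prediction.wav'}

plt.figure(figsize=(15, 4))
for i, (title, path) in enumerate(files.items(), start=1):
    wav, sr = librosa.load(path, sr=None)
    spec = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
    plt.subplot(1, len(files), i)
    librosa.display.specshow(spec, sr=sr, x_axis='time', y_axis='hz')
    plt.title(title)
plt.tight_layout()
plt.savefig('spectrograms.png')
```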
| Methods | #Params | SDRi(dB) |
|---|---|---|
| SBF-MTSAL | 19.3M | 7.30 |
| SBF-MTSAL-Concat | 8.9M | 8.39 |
| SpEx | 10.8M | 14.6 |
| SpEx+ | 13.3M | 17.2 |
| WASE (onset/offset + voiceprint) | 7.5M | 17.05 |
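SDRi in the table is the SDR improvement: the SDR of the extracted speech minus the SDR of the unprocessed mixture, both measured against the clean reference. Published numbers are typically computed with BSS-Eval; the snippet below only illustrates the idea with a simple energy-ratio SDR on synthetic signals.

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Plain energy-ratio SDR in dB (a simplification of BSS-Eval SDR)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-8))

def sdr_improvement(clean, mixture, prediction):
    """SDRi: SDR of the model output minus SDR of the input mixture."""
    return simple_sdr(clean, prediction) - simple_sdr(clean, mixture)

# Toy example with synthetic signals.
rng = np.random.default_rng(0)
clean = rng.standard_normal(8000)
mixture = clean + 0.5 * rng.standard_normal(8000)
prediction = clean + 0.1 * rng.standard_normal(8000)
print(f'SDRi: {sdr_improvement(clean, mixture, prediction):.2f} dB')
```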
If you want to reproduce the results above, you need to decay the learning rate and freeze the voiceprint encoder via config.yaml when the model is close to convergence:
FREEZE_VOICEPRINT: 0
learning_rate: 0.001
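In PyTorch, freezing a module and decaying the learning rate typically amounts to something like the following generic sketch. The submodule name voiceprint_encoder, the scheduler, and the decay factor are assumptions; in this repo the behavior is controlled through the config.yaml entries above.

```python
import torch
import torch.nn as nn

# Stand-in model with a named voiceprint-encoder submodule (hypothetical layout).
model = nn.ModuleDict({
    'voiceprint_encoder': nn.Linear(256, 256),
    'extractor': nn.Linear(256, 256),
})

# Freeze the voiceprint encoder so only the remaining modules keep training.
for param in model['voiceprint_encoder'].parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=0.001)

# Halve the learning rate every 10 epochs once convergence slows (factor is an assumption).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(20):
    # ... run one training epoch on the mixture/target pairs here ...
    scheduler.step()
```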
If you find this repo helpful, please consider citing:
@article{hao2020wase,
title={Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments},
author={Hao, Yunzhe and Xu, Jiaming and Zhang, Peng and Xu, Bo}
}
@article{hao2020unified,
title={A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments},
author={Hao, Yunzhe and Xu, Jiaming and Shi, Jing and Zhang, Peng and Qin, Lei and Xu, Bo},
journal={Proc. Interspeech 2020},
pages={1431--1435},
year={2020}
}
For commercial use of this code and models, please contact: Yunzhe Hao (haoyunzhe2017@ia.ac.cn).
This repository contains code adapted/copied from the following:
- ./models/tasnet.py from Conv-TasNet (CC BY-NC-SA 3.0);
- ./models/tcn.py from Conv-TasNet (CC BY-NC-SA 3.0).




