Implementation of our paper Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments. WASE is the first to explicitly model the start/end time of speech (onset/offset cues) in the speaker extraction problem.
WASE is adapted from our previously proposed framework, which includes five modules: a voiceprint encoder, an onset/offset detector, a speech encoder, a speech decoder, and a speaker extraction module.
In this work, we focus on the onset/offset cues of speech and verify their effectiveness in the speaker extraction task. We also combine the onset/offset cues with the voiceprint cue: onset/offset cues model the start/end time of speech, while the voiceprint cue models the speaker's voice characteristics. Combining the two perceptual cues brings a significant performance improvement, while the number of extra parameters required is negligible. Please see the figure below for the detailed model structure.
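For orientation, here is a very rough PyTorch skeleton of how the five modules could be wired together. It is purely illustrative: the module names, layer choices, and shapes are assumptions, not the definitions in ./models.

```python
import torch
import torch.nn as nn

class WASESkeleton(nn.Module):
    """Illustrative layout of the five modules; not the repo's actual code."""

    def __init__(self, n_filters=256):
        super().__init__()
        # Hypothetical stand-ins for the real modules defined in ./models
        self.speech_encoder = nn.Conv1d(1, n_filters, kernel_size=20, stride=10)
        self.voiceprint_encoder = nn.GRU(n_filters, n_filters, batch_first=True)
        self.onset_offset_detector = nn.Conv1d(n_filters, 2, kernel_size=3, padding=1)
        self.extractor = nn.Conv1d(2 * n_filters, n_filters, kernel_size=1)
        self.speech_decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size=20, stride=10)

    def forward(self, mixture, reference):
        # mixture, reference: (batch, 1, samples)
        mix_feat = self.speech_encoder(mixture)              # (B, F, T)
        ref_feat = self.speech_encoder(reference)            # (B, F, T_ref)
        _, vp = self.voiceprint_encoder(ref_feat.transpose(1, 2))
        vp = vp[-1].unsqueeze(-1).expand(-1, -1, mix_feat.size(-1))  # voiceprint cue, broadcast over time
        onset_offset = self.onset_offset_detector(mix_feat)  # (B, 2, T): when-to-attend cues
        gated = self.extractor(torch.cat([mix_feat, vp], dim=1)) * torch.sigmoid(onset_offset[:, :1])
        return self.speech_decoder(gated), onset_offset
```

This skeleton only shows the data flow between the five modules; the real extractor combines the cues as described in the paper.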
- Python 3.7
- PyTorch 1.0.1
- pysoundfile 0.10.2
- librosa 0.7.2
- Please refer to environment.yml for details.
The training samples are generated by randomly selecting utterances of different speakers from the si_tr_s set of WSJ0 and mixing them at various signal-to-noise ratios (SNRs); a minimal sketch of this mixing step is given at the end of this section. The evaluation samples are generated from the fixed list ./data/wsj/mix_2_spk_voiceP_tt_WSJ.txt. Please modify the dataset paths in ./data/preparedata.py according to your actual paths:
data_config['speechWavFolderList'] = ['/home/aa/WSJ/wsj0/si_tr_s/']
data_config['spk_test_voiceP_path'] = './data/wsj/mix_2_spk_voiceP_tt_WSJ.txt'
You may need the command below to modify the evaluation data path:
sed -i 's/home\/aa/YOUR PATH/g' data/wsj/mix_2_spk_voiceP_tt_WSJ.txt
We advise you to use the pickle files, which can speed up experiments by saving the time spent on resampling. You can change the pickle paths to any location you like:
data_config['train_sample_pickle'] = '/data1/haoyunzhe/interspeech/dataset/wsj0_pickle/train_sample.pickle'
data_config['test_sample_pickle'] = '/data1/haoyunzhe/interspeech/dataset/wsj0_pickle/test_sample.pickle'
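For reference, the random-mixing step described above can be sketched roughly as follows. This is an illustrative snippet, not the code in ./data/preparedata.py: the file paths, target sample rate, SNR range, and helper names are all assumptions.

```python
import pickle
import random

import numpy as np
import soundfile as sf
import librosa

def mix_at_snr(target, interferer, snr_db):
    """Scale the interferer so the pair is mixed at the requested SNR."""
    length = min(len(target), len(interferer))
    target, interferer = target[:length], interferer[:length]
    power_t = np.mean(target ** 2)
    power_i = np.mean(interferer ** 2) + 1e-8
    scale = np.sqrt(power_t / (power_i * 10 ** (snr_db / 10)))
    return target + scale * interferer

# Hypothetical WSJ0 utterance paths; replace with files from si_tr_s.
wav_a, sr_a = sf.read('/path/to/wsj0/si_tr_s/spk1/utt1.wav')
wav_b, sr_b = sf.read('/path/to/wsj0/si_tr_s/spk2/utt2.wav')

# Resample once up front (this is the step the pickle cache avoids repeating).
sample_rate = 8000  # assumed target rate
wav_a = librosa.resample(wav_a, orig_sr=sr_a, target_sr=sample_rate)
wav_b = librosa.resample(wav_b, orig_sr=sr_b, target_sr=sample_rate)

mixture = mix_at_snr(wav_a, wav_b, snr_db=random.uniform(-5, 5))

# Cache the prepared sample so later runs skip the resampling step.
with open('train_sample.pickle', 'wb') as f:
    pickle.dump({'mixture': mixture, 'target': wav_a[:len(mixture)]}, f)
```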
Simply run this command:
python eval.py
This will load the model onset_offset_voiceprint.pt and verify its performance. It takes about an hour.
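If you want to inspect the released checkpoint before running the evaluation, a quick PyTorch check looks like this (a generic sketch; the checkpoint layout and the 'model' wrapper key are assumptions):

```python
import torch

# Quick sanity check of the released checkpoint before running the evaluation.
state = torch.load('onset_offset_voiceprint.pt', map_location='cpu')

# A checkpoint is usually either a bare state_dict or a dict wrapping one;
# the wrapper key name 'model' is an assumption.
state_dict = state.get('model', state) if isinstance(state, dict) else state

total = sum(t.numel() for t in state_dict.values() if torch.is_tensor(t))
print(f'{len(state_dict)} tensors, {total / 1e6:.2f}M parameters')
```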
P.S. The default setting uses both the onset/offset and voiceprint cues. If you want to train a model with only the onset/offset cues or only the voiceprint cue, please modify the parameters in config.yaml:
ONSET: 1
OFFSET: 1
VOICEPRINT: 1
To train a model, run:
python train.py
To evaluate it, run:
python eval.py
To monitor training with TensorBoard, run:
tensorboard --logdir ./log
- Listen to audio samples at ./assets/demo.
- Spectrogram samples (clean/mixture/prediction).
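To plot spectrogram comparisons like these yourself, a minimal librosa/matplotlib sketch is shown below; the file names under ./assets/demo are assumptions.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical demo file names; point these at files under ./assets/demo.
files = {'clean': 'clean.wav', 'mixture': 'mixture.wav', 'prediction': 'prediction.wav'}

plt.figure(figsize=(15, 4))
for i, (title, path) in enumerate(files.items(), start=1):
    wav, sr = librosa.load(path, sr=None)
    spec = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
    plt.subplot(1, len(files), i)
    librosa.display.specshow(spec, sr=sr, x_axis='time', y_axis='hz')
    plt.title(title)
plt.tight_layout()
plt.savefig('spectrograms.png')
```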
| Methods | #Params | SDRi(dB) |
|---|---|---|
| SBF-MTSAL | 19.3M | 7.30 |
| SBF-MTSAL-Concat | 8.9M | 8.39 |
| SpEx | 10.8M | 14.6 |
| SpEx+ | 13.3M | 17.2 |
| WASE (onset/offset + voiceprint) | 7.5M | 17.05 |
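SDRi in the table is the SDR improvement: the SDR of the extracted speech minus the SDR of the unprocessed mixture, both measured against the clean reference. Published numbers are typically computed with BSS-Eval; the snippet below only illustrates the idea with a simple energy-ratio SDR on synthetic signals.

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Plain energy-ratio SDR in dB (a simplification of BSS-Eval SDR)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-8))

def sdr_improvement(clean, mixture, prediction):
    """SDRi: SDR of the model output minus SDR of the input mixture."""
    return simple_sdr(clean, prediction) - simple_sdr(clean, mixture)

# Toy example with synthetic signals.
rng = np.random.default_rng(0)
clean = rng.standard_normal(8000)
mixture = clean + 0.5 * rng.standard_normal(8000)
prediction = clean + 0.1 * rng.standard_normal(8000)
print(f'SDRi: {sdr_improvement(clean, mixture, prediction):.2f} dB')
```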
If you want to reproduce the results above, you need to decay the learning rate and freeze the voiceprint encoder via config.yaml when the model is close to convergence:
FREEZE_VOICEPRINT: 0
learning_rate: 0.001
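In PyTorch, freezing a module and decaying the learning rate typically amounts to something like the following generic sketch. The submodule name voiceprint_encoder, the scheduler, and the decay factor are assumptions; in this repo the behavior is controlled through the config.yaml entries above.

```python
import torch
import torch.nn as nn

# Stand-in model with a named voiceprint-encoder submodule (hypothetical layout).
model = nn.ModuleDict({
    'voiceprint_encoder': nn.Linear(256, 256),
    'extractor': nn.Linear(256, 256),
})

# Freeze the voiceprint encoder so only the remaining modules keep training.
for param in model['voiceprint_encoder'].parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=0.001)

# Halve the learning rate every 10 epochs once convergence slows (factor is an assumption).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(20):
    # ... run one training epoch on the mixture/target pairs here ...
    scheduler.step()
```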
If you find this repo helpful, please consider citing:
@article{hao2020wase,
title={Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments},
author={Hao, Yunzhe and Xu, Jiaming and Zhang, Peng and Xu, Bo}
}
@article{hao2020unified,
title={A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments},
author={Hao, Yunzhe and Xu, Jiaming and Shi, Jing and Zhang, Peng and Qin, Lei and Xu, Bo},
journal={Proc. Interspeech 2020},
pages={1431--1435},
year={2020}
}
For commercial use of this code and models, please contact: Yunzhe Hao (haoyunzhe2017@ia.ac.cn).
This repository contains code adapted/copied from the following:
- ./models/tasnet.py from Conv-TasNet (CC BY-NC-SA 3.0);
- ./models/tcn.py from Conv-TasNet (CC BY-NC-SA 3.0).




