This is the official code for 'Gradient Frequency Attention: Tell Neural Networks where speaker information is.'
The Gradient Frequency Attention mechanism uses weights taken from a trained speaker recognition neural network to tell a new neural network where speaker information lies. The weights are derived from gradient computations, in the spirit of Class Activation Mapping (CAM).
By applying the attention mechanism to Convolutional Neural Networks, we show that it is an effective way to build a low-cost Speaker Verification (SV) model from a wider or deeper neural network model.
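As a rough illustration of the idea (a minimal sketch, not this repo's implementation; shapes and function names are assumed), the frequency weights can be obtained by averaging the absolute input gradients of a trained model over batch and time, and a new model then multiplies its input spectrogram by these weights:

```python
import torch

def frequency_attention_weights(grads: torch.Tensor) -> torch.Tensor:
    """Collapse input gradients of shape (batch, freq, time) into one weight
    per frequency bin by averaging absolute values over batch and time."""
    w = grads.abs().mean(dim=(0, 2))        # (freq,)
    return w * w.numel() / w.sum()          # normalize so the mean weight is 1

def apply_frequency_attention(spec: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Re-weight a (batch, freq, time) spectrogram along the frequency axis."""
    return spec * w.view(1, -1, 1)

# Toy usage: gradients from a trained recognizer would replace this random tensor.
grads = torch.randn(8, 161, 300)
spec = torch.randn(8, 161, 300)
weighted = apply_frequency_attention(spec, frequency_attention_weights(grads))
```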
We use VoxCeleb1 and VoxCeleb2 as datasets. VoxCeleb1 contains 1,211 speakers with 148,642 utterances for training and 40 speakers with 4,870 utterances for testing. For VoxCeleb2, only the development set is used for training; it contains 5,994 speakers with 1,092,009 utterances.
All acoustic features are prepared in the Kaldi way, and Kaldi-style egs are then generated for training. We assume Kaldi is already installed. Data augmentation follows the voxceleb/v2 example in Kaldi: clean utterances are augmented with the MUSAN and RIR datasets, and all augmented utterances are kept for sampling.
```bash
# Make 161-dimensional spectrograms for the dev and test sets.
# VoxCeleb1
for name in dev test; do
  steps/make_spect.sh --write-utt2num-frames true --spect-config conf/spect_161.conf \
    --nj 12 --cmd "$train_cmd" \
    data/vox1/klsp/${name} data/vox1/klsp/${name}/log data/vox1/klsp/spect/${name}
  utils/fix_data_dir.sh data/vox1/klsp/${name}
done

# VoxCeleb2
for name in dev; do
  steps/make_spect.sh --write-utt2num-frames true --spect-config conf/spect_161.conf \
    --nj 12 --cmd "$train_cmd" \
    data/vox2/klsp/${name} data/vox2/klsp/${name}/log data/vox2/klsp/spect/${name}
  utils/fix_data_dir.sh data/vox2/klsp/${name}
done
```
```bash
# Split part of the train set for verification testing.
for name in vox1 vox2; do
  # generate 20,000 verification pairs for the dev set
  python datasets/make_trials.py 20000 data/${name}/klsp/dev
  # rename the trials file
  mv data/${name}/klsp/dev/trials data/${name}/klsp/dev/trials_2w
  python datasets/split_trials_dir.py --data-dir data/${name}/klsp/dev \
    --out-dir data/${name}/klsp/dev/trials_dir \
    --trials trials_2w
done
```
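make_trials.py is specific to this repo; assuming it writes Kaldi-style trial lines of the form `<enroll_utt> <test_utt> <target|nontarget>` (an assumption, not verified against the script), a quick sanity check of the generated pairs could be:

```python
from collections import Counter

# Assumed Kaldi-style format: "<enroll_utt> <test_utt> <target|nontarget>".
counts = Counter()
with open("data/vox1/klsp/dev/trials_2w") as f:
    for line in f:
        enroll_utt, test_utt, label = line.split()
        counts[label] += 1
print(counts)  # expect roughly balanced target / nontarget pairs
```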
```bash
# Make egs for the 161-dimensional spectrograms
./a_prep_egs.sh
```
ResCNN is a CNN with channel block attention blocks for verification, as shown in Figure 1. In our tests, this model outperforms many SV systems on VoxCeleb1 when spectrograms are used as input. Dropout is applied before the average pooling layer.
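The exact block definition lives in the model code; as a hedged sketch, a squeeze-and-excitation style channel attention gate (one common form of channel block attention; the class name and reduction factor below are illustrative) looks like this in PyTorch:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel gate: global average pool -> bottleneck MLP -> sigmoid."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, F, T)
        gate = self.fc(x.mean(dim=(2, 3)))                # (B, C)
        return x * gate.unsqueeze(-1).unsqueeze(-1)       # re-weight channels
```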
TDNN is the neural network described in the x-vector paper.
ECAPA-TDNN is the neural network described in the ECAPA-TDNN paper.
The Additive Angular Margin (AAM, also known as ArcFace) Softmax loss function is adopted in our experiments, with margin 0.2 and scale 30.
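For reference, a minimal sketch of the standard AAM-Softmax (ArcFace) logits with the stated margin m = 0.2 and scale s = 30; this follows the published formulation and is not code copied from this repo:

```python
import torch
import torch.nn.functional as F

def aam_softmax_logits(emb, weight, labels, m=0.2, s=30.0):
    """ArcFace-style logits: add angular margin m to the target-class angle,
    then scale by s. emb: (B, D), weight: (num_classes, D), labels: (B,)."""
    cos = F.linear(F.normalize(emb), F.normalize(weight))   # cos(theta)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, cos.size(1)).bool()
    return s * torch.where(target, torch.cos(theta + m), cos)

# The training loss is then F.cross_entropy(logits, labels).
```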
For all SV systems, embeddings are extracted from the last hidden layer, and cosine similarity is computed to compare pairs of speaker embeddings.
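Scoring a trial then reduces to the cosine similarity between the two embeddings, e.g.:

```python
import torch
import torch.nn.functional as F

def score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    # Cosine similarity in [-1, 1]; higher means more likely the same speaker.
    return F.cosine_similarity(emb_a, emb_b, dim=-1)
```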
Set the stage variable in the c_train.sh script and start training with:

```bash
./c_train.sh
```
By default, during training the script runs validation after every epoch and a verification test once every 4 epochs.
To extract the weights from trained models, run:

```bash
./b_extract.sh
# step 1: python gradients/cam_extract.py     -> extract gradients from enough utterances in the training set
# ...
# step 2: python gradients/visual_gradient.py -> compute and save the mean gradients along the frequency axis
```
The extracted weights should be stored in .
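The scripts above are repo-specific, but conceptually step 1 just needs the gradient of the target-speaker logit with respect to the input; a hedged sketch (the model interface is assumed):

```python
import torch

def input_gradients(model, spec: torch.Tensor, speaker: int) -> torch.Tensor:
    """Gradient of the target-speaker logit w.r.t. one input spectrogram
    (batch of 1); this is the raw quantity that step 2 then averages over
    utterances and time to get one weight per frequency bin."""
    spec = spec.clone().requires_grad_(True)
    logits = model(spec)                  # assumed to return (1, num_speakers)
    logits[0, speaker].backward()
    return spec.grad.detach()             # same shape as spec
```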
We extract saliency maps for input Mel filterbanks (Fbanks) using the Integrated Gradients (IG) method, and plot the top 5% of gradients for ECAPA-TDNN with different widths (channel counts).
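For reference, Integrated Gradients accumulates input gradients along a straight-line path from a baseline to the input Fbanks; a compact sketch (the model interface and the zero baseline are assumptions):

```python
import torch

def integrated_gradients(model, x, speaker, baseline=None, steps=50):
    """IG(x) = (x - x0) * mean over k of grad f(x0 + (k/steps) * (x - x0))."""
    x0 = torch.zeros_like(x) if baseline is None else baseline
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        xk = (x0 + (k / steps) * (x - x0)).requires_grad_(True)
        model(xk)[0, speaker].backward()   # assumed to return (1, num_speakers)
        total += xk.grad
    return (x - x0) * total / steps
```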
We carried out insertion and deletion experiments to compare the performance of saliency mapping methods; the results are plotted in the following figure.
Integrated Gradients (IG) and Expected IG are the most effective saliency mapping methods for speaker verification.
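The protocol follows the usual recipe: rank input bins by saliency, then progressively delete (or insert) the most salient bins while tracking the model score. A hedged sketch of the deletion curve (model interface assumed; a faster score drop indicates a more faithful saliency map):

```python
import torch

def deletion_curve(model, x, speaker, saliency, steps=20):
    """Zero out input cells from most to least salient and record the
    target-speaker logit after each step."""
    order = saliency.flatten().argsort(descending=True)
    xs = x.clone().flatten()
    chunk = len(order) // steps
    scores = []
    for k in range(steps):
        xs[order[k * chunk:(k + 1) * chunk]] = 0.0
        with torch.no_grad():
            scores.append(model(xs.view_as(x))[0, speaker].item())
    return scores
```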
Equal Error Rate (EER) and Minimum Detection Cost Function (MinDCF) are reported here.
- clean: the training set is the clean VoxCeleb1 dev set.
- aug: the training set is the augmented VoxCeleb1 dev set.
- mel: the initial weight is linearly distributed and equal on the Mel scale.
- vox2: the training set is the clean VoxCeleb2 dev set.
| Model | Data Augment | Attention Layer | EER (%) | MinDCF (p=0.01) | MinDCF (p=0.001) |
|---|---|---|---|---|---|
| **Trainset: VoxCeleb1 dev** | | | | | |
| ResCNN-64 | - | - | 3.27 | 0.3078 | 0.4189 |
| ResCNN-64 | + | - | 2.84 | 0.2735 | 0.4051 |
| ResCNN-32 | - | - | 3.66 | 0.3411 | 0.4408 |
| ResCNN-32 | - | mel | 3.43 | 0.3169 | 0.3806 |
| ResCNN-32 | - | clean | 3.27 | 0.3187 | 0.3876 |
| ResCNN-32 | - | aug | 3.27 | 0.3201 | 0.3913 |
| ResCNN-32 | - | vox2 | 3.26 | 0.3032 | 0.4597 |
| ResCNN-16 | - | - | 4.21 | 0.3781 | 0.5214 |
| ResCNN-16 | - | mel | 4.23 | 0.3831 | 0.5475 |
| ResCNN-16 | - | clean | 4.11 | 0.3652 | 0.5319 |
| ResCNN-16 | - | aug | 4.27 | 0.3505 | 0.4622 |
| ResCNN-16 | - | vox2 | 4.09 | 0.3792 | 0.4627 |
| TDNN | - | - | 4.77 | 0.4639 | 0.6023 |
| TDNN-s | - | - | 4.54 | 0.4565 | 0.6230 |
| TDNN-s | - | mel | 4.61 | 0.4556 | 0.5933 |
| TDNN-s | - | clean | 4.52 | 0.4539 | 0.6195 |
| TDNN-s | - | aug | 4.56 | 0.4764 | 0.5762 |
| TDNN-s | - | vox2 | 4.42 | 0.4986 | 0.6446 |
| **Trainset: VoxCeleb2 dev** | | | | | |
| ResCNN-64 | - | - | 1.73 | 0.1562 | 0.2391 |