Pull request johan BUT sre (#326)
* first commit

* Initial CTS recipe

* v3 recipe which is based on CTS superset. More generic embedding processing before backend. PLDA multisession scoring.

* Minor fixes of sre/v3 recipe

* Some path corrections.

* Minor corrections of sre/v3/recipe

* Minor corrections in sre/v3 recipe.

* Adding some missing scripts.

* Added some flexibility for VAD usage in data preparation.

* Bugfixes

* minor fixes in sre data preparation

* Minor bugfix

* Minor bugfix

* Adding cosine scoring with LDA etc. preprocessing.

* Updated README

* Updated README.

* Updated README.

* Updated README.

* Updated README.

* Updated README.

* Updated README.

* Updated README.

* Updated README.

* Updated README.

* Updated README.

* Fixed flake errors

* Changed tabs to spaces.

* Fix trailing spaces.

* Remove dependence on sph2pipe including adding in modified scripts from Kaldi SRE16 recipe.

* Fix spaces.

* Updated README

* Some yapf fixes

* Updating README. Mostly a dummy commit to do pre commit testing after merging in master branch.

---------

Co-authored-by: Rohdin Johan A. <rohdin@pczmolikova.fit.vutbr.cz>
gulamungon and Rohdin Johan A. authored Aug 29, 2024
1 parent 4be9d57 commit 03ceb00
Showing 40 changed files with 3,501 additions and 15 deletions.
29 changes: 29 additions & 0 deletions examples/sre/v3/README
@@ -0,0 +1,29 @@
make_system_sad.py was changed slightly so that a large data set is split into
parts when extracting VAD. Otherwise it took ages to start. This also helps in
case there is a crash, since output is saved after each part instead of after
the whole set.

# We use some scripts from Kaldi (combine_data.sh and fix_data_dir.sh)

# This should not be needed anymore.
# ln -s $KALDI_ROOT/egs/wsj/s5/utils
# export PATH=$PATH:$(pwd)/utils/ # This is necessary since some Kaldi scripts assume other Kaldi scripts exist in the path.
#export PATH=$PATH:$KALDI_ROOT/


CTS
                              spk   / utt
Org. data                     6867  / 605760
After VAD                     6867  / 605704
After removing T < 5s         6867  / 604774
After removing utt/spk < 3    6867  / 604774

VOX
                              spk   / utt
Org. data                     7245  / 1245525
After VAD                     7245  / 1245469
After removing T < 5s         7245  / 816385
After removing utt/spk < 3    7245  / 816385

Total
After removing utt/spk < 3    14112 / 1421159
99 changes: 99 additions & 0 deletions examples/sre/v3/README.md
@@ -0,0 +1,99 @@
### Main differences from ../v2
* The training data is the CTS superset plus VoxCeleb with GSM codec
* The test data is SRE16, SRE18, and SRE21
* Preprocessing of embeddings before backend/scoring is supported

### Important
Similarly to ../v2, this recipe uses silero vad https://github.com/snakers4/silero-vad
downloaded from here https://github.com/snakers4/silero-vad/archive/refs/tags/v4.0.zip
If you intend to use this recipe for an evaluation/competition, make sure to check that
it is allowed to use the data that has been used to train Silero.

### Instructions
* Set the paths in stage 1. The variable ```sre_data_dir``` is assumed to be prepared by
Kaldi (https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v2).
Only the eval and unlabeled (major) data of sre16 is taken from there.
```voxceleb_dir``` is the path to voxceleb prepared by wespeaker (```../../voxceleb/v2```).
If you set it to "" (empty string), the preparation will be run here. For the other datasets,
provide the path to the folder provided by LDC. The relevant LDC numbers and
file names of the data can be seen in the script. If you lack
one or two of the "eval/dev" sets of "sre16", "sre18" or "sre21" and do not specify them, you may
have to comment them out in some more places in order to avoid crashes. (Eventually
the script will hopefully be made more robust to this.)
If you don't have the CTS superset data, you can skip stage 5 in ```local/prepare_data.sh```
and replace the CTS data with some other data, e.g., the training data prepared in ```../v2```.
If so, it is probably easiest to name this data "CTS" since this name is assumed later
in the recipe.
* Select which torchrun command to use in stage 3. The first line
(currently commented) is for "single-node, multi-worker" (one
pytorch job per machine). The second line is for "Stacked
single-node multi-worker" (more than one pytorch job may be
submitted to the same node in your cluster.) See
https://pytorch.org/docs/stable/elastic/run.html for explanations.
* Stage 3 (training) and stage 4 (embedding extraction) need GPU. You may have
to arrange how to run these parts based on your environment.


### Explanation of embedding processing

The code supports flexible combinations of embedding processing steps, such as length-norm and LDA.
A processing chain is specified, e.g., as follows:
```
mean-subtract --scp $mean1_scp | length-norm | lda --scp $lda_scp --utt2spk $utt2spk --dim $lda_dim | length-norm
```
The script ```wespeaker/bin/prep_embd_proc.py``` takes such a processing chain as input, loops through the processing steps (separated by ```|```), calculates
the necessary processing parameters (means, lda transforms etc.) and stores the whole processing chain with parameters in
pickle format. The parameters for each step will be calculated sequentially and the data specified for the parameter estimation of a step will
be processed by the earlier steps. Therefore the data for the different steps can be different. For example when estimating LDA in the above chain, the data given by ```$lda_scp``` will first be processed by ```mean-subtract``` whose parameters were estimated by ```$mean1_scp``` which could be a different dataset.
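
The sequential estimation described above can be illustrated with a minimal sketch. It is not the code of ```wespeaker/bin/prep_embd_proc.py```: the function names, the in-memory `datasets` dictionary (a stand-in for the scp files) and the chain representation are all assumptions made for illustration. Only `mean-subtract` and `length-norm` are handled, and a second `mean-subtract` takes the place of LDA to keep the example free of dependencies.

```python
import math


def length_norm(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def apply_chain(chain, vec):
    """Apply all already estimated steps, in order, to one embedding."""
    for name, params in chain:
        if name == "mean-subtract":
            vec = [x - m for x, m in zip(vec, params["mean"])]
        elif name == "length-norm":
            vec = length_norm(vec)
        else:
            raise ValueError(f"unsupported step: {name}")
    return vec


def estimate_chain(spec, datasets):
    """Estimate the parameters of each step of a '|'-separated chain in sequence.

    `datasets` maps the name given after --scp to a list of embeddings
    (hypothetical stand-in for reading an scp file).
    """
    chain = []
    for link in spec.split("|"):
        tokens = link.split()
        name = tokens[0]
        if name == "mean-subtract":
            data = datasets[tokens[tokens.index("--scp") + 1]]
            # The estimation data is first processed by all earlier steps.
            processed = [apply_chain(chain, v) for v in data]
            dim = len(processed[0])
            mean = [sum(v[d] for v in processed) / len(processed) for d in range(dim)]
            chain.append((name, {"mean": mean}))
        elif name == "length-norm":
            chain.append((name, {}))  # no parameters to estimate
        else:
            raise ValueError(f"unsupported step: {name}")
    return chain


datasets = {
    "set1": [[1.0, 0.0], [3.0, 0.0]],
    "set2": [[5.0, 0.0], [2.0, 3.0]],
}
chain = estimate_chain(
    "mean-subtract --scp set1 | length-norm | mean-subtract --scp set2", datasets
)
# The second mean is computed on set2 AFTER subtracting the set1 mean and
# length-normalizing, not on the raw set2 vectors.
print(chain[0][1]["mean"], chain[2][1]["mean"])
```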
In scenarios where unlabeled domain adaptation data is available, we want to use this data for the first mean subtraction while still using the out-of-domain data for LDA estimation. This CANNOT be achieved by specifying the processing chain
```
mean-subtract --scp $indomain_scp | length-norm | lda --scp $lda_scp --utt2spk $utt2spk --dim $lda_dim | length-norm
```
since this would have the consequence that in LDA estimation, the data (```$lda_scp```) would be subjected to mean subtraction
using the mean of the in-domain data (```$indomain_scp```). To solve this, we have an additional script ```wespeaker/bin/update_embd_proc.py``` used as follows
```
new_link="mean-subtract --scp $indomain_scp"
python wespeaker/bin/update_embd_proc.py --in_path $preprocessing_path_cts_aug --out_path $preprocessing_path_sre18_unlab --link_no_to_remove 0 --new_link "$new_link"
```
where ```$preprocessing_path_cts_aug``` is the path to the pickled original processing chain and ```$preprocessing_path_sre18_unlab``` is the path to the new pickled processing chain.
The script will remove link 0 (here ```mean-subtract --scp $mean1_scp```) and replace it with ```mean-subtract --scp $indomain_scp```, leaving the other links unchanged.
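
The effect of replacing a link can be sketched as follows. This is only an illustration of the idea, not the implementation of ```wespeaker/bin/update_embd_proc.py```: the list-of-tuples chain layout, the `replace_link` helper and all numbers are hypothetical, and the real pickle format may differ.

```python
import copy


def replace_link(chain, link_no, new_link):
    """Return a copy of `chain` with link `link_no` replaced by `new_link`.

    All other links keep the parameters that were estimated earlier.
    """
    if not 0 <= link_no < len(chain):
        raise IndexError(f"no link {link_no} in a chain of length {len(chain)}")
    new_chain = copy.deepcopy(chain)
    new_chain[link_no] = new_link
    return new_chain


# Hypothetical chain estimated on out-of-domain data (values are placeholders).
original = [
    ("mean-subtract", {"mean": [2.0, 0.0]}),
    ("length-norm", {}),
    ("lda", {"transform": [[1.0, 0.0]]}),  # estimated with the original link 0 in place
    ("length-norm", {}),
]

# Mean estimated on the unlabeled in-domain data (placeholder values).
indomain_link = ("mean-subtract", {"mean": [0.5, -1.0]})
adapted = replace_link(original, 0, indomain_link)

# Only link 0 differs; the LDA parameters are those of the original chain.
print(adapted[0], adapted[2] == original[2])
```

Working on a copy keeps the original chain usable for the sets that have no adaptation data.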


### Regarding extractor training data pruning

Similarly to ```../v2``` and Kaldi's sre16 recipe, we discard some of the training utterances based on duration as well as training speakers based on their number of utterances.
This is controlled in stage 9 of ```local/prepare_data.sh```. It is quite flexible but currently a bit messy and some consequences of the settings are not obvious. Therefore some explanation is provided here.
There are three "blocks" in stage 9:
* The first block discards all utterances shorter than or equal to some specified duration (currently set to 5s) according to VOICED DURATION.
* The second block discards all utterances shorter than or equal to some specified duration (currently set to 5s) according to TOTAL DURATION, i.e., ignoring VAD info.
* The third block discards all speakers that have a specified number of utterances or fewer. (Currently set to 2, i.e., speakers with 3 or more utterances are kept.)
It is possible to set the thresholds differently for the different sets. IMPORTANT: The pruning in block 1 is based on ```data/data_set_name/utt2voice_dur``` which is calculated
from the VAD info, so if a recording does not have any speech, it will not be present in utt2voice_dur and therefore discarded in this block even if the duration threshold is
set to e.g. -1. If we want such utterances to be kept for one set we should not run this block for the set (as currently is the case for voxceleb). The current setup is as follows:
1. Apply block one to CTS but not Voxceleb.
2. Apply block two to Voxceleb but not CTS. (Applying this block to CTS would not have an effect if the thresholds are the same since the total duration is always larger than or equal to the voiced duration.)
3. Apply block three to both CTS and VoxCeleb.
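
The three blocks can be summarized with a small sketch. It only mirrors the rules described above and is not the code in stage 9 of ```local/prepare_data.sh```: the function names and the toy data are assumptions.

```python
def prune_by_duration(utts, utt2dur, threshold):
    """Keep utterances whose duration is strictly larger than `threshold`.

    Utterances missing from `utt2dur` are dropped regardless of the threshold.
    """
    return [u for u in utts if u in utt2dur and utt2dur[u] > threshold]


def prune_speakers(utts, utt2spk, threshold):
    """Keep utterances of speakers with strictly more than `threshold` utterances."""
    counts = {}
    for u in utts:
        counts[utt2spk[u]] = counts.get(utt2spk[u], 0) + 1
    return [u for u in utts if counts[utt2spk[u]] > threshold]


# Toy data: utterance "b1" has no speech according to VAD, so it has no
# entry in utt2voice_dur.
utt2spk = {"a1": "A", "a2": "A", "a3": "A", "a4": "A", "b1": "B", "b2": "B"}
utts = list(utt2spk)
utt2voice_dur = {"a1": 6.0, "a2": 7.0, "a3": 5.0, "a4": 9.0, "b2": 8.0}
utt2dur = {"a1": 8.0, "a2": 9.0, "a3": 7.0, "a4": 12.0, "b1": 6.0, "b2": 10.0}

block1 = prune_by_duration(utts, utt2voice_dur, 5.0)      # voiced duration
block1_neg = prune_by_duration(utts, utt2voice_dur, -1.0)  # "b1" is still dropped
block2 = prune_by_duration(utts, utt2dur, 5.0)             # total duration
block3 = prune_speakers(block1, utt2spk, 2)                # speakers with <= 2 utts removed

print(block1, block1_neg, block2, block3)
```

Note how `block1_neg` still lacks "b1": a negative threshold does not bring back utterances that are missing from the voiced-duration file, which is why block one is not run at all for sets where such utterances should be kept.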

This means Voxceleb recordings are kept even if they have no speech according to VAD. The later shard creation stage applies VAD if available and otherwise keeps the file as it is, so Voxceleb recordings with no speech according to VAD will NOT be discarded (but there are only around 70 of them, which is unlikely to have any effect on the trained system). Also, there is a risk that pruning according to total duration while applying VAD in shard creation could result in recordings shorter than "num_frms". These will be zero padded at training time so there will be no crash, but this is probably also suboptimal.
These settings are arguably somewhat weird. Applying block one also to voxceleb (and not using block two at all) would be more reasonable, but it seems to degrade the performance due to discarding too many files. A better solution than the current one would be to try smaller thresholds than 5s, but we have not had time to explore this yet. Also, it would be reasonable to discard recordings with no speech according to VAD in the shard creation stage. However, when no VAD is available for a file, the shard creation code does not know whether this is because no speech was detected for this file according to VAD, or because VAD was not run for this file. Since we want to have the possibility to keep recordings for which the latter is the case, we have it this way (it could for example be considered not to use VAD for voxceleb at all, in which case we need to avoid discarding these files at the shard creation stage). A more flexible and clear solution is needed and we will work on this for future updates.


### Some data statistics
| | CTS #utt | CTS #spk | VoxCeleb #utt | VoxCeleb #spk | comment |
| --- | --- | --- | --- | --- | --- |
| Original data | 605760 | 6867 | 1245525 | 7245 | |
| Excluding recordings with no speech according to VAD | 605704 | 6867 | 1245455 | 7245 | VAD is a bit random so these numbers could vary slightly, especially for voxceleb. |
| After filtering according to voiced duration | 604774 | 6867 | 816411 | 7245 | These numbers could vary slightly for the same reason. We don't use this for voxceleb in the current settings. |
| After filtering according to total duration | - | - | 868326 | 7245 | Haven't checked this for CTS. |

No speakers are discarded in block three with the current settings.


### Things to explore
Very few things have been tuned. For example, the following could be low-hanging fruit:
* The above mentioned pruning rules
* Utterance durations of the training segments.
* Should voxceleb be included? Is applying the GSM codec a good idea? (Note that the GSM codec is applied in the data preparation stage while augmentation is applied at training time, i.e., the GSM codec comes before augmentations. This is not so realistic, since in reality noise and reverberation come before the data is recorded and encoded. However, it is consistent with CTS, where we also apply augmentations to the already encoded audio since it was encoded at recording time.)
* The other architectures.

We will tune this further in the future. We are also happy to hear about any such results obtained by others.
81 changes: 81 additions & 0 deletions examples/sre/v3/conf/resnet.yaml
@@ -0,0 +1,81 @@
### train configuration

exp_dir: exp/ResNet34-TSTP-emb256-fbank40-num_frms200-aug0.6-spFalse-saFalse-Softmax-SGD-epoch150
gpus: "[0,1]"
num_avg: 10
enable_amp: False # whether enable automatic mixed precision training

seed: 42
num_epochs: 150
save_epoch_interval: 5 # save model every 5 epochs
log_batch_interval: 100 # log every 100 batches

dataloader_args:
  batch_size: 256
  num_workers: 7 # Total number of cores will be (this +1)*num_gpus
  pin_memory: False
  prefetch_factor: 8
  drop_last: True

dataset_args:
  # the sample number which will be traversed within one epoch, if the value equals to 0,
  # the utterance number in the dataset will be used as the sample_num_per_epoch.
  sample_num_per_epoch: 780000
  shuffle: True
  shuffle_args:
    shuffle_size: 1500
  filter: True
  filter_args:
    min_num_frames: 100
    max_num_frames: 300
  resample_rate: 8000
  speed_perturb: False
  num_frms: 200
  aug_prob: 0.6 # prob to add reverb & noise aug per sample
  fbank_args:
    num_mel_bins: 64
    frame_shift: 10
    frame_length: 25
    dither: 1.0
  spec_aug: False
  spec_aug_args:
    num_t_mask: 1
    num_f_mask: 1
    max_t: 10
    max_f: 8
    prob: 0.6

model: ResNet34 # ResNet18, ResNet34, ResNet50, ResNet101, ResNet152
model_init: null
model_args:
  feat_dim: 64
  embed_dim: 256
  pooling_func: "TSTP" # TSTP, ASTP, MQMHASTP
  two_emb_layer: False
projection_args:
  project_type: "softmax" # add_margin, arc_margin, sphere, softmax, arc_margin_intertopk_subcenter

margin_scheduler: MarginScheduler
margin_update:
  initial_margin: 0.0
  final_margin: 0.2
  increase_start_epoch: 20
  fix_start_epoch: 40
  update_margin: True
  increase_type: "exp" # exp, linear

loss: CrossEntropyLoss
loss_args: {}

optimizer: SGD
optimizer_args:
  momentum: 0.9
  nesterov: True
  weight_decay: 0.0001

scheduler: ExponentialDecrease
scheduler_args:
  initial_lr: 0.1
  final_lr: 0.00005
  warm_up_epoch: 6
  warm_from_zero: True
119 changes: 119 additions & 0 deletions examples/sre/v3/local/create_preproc_embd_lists.sh
@@ -0,0 +1,119 @@
#!/bin/bash

# Copyright (c) 2024 Johan Rohdin (rohdin@fit.vutbr.cz)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The preprocessed embeddings are already stored but we need to create the lists
# as score.sh wants them.

exp_dir=$1
data=data

# We have three different preprocessors for which we need to prepare the lists
# embd_proc_cts_aug.pkl # LDA and cts_aug mean subtraction
# embd_proc_sre16_major.pkl # LDA and sre16_major mean subtraction (Only used for SRE16)
# embd_proc_sre18_dev_unlabeled.pkl # LDA and sre18_dev_unlabeled mean subtraction (Only used for SRE18)


### !!!
# Note that xvector2 is only a hack for BUT

##################################################################
# CTS AUG for all sets
echo "mean vector of enroll"
python tools/vector_mean.py \
--spk2utt ${data}/sre16/eval/enrollment/spk2utt \
--xvector_scp $exp_dir/embeddings/sre16/eval/enrollment/xvector_proc_embd_proc_cts_aug.scp \
--spk_xvector_ark $exp_dir/embeddings/sre16/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.ark

python tools/vector_mean.py \
--spk2utt ${data}/sre18/dev/enrollment/mdl_id2utt \
--xvector_scp $exp_dir/embeddings/sre18/dev/enrollment/xvector_proc_embd_proc_cts_aug.scp \
--spk_xvector_ark $exp_dir/embeddings/sre18/dev/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.ark

python tools/vector_mean.py \
--spk2utt ${data}/sre18/eval/enrollment/mdl_id2utt \
--xvector_scp $exp_dir/embeddings/sre18/eval/enrollment/xvector_proc_embd_proc_cts_aug.scp \
--spk_xvector_ark $exp_dir/embeddings/sre18/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.ark

python tools/vector_mean.py \
--spk2utt ${data}/sre21/dev/enrollment/mdl_id2utt \
--xvector_scp $exp_dir/embeddings/sre21/dev/enrollment/xvector_proc_embd_proc_cts_aug.scp \
--spk_xvector_ark $exp_dir/embeddings/sre21/dev/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.ark

python tools/vector_mean.py \
--spk2utt ${data}/sre21/eval/enrollment/mdl_id2utt \
--xvector_scp $exp_dir/embeddings/sre21/eval/enrollment/xvector_proc_embd_proc_cts_aug.scp \
--spk_xvector_ark $exp_dir/embeddings/sre21/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.ark


# Create one scp with both enroll and test since this is expected by some scripts
cat ${exp_dir}/embeddings/sre16/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.scp \
${exp_dir}/embeddings/sre16/eval/test/xvector_proc_embd_proc_cts_aug.scp \
> ${exp_dir}/embeddings/sre16/eval/xvector_proc_embd_proc_cts_aug.scp

cat ${exp_dir}/embeddings/sre18/dev/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.scp \
${exp_dir}/embeddings/sre18/dev/test/xvector_proc_embd_proc_cts_aug.scp \
> ${exp_dir}/embeddings/sre18/dev/xvector_proc_embd_proc_cts_aug.scp

cat ${exp_dir}/embeddings/sre18/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.scp \
${exp_dir}/embeddings/sre18/eval/test/xvector_proc_embd_proc_cts_aug.scp \
> ${exp_dir}/embeddings/sre18/eval/xvector_proc_embd_proc_cts_aug.scp

cat ${exp_dir}/embeddings/sre21/dev/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.scp \
${exp_dir}/embeddings/sre21/dev/test/xvector_proc_embd_proc_cts_aug.scp \
> ${exp_dir}/embeddings/sre21/dev/xvector_proc_embd_proc_cts_aug.scp

cat ${exp_dir}/embeddings/sre21/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.scp \
${exp_dir}/embeddings/sre21/eval/test/xvector_proc_embd_proc_cts_aug.scp \
> ${exp_dir}/embeddings/sre21/eval/xvector_proc_embd_proc_cts_aug.scp


##################################################################
# sre16_major for sre16 eval
echo "mean vector of enroll"
python tools/vector_mean.py \
--spk2utt ${data}/sre16/eval/enrollment/spk2utt \
--xvector_scp $exp_dir/embeddings/sre16/eval/enrollment/xvector_proc_embd_proc_sre16_major.scp \
--spk_xvector_ark $exp_dir/embeddings/sre16/eval/enrollment/enroll_spk_xvector_proc_embd_proc_sre16_major.ark

# Create one scp with both enroll and test since this is expected by some scripts
cat ${exp_dir}/embeddings/sre16/eval/enrollment/enroll_spk_xvector_proc_embd_proc_sre16_major.scp \
${exp_dir}/embeddings/sre16/eval/test/xvector_proc_embd_proc_sre16_major.scp \
> ${exp_dir}/embeddings/sre16/eval/xvector_proc_embd_proc_sre16_major.scp


##################################################################
# sre18_dev_unlabeled for sre18 dev/eval
echo "mean vector of enroll"
python tools/vector_mean.py \
--spk2utt ${data}/sre18/dev/enrollment/mdl_id2utt \
--xvector_scp $exp_dir/embeddings/sre18/dev/enrollment/xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
--spk_xvector_ark $exp_dir/embeddings/sre18/dev/enrollment/enroll_spk_xvector_proc_embd_proc_sre18_dev_unlabeled.ark

python tools/vector_mean.py \
--spk2utt ${data}/sre18/eval/enrollment/mdl_id2utt \
--xvector_scp $exp_dir/embeddings/sre18/eval/enrollment/xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
--spk_xvector_ark $exp_dir/embeddings/sre18/eval/enrollment/enroll_spk_xvector_proc_embd_proc_sre18_dev_unlabeled.ark

# Create one scp with both enroll and test since this is expected by some scripts
cat ${exp_dir}/embeddings/sre18/dev/enrollment/enroll_spk_xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
${exp_dir}/embeddings/sre18/dev/test/xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
> ${exp_dir}/embeddings/sre18/dev/xvector_proc_embd_proc_sre18_dev_unlabeled.scp

cat ${exp_dir}/embeddings/sre18/eval/enrollment/enroll_spk_xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
${exp_dir}/embeddings/sre18/eval/test/xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
> ${exp_dir}/embeddings/sre18/eval/xvector_proc_embd_proc_sre18_dev_unlabeled.scp

1 change: 1 addition & 0 deletions examples/sre/v3/local/download_data.sh