TCPGen in Conformer RNN-T #2890


Closed · wants to merge 120 commits
Commits
c686251
first commit BrianSun
Aug 6, 2022
8a6ae5c
second commit
Aug 6, 2022
b52cda0
add specific paths to cudnn
Aug 30, 2022
657d8c6
constructing tree done
Sep 7, 2022
4084d5f
Add biasing LIBRISPEECH data processor
Sep 7, 2022
96eeb6f
implemented tree search and DBRNNT training procedure
Sep 15, 2022
1a00344
add fused log smax option
Oct 26, 2022
b8621e8
fixes
Oct 27, 2022
68664dc
move out log softmax
Oct 27, 2022
e063b33
code for inference with TCPGen
Oct 28, 2022
df4a213
Merge branch 'rnntl-log-probs' into tcpgen
Oct 28, 2022
5e34548
changed debug training
Nov 4, 2022
f163b84
resolve version issue with pl
Nov 4, 2022
baeac25
Merge branch 'tcpgen' of https://github.com/BriansIDP/audio into tcpgen
Nov 4, 2022
6157143
updated train.sh and eval.sh scripts
Dec 4, 2022
ae1abc0
changed from 1024 bpe to 600 bpe
Dec 4, 2022
38506a1
Added biasing option
Dec 4, 2022
aca8118
Implemented loss calculation for TCPGen and added biasing option
Dec 4, 2022
9195e22
Added biasing option
Dec 4, 2022
bb65b04
Making it deterministic as set union is not
Dec 4, 2022
f494252
Added biasing option to toggle biasing
Dec 4, 2022
a048f4b
rare word f=15
Dec 4, 2022
52c1970
Newfiles: global_stats for clean 100 and rnnt decoding prototype with…
Dec 4, 2022
abba122
README
Dec 4, 2022
6c68eee
Add subset option
Dec 5, 2022
e5b2c8e
Added documentation
Dec 5, 2022
4a37816
Added prefix-based wordpiece search algorithm, and added documentation
Dec 5, 2022
3c534d0
Added documentation
Dec 5, 2022
ab2a553
Update README.md
BriansIDP Dec 5, 2022
303977d
Add scoring pipeline
Dec 5, 2022
12f6d66
Merge branch 'tcpgen' of https://github.com/BriansIDP/audio into tcpgen
Dec 5, 2022
c00678a
Update README.md
BriansIDP Dec 5, 2022
3c57570
Update README.md
BriansIDP Dec 5, 2022
a779026
formatting
Dec 5, 2022
a009973
removed newline
Dec 5, 2022
d7e7172
formatting
Dec 5, 2022
e4f8b1c
formatting
Dec 5, 2022
e0da63e
formatting
Dec 5, 2022
ec150b2
formatting
Dec 5, 2022
8f338b8
first commit BrianSun
Aug 6, 2022
0e3a652
second commit
Aug 6, 2022
b29f05a
add specific paths to cudnn
Aug 30, 2022
57e906f
constructing tree done
Sep 7, 2022
3c8212b
Add biasing LIBRISPEECH data processor
Sep 7, 2022
a76adbb
implemented tree search and DBRNNT training procedure
Sep 15, 2022
bd9651b
code for inference with TCPGen
Oct 28, 2022
244c58a
resolve version issue with pl
Nov 4, 2022
5cb47fc
add fused log smax option
Oct 26, 2022
123d386
fixes
Oct 27, 2022
f3b3caf
move out log softmax
Oct 27, 2022
5076b0a
updated train.sh and eval.sh scripts
Dec 4, 2022
2f9e012
changed from 1024 bpe to 600 bpe
Dec 4, 2022
9596ba1
Added biasing option
Dec 4, 2022
d3d3dcc
Implemented loss calculation for TCPGen and added biasing option
Dec 4, 2022
365439a
Added biasing option
Dec 4, 2022
f5199f2
Making it deterministic as set union is not
Dec 4, 2022
3a1bfe6
Added biasing option to toggle biasing
Dec 4, 2022
a4933ef
rare word f=15
Dec 4, 2022
36ff039
Newfiles: global_stats for clean 100 and rnnt decoding prototype with…
Dec 4, 2022
0583740
README
Dec 4, 2022
9f90fa5
Add subset option
Dec 5, 2022
86b3611
Added documentation
Dec 5, 2022
cb51322
Added prefix-based wordpiece search algorithm, and added documentation
Dec 5, 2022
d956261
Added documentation
Dec 5, 2022
bb1694e
Add scoring pipeline
Dec 5, 2022
882e2e5
Update README.md
BriansIDP Dec 5, 2022
1b1fa0a
Update README.md
BriansIDP Dec 5, 2022
971233d
Update README.md
BriansIDP Dec 5, 2022
78ec4e8
formatting
Dec 5, 2022
e8788c7
removed newline
Dec 5, 2022
72e090b
formatting
Dec 5, 2022
759217d
formatting
Dec 5, 2022
89f2a23
formatting
Dec 5, 2022
b1cff40
formatting
Dec 5, 2022
9c488b4
Merge branch 'tcpgen' of https://github.com/BriansIDP/audio into tcpgen
Dec 6, 2022
d9fb233
Upgrade nightly wheels to ROCm5.3 (#2853)
jithunnair-amd Dec 7, 2022
21a36d5
Introduce MUSAN dataset (#2888)
hwangjeff Dec 7, 2022
bc9702f
Add additive noise transform (#2889)
hwangjeff Dec 7, 2022
65dfe03
Add feature badges to preemphasis and deemphasis functions (#2892)
hwangjeff Dec 8, 2022
d653db6
Add HiFi GAN Generator to prototypes (#2860)
sgrigory Dec 8, 2022
f78d6c4
Fix docs warnings for conformer w2v2 (#2900)
Dec 8, 2022
cbcde5c
Follow up on WavLM bundles (#2895)
sgrigory Dec 8, 2022
2aa06d3
Toggle on/off ffmpeg test if needed (#2901)
atalman Dec 9, 2022
6149b3a
Fix wrong frame allocation in StreamWriter (#2905)
mthrok Dec 9, 2022
c56126a
Fix duplicated memory allocation in StreamWriter (#2906)
mthrok Dec 9, 2022
f420995
Update author and maintainer info (#2911)
mthrok Dec 9, 2022
b3c6a4d
Fix integration test for WAV2VEC2_ASR_LARGE_LV60K_10M (#2910)
nateanl Dec 9, 2022
98e51d8
Update model documentation structure (#2902)
mthrok Dec 10, 2022
25a6692
Fix type of arguments in torchaudio.io classes (#2913)
nateanl Dec 10, 2022
04aa82e
Update PR labels (#2912)
Dec 11, 2022
93e84be
implemented tree search and DBRNNT training procedure
Sep 15, 2022
c939856
code for inference with TCPGen
Oct 28, 2022
1561f3c
formatting
Dec 5, 2022
8b1f4a6
first commit BrianSun
Aug 6, 2022
f3b4996
second commit
Aug 6, 2022
8f995ec
add specific paths to cudnn
Aug 30, 2022
1c446a1
implemented tree search and DBRNNT training procedure
Sep 15, 2022
8bc5bc5
resolve version issue with pl
Nov 4, 2022
a8c9b3a
add fused log smax option
Oct 26, 2022
786e158
fixes
Oct 27, 2022
3190574
move out log softmax
Oct 27, 2022
d1dc749
formatting
Dec 5, 2022
4ad94df
Addressing comments about PR
Jan 20, 2023
89f0f8c
Addressed Comments on the PR
Jan 20, 2023
199ff86
Update README.md
BriansIDP Jan 20, 2023
3ee4558
Update README.md
BriansIDP Jan 20, 2023
899ada7
Merge remote-tracking branch 'upstream/main' into tcpgen
Jan 20, 2023
b8c3bfa
Merge branch 'tcpgen' of https://github.com/BriansIDP/audio into tcpgen
Jan 20, 2023
29c6999
change the name of the directory to be more clear
Jan 21, 2023
ea02d48
change the name of the directory to be more clear
Jan 21, 2023
8b3ea02
Solve bugs due to new updates from the main branch
Jan 22, 2023
4e132b7
temporarily commit for debugging
Jan 22, 2023
bd9ada9
current train.sh
Jan 22, 2023
c239488
Use hptr as input as a default
Jan 22, 2023
92a94a4
Addressing nateanl's comments
Jan 25, 2023
d12c3f2
changed the name of LIBRISPEECHBIASING to LibriSpeechBiasing
Feb 3, 2023
15e1276
removed train.sh and eval.sh files
Feb 3, 2023
f4139b1
addressing comments from @nateanl
Feb 3, 2023
51d5866
Added comments for arguments
Feb 9, 2023
28b2abc
formatted files
Feb 13, 2023
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
+cmake_minimum_required(VERSION 3.17 FATAL_ERROR)

# Most of the configurations are taken from PyTorch
# https://github.com/pytorch/pytorch/blob/0c9fb4aff0d60eaadb04e4d5d099fb1e1d5701a9/CMakeLists.txt
75 changes: 75 additions & 0 deletions examples/asr/librispeech_conformer_rnnt_biasing/README.md
@@ -0,0 +1,75 @@
# Contextual Conformer RNN-T with TCPGen Example

This directory contains sample implementations of training and evaluation pipelines for the Conformer RNN-T model with tree-constrained pointer generator (TCPGen) for contextual biasing, as described in the paper: [Tree-Constrained Pointer Generator for End-to-End Contextual Speech Recognition](https://ieeexplore.ieee.org/abstract/document/9687915)

## Setup
### Install PyTorch and TorchAudio nightly or from source
Because Conformer RNN-T is currently a prototype feature, you will need to either use the TorchAudio nightly build or build TorchAudio from source. Note also that GPU support is required for training.

To install the nightly, follow the directions at <https://pytorch.org/>.

To build TorchAudio from source, refer to the [contributing guidelines](https://github.com/pytorch/audio/blob/main/CONTRIBUTING.md).

### Install additional dependencies
```bash
pip install pytorch-lightning sentencepiece
```

## Usage

### Training

[`train.py`](./train.py) trains a Conformer RNN-T model with TCPGen on LibriSpeech using PyTorch Lightning. Note that the script expects users to have the following:
- Access to GPU nodes for training.
- Full LibriSpeech dataset.
- SentencePiece model to be used to encode targets; the model can be generated using [`train_spm.py`](./train_spm.py). **Note that suffix-based wordpieces are used in this example**. [`run_spm.sh`](./run_spm.sh) generates the 600 suffix-based wordpieces used in the paper.
- File (`--global_stats_path`) that contains training-set feature statistics; this file can be generated using [`global_stats.py`](../emformer_rnnt/global_stats.py). [`global_stats_100.json`](./global_stats_100.json) was generated from train-clean-100.
- Biasing list: see [`rareword_f15.txt`](./blists/rareword_f15.txt) for an example of the biasing list used for training on the clean-100 data; words that appear fewer than 15 times are treated as rare words. For inference, use [`all_rare_words.txt`](blists/all_rare_words.txt), the same list used in [https://github.com/facebookresearch/fbai-speech/tree/main/is21_deep_bias](https://github.com/facebookresearch/fbai-speech/tree/main/is21_deep_bias).

Additional training options:
- `--droprate` is the rate at which biasing words appearing in the reference text are dropped from the biasing list, to avoid over-confidence in the biasing branch
- `--maxsize` is the size of the biasing list used for training, i.e. the number of biasing words from the reference plus the distractors

Sample SLURM command:
```bash
srun --cpus-per-task=16 --gpus-per-node=1 -N 1 --ntasks-per-node=1 python train.py --exp-dir <Path_to_exp> --librispeech-path <Path_to_librispeech_data> --sp-model-path ./spm_unigram_600_100suffix.model --biasing --biasing-list ./blists/rareword_f15.txt --droprate 0.1 --maxsize 200 --epochs 90
```
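The interaction of `--droprate` and `--maxsize` can be pictured with a small sketch of per-utterance biasing-list assembly. This is a hypothetical helper, not the code in this example: the function name, the sampling scheme, and the use of lists (rather than sets, to keep the result deterministic) are assumptions.

```python
import random

def make_biasing_list(ref_words, rare_words, full_blist,
                      maxsize=200, droprate=0.1, rng=random):
    """Assemble a per-utterance biasing list: rare words from the
    reference (each kept with probability 1 - droprate), padded with
    random distractors from the full list up to maxsize entries."""
    kept = [w for w in ref_words if w in rare_words and rng.random() >= droprate]
    pool = [w for w in full_blist if w not in kept]
    distractors = rng.sample(pool, max(0, maxsize - len(kept)))
    return kept + distractors
```

With `droprate=0.1` and `maxsize=200` as in the sample command above, roughly 90% of the reference rare words would survive, and the remainder of the 200 slots would be filled with distractors.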

### Evaluation

[`eval.py`](./eval.py) evaluates a trained Conformer RNN-T model with TCPGen on LibriSpeech test-clean.

Additional decoding options:

- `--biasing-list` should be [`all_rare_words.txt`](blists/all_rare_words.txt) for LibriSpeech experiments
- `--droprate` should normally be 0, since the reference biasing words are assumed to be included in the list
- `--maxsize` is the size of the biasing list used for decoding; 1000 was used in the paper.

Sample SLURM command:
```bash
srun --cpus-per-task=16 --gpus-per-node=1 -N 1 --ntasks-per-node=1 python eval.py --checkpoint-path <Path_to_model_checkpoint> --librispeech-path <Path_to_librispeech_data> --sp-model-path ./spm_unigram_600_100suffix.model --expdir <Path_to_exp> --use-cuda --biasing --biasing-list ./blists/all_rare_words.txt --droprate 0.0 --maxsize 1000
```
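TCPGen constrains its pointer distribution at each decoding step to the wordpieces that continue some word in the biasing list, which requires a prefix tree over the wordpiece-encoded biasing words. A minimal sketch of such a tree follows; the nested-dict node layout is an assumption, and the `sp_model.encode` call mirrors the SentencePiece Python API but is stubbed out in practice here — this is not the implementation in this example.

```python
def build_prefix_tree(biasing_words, sp_model):
    """Build a wordpiece prefix tree (trie) over the biasing list.
    Each node is a dict mapping a wordpiece id to a child node;
    a None key marks the end of a complete biasing word."""
    root = {}
    for word in biasing_words:
        pieces = sp_model.encode(word)  # list of wordpiece ids
        node = root
        for p in pieces:
            node = node.setdefault(p, {})
        node[None] = True  # terminal marker
    return root

def valid_next_pieces(node):
    """Wordpieces allowed at this tree node; the pointer distribution
    is restricted to these symbols during decoding."""
    return [p for p in node if p is not None]
```

During beam search, the decoder would walk this tree as wordpieces are emitted, resetting to the root whenever the current hypothesis leaves the tree.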

### Scoring
First install SCTK, the NIST Scoring Toolkit, following [https://github.com/usnistgov/SCTK/blob/master/README.md](https://github.com/usnistgov/SCTK/blob/master/README.md).

Example scoring script using sclite is in [`score.sh`](./score.sh).

```bash
./score.sh <path_to_decoding_dir>
```

This generates a file named `results.wrd.txt` in the format required by the error-analysis script. Follow these steps to calculate the rare word error rate:

```bash
cd error_analysis
python get_error_word_count.py <path_to_results.wrd.txt>
```
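As a rough picture of what the rare word error rate measures, here is a simplified sketch. The actual script works from aligned sclite output; the bag-of-words approximation and the function name below are mine, not the code in `error_analysis`.

```python
from collections import Counter

def rare_word_error_rate(pairs, rare_words):
    """Approximate R-WER: the fraction of rare words in the references
    that do not appear in the corresponding hypotheses. A bag-of-words
    simplification of the alignment-based computation."""
    errors = total = 0
    for ref, hyp in pairs:
        hyp_counts = Counter(hyp)
        for word in ref:
            if word in rare_words:
                total += 1
                if hyp_counts[word] > 0:
                    hyp_counts[word] -= 1  # consume one matched occurrence
                else:
                    errors += 1
    return errors / total if total else 0.0
```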

Note that the `word_freq.txt` file contains word frequencies for train-clean-100 only. For the full training set it should be regenerated, though doing so only slightly affects the OOV word error rate calculation in this case.

The table below shows WER results on test-clean for a model trained on clean-100. R-WER denotes the rare word error rate, computed over words in the biasing list.

| | WER | R-WER |
|:-------------------:|-------------:|-----------:|
| test-clean | 0.0836 | 0.2366|