- # End-to-End Automatic Speech recogniton
- This is an END-To-END system for speech recognition based on CTC implemented with pytorch.
+ ## Update:
+ Updated to pytorch 1.2 and python 3.
+
+ # CTC-based Automatic Speech Recognition
+ This is a CTC-based speech recognition system implemented with pytorch.
At present, the system only supports phoneme recognition.

- You can also do it at word-level, but you may get a high error rate.
+ You can also run it at the word level, but you may get a high error rate.

Another way is to decode with a lexicon and a word-level language model using WFST, which is not included in this system.
@@ -36,41 +39,35 @@ Chinese Corpus: 863 Corpus
## Install
- Install [Pytorch](http://pytorch.org/)
- - Install [warp-ctc](https://github.com/SeanNaren/warp-ctc) and bind it to pytorch.
- Notice: If you use python 2, reinstall pytorch from source instead of pip.
- - Install pytorch audio:
- ```bash
- sudo apt-get install sox libsox-dev libsox-fmt-all
- git clone https://github.com/pytorch/audio.git
- cd audio
- pip install cffi
- python setup.py install
- ```
+ - ~~Install [warp-ctc](https://github.com/SeanNaren/warp-ctc) and bind it to pytorch.~~
+ ~~Notice: If you use python 2, reinstall pytorch from source instead of pip.~~
+ Use the pytorch 1.2 built-in CTC loss (nn.CTCLoss) now.
- Install [Kaldi](https://github.com/kaldi-asr/kaldi). We use kaldi to extract mfcc and fbank features.
- - Install [KenLM](https://github.com/kpu/kenlm). Train an n-gram language model if needed.
- - Install other python packages
+ - Install pytorch [torchaudio](https://github.com/pytorch/audio.git) (needed when using waveform as input).
+ - ~~Install [KenLM](https://github.com/kpu/kenlm). Train an n-gram language model if needed.~~
+ Use IRSTLM from the kaldi tools instead.
+ - Install and start visdom
```
- pip install -r requirements.txt
+ pip3 install visdom
+ python -m visdom.server
```
- - Start visdom
+ - Install other python packages
```
- python -m visdom.server
+ pip install -r requirements.txt
```
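A minimal usage sketch of the built-in nn.CTCLoss mentioned above; the shapes, vocabulary size, and lengths here are illustrative, not values from this repo:

```python
import torch
import torch.nn as nn

# nn.CTCLoss expects log_probs of shape (time, batch, classes);
# the blank symbol is class index 0 by default.
T, N, C = 50, 4, 20
log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
targets = torch.randint(1, C, (N, 30), dtype=torch.long)        # padded label ids
input_lengths = torch.full((N,), T, dtype=torch.long)           # frames per utterance
target_lengths = torch.randint(10, 30, (N,), dtype=torch.long)  # labels per utterance

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

In the real system, `log_probs` would come from the acoustic model's output after `log_softmax`.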
## Usage
- 1. Install all the things according to the Install part.
- 2. Open the top script run.sh and alter the directory of data and config file.
- 3. Change the $feats if you want to use fbank or mfcc and revise conf file under the directory conf.
- 4. Open the config file to revise the super-parameters about everything
+ 1. Install all the packages according to the Install part.
+ 2. Revise the top script run.sh.
+ 3. Open the config file to revise the hyper-parameters.
4. Run the top script under one of four conditions:
```bash
bash run.sh   # data_prepare + AM training + LM training + testing
bash run.sh 1 # AM training + LM training + testing
bash run.sh 2 # LM training + testing
bash run.sh 3 # testing
```
- LM training are not implemented yet. They are added to the todo-list.
- So only when you prepare the data, run.sh will work.
+ RNN-LM training is not implemented yet; it is on the todo list.
## Data Prepare
1. Extract 39-dim mfcc and 40-dim fbank features with kaldi.
@@ -81,17 +78,17 @@ So only when you prepare the data, run.sh will work.
- RNN + DNN + CTC
  RNN here can be replaced by nn.LSTM and nn.GRU
- CNN + RNN + DNN + CTC
- CNN is use to reduce the variety of spectrum which can be caused by the speaker and environment difference.
+   CNN is used to reduce the variation of the spectrum caused by speaker and environment differences.
- How to choose
- Use add_cnn to choose one of two models. If add_cnn is True, then CNN+RNN+DNN+CTC will be chosen.
+   Use add_cnn to choose between the two models. If add_cnn is True, CNN+RNN+DNN+CTC is used.
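The add_cnn switch can be pictured with a small hypothetical module; the class name, layer sizes, and dimensions below are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

class CTCModel(nn.Module):
    """Illustrative CNN+RNN+DNN+CTC vs. RNN+DNN+CTC toggle."""
    def __init__(self, add_cnn=True, feat_dim=40, hidden=256, num_classes=48):
        super().__init__()
        self.add_cnn = add_cnn
        if add_cnn:
            # 2-D conv front-end over (time, frequency) to smooth
            # speaker/environment variation in the spectrum
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            rnn_in = 32 * feat_dim
        else:
            rnn_in = feat_dim
        self.rnn = nn.LSTM(rnn_in, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.dnn = nn.Linear(2 * hidden, num_classes)  # classes incl. CTC blank

    def forward(self, x):                         # x: (batch, time, feat_dim)
        if self.add_cnn:
            x = self.cnn(x.unsqueeze(1))          # (B, 32, T, F)
            x = x.permute(0, 2, 1, 3).flatten(2)  # (B, T, 32*F)
        x, _ = self.rnn(x)
        return self.dnn(x).log_softmax(-1)        # feed to nn.CTCLoss
```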
## Training:
- initial-lr = 0.001
- decay = 0.5
- weight-decay = 0.005

The learning rate is decayed when the dev loss stays around the same value for ten evaluations.
- Times of adjusting learning rate is 8 which can be alter in steps/ctc_train.py (line 367).
+ The learning rate is adjusted at most 8 times, which can be altered in steps/train_ctc.py (line 367).
The optimizer is torch.optim.Adam with weight decay 0.005.
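The plateau schedule described above can be sketched in plain Python; the class and attribute names are illustrative (the actual logic lives in steps/train_ctc.py):

```python
class PlateauDecay:
    """Halve the lr when the dev loss stalls; stop after max_adjusts decays."""
    def __init__(self, lr=1e-3, decay=0.5, patience=10, max_adjusts=8):
        self.lr, self.decay = lr, decay
        self.patience, self.max_adjusts = patience, max_adjusts
        self.best = float("inf")   # best dev loss seen so far
        self.stall = 0             # consecutive non-improving evaluations
        self.adjusts = 0           # decays applied so far

    def step(self, dev_loss):
        """Observe one dev-set loss; return the lr to use next."""
        if dev_loss < self.best:
            self.best = dev_loss
            self.stall = 0
        else:
            self.stall += 1
            if self.stall >= self.patience and self.adjusts < self.max_adjusts:
                self.lr *= self.decay
                self.adjusts += 1
                self.stall = 0
        return self.lr
```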
## Decoder
@@ -108,3 +105,4 @@ Phoneme-level language model is inserted to beam search decoder now.
- Combine with RNN-LM
- Beam search with RNN-LM
- The code in 863_corpus is a mess. It needs to be arranged.
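For contrast with the beam search decoder mentioned above, best-path (greedy) CTC decoding is only a few lines; this is an illustrative pure-Python sketch, not the repo's decoder:

```python
def greedy_ctc_decode(frame_ids, blank=0):
    """Collapse repeated frame labels, then drop blanks (CTC best path)."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out
```

Here `frame_ids` would be the per-frame argmax over the acoustic model's log-probabilities.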