
Commit eddd255

committed: update to pytorch1.2 and python3
1 parent 90403b9 commit eddd255

34 files changed: +1367 −1422 lines

.gitignore  (+2)

@@ -3,3 +3,5 @@ __pycache__/
 
 my_863_corpus/*
 log/
+checkpoint/
+data/

README.md  (+25 −27)

@@ -1,9 +1,12 @@
-# End-to-End Automatic Speech recogniton
-This is an END-To-END system for speech recognition based on CTC implemented with pytorch.
+## Update
+Update to pytorch1.2 and python3.
+
+# CTC-based Automatic Speech Recognition
+This is a CTC-based speech recognition system built with pytorch.
 
 At present, the system only supports phoneme recognition.
 
-You can also do it at word-level, but you may get a high error rate.
+You can also run it at the word level, but you may get a high error rate.
 
 Another way is to decode with a lexicon and a word-level language model using WFST, which is not included in this system.
 

@@ -36,41 +39,35 @@ Chinese Corpus: 863 Corpus
 
 ## Install
 - Install [Pytorch](http://pytorch.org/)
-- Install [warp-ctc](https://github.com/SeanNaren/warp-ctc) and bind it to pytorch.
-Notice: If use python2, reinstall the pytorch with source code instead of pip.
-- Install pytorch audio:
-```bash
-sudo apt-get install sox libsox-dev libsox-fmt-all
-git clone https://github.com/pytorch/audio.git
-cd audio
-pip install cffi
-python setup.py install
-```
+- ~~Install [warp-ctc](https://github.com/SeanNaren/warp-ctc) and bind it to pytorch.~~
+~~Notice: If using python2, reinstall pytorch from source instead of pip.~~
+Use the pytorch1.2 built-in CTC loss (nn.CTCLoss) now.
 - Install [Kaldi](https://github.com/kaldi-asr/kaldi). We use kaldi to extract mfcc and fbank features.
-- Install [KenLM](https://github.com/kpu/kenlm). Training n-gram Languange Model if needed.
-- Install other python packages
+- Install pytorch [torchaudio](https://github.com/pytorch/audio.git) (needed when using the raw waveform as input).
+- ~~Install [KenLM](https://github.com/kpu/kenlm). Train an n-gram language model if needed.~~
+Use IRSTLM from the kaldi tools instead.
+- Install and start visdom
 ```
-pip install -r requirements.txt
+pip3 install visdom
+python -m visdom.server
 ```
-- Start visdom
+- Install other python packages
 ```
-python -m visdom.server
+pip install -r requirements.txt
 ```
 
 ## Usage
-1. Install all the things according to the Install part.
-2. Open the top script run.sh and alter the directory of data and config file.
-3. Change the $feats if you want to use fbank or mfcc and revise conf file under the directory conf.
-4. Open the config file to revise the super-parameters about everything
+1. Install all the packages according to the Install part.
+2. Revise the top script run.sh.
+3. Open the config file to revise the hyper-parameters.
 4. Run the top script with four conditions
 ```bash
 bash run.sh data_prepare + AM training + LM training + testing
 bash run.sh 1 AM training + LM training + testing
 bash run.sh 2 LM training + testing
 bash run.sh 3 testing
 ```
-LM training are not implemented yet. They are added to the todo-list.
-So only when you prepare the data, run.sh will work.
+RNN LM training is not implemented yet; it is on the todo list.
 
 ## Data Prepare
 1. Extract 39dim mfcc and 40dim fbank features from kaldi.
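The switch from warp-ctc to the built-in nn.CTCLoss mentioned in the Install hunk can be sketched as follows; the tensor shapes and the choice of 0 as the blank index are illustrative assumptions, not values taken from this repo's code.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumed): T frames, batch N, C output classes.
T, N, C = 50, 8, 40          # e.g. 39 phonemes + 1 CTC blank
S = 20                       # max target length per utterance

log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # index 0 reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, S + 1, (N,), dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)   # built into pytorch since 1.0, so warp-ctc is no longer needed
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                  # gradients flow back through log_probs
```

Inputs must be log-probabilities of shape (T, N, C); with the default reduction the result is a scalar mean negative log-likelihood.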
@@ -81,17 +78,17 @@ So only when you prepare the data, run.sh will work.
 - RNN + DNN + CTC
 RNN here can be replaced by nn.LSTM or nn.GRU
 - CNN + RNN + DNN + CTC
-CNN is use to reduce the variety of spectrum which can be caused by the speaker and environment difference.
+CNN is used to reduce spectral variability caused by speaker and environment differences.
 - How to choose
-Use add_cnn to choose one of two models. If add_cnn is True, then CNN+RNN+DNN+CTC will be chosen.
+Use add_cnn to choose between the two models. If add_cnn is True, CNN+RNN+DNN+CTC is used.
 
 ## Training:
 - initial-lr = 0.001
 - decay = 0.5
 - weight-decay = 0.005
 
 Adjust the learning rate when the dev loss has stayed around the same value ten times.
-Times of adjusting learning rate is 8 which can be alter in steps/ctc_train.py(line367).
+The learning rate is adjusted at most 8 times, which can be altered in steps/train_ctc.py (line 367).
 Optimizer is nn.optimizer.Adam with weight decay 0.005
 
 ## Decoder

@@ -108,3 +105,4 @@ Phoneme-level language model is inserted into the beam search decoder now.
 - Combine with RNN-LM
 - Beam search with RNN-LM
 - The code in 863_corpus is a mess. It needs arranging.
+
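The schedule described in the Training section (halve the learning rate after the dev loss stalls for ten epochs, at most 8 times) can be sketched in plain python; the function name, the improvement tolerance, and the return value are assumptions for illustration, not the repo's actual steps/train_ctc.py logic.

```python
def plateau_schedule(dev_losses, init_lr=0.001, decay=0.5, patience=10, max_adjust=8):
    """Return the learning rate used at each epoch (hypothetical helper).

    Halve lr after `patience` epochs without sufficient improvement of the
    dev loss, at most `max_adjust` times (8, as the README states).
    """
    lr, best, stall, adjusts, history = init_lr, float('inf'), 0, 0, []
    for loss in dev_losses:
        history.append(lr)
        if loss < best - 1e-3:        # "around a specific loss" -> small tolerance (assumed)
            best, stall = loss, 0
        else:
            stall += 1
            if stall >= patience and adjusts < max_adjust:
                lr *= decay           # decay = 0.5 as in the README
                adjusts += 1
                stall = 0
    return history

# A dev loss that never improves triggers a halving every 10 epochs:
lrs = plateau_schedule([1.0] * 21)
```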

requirements.txt  (+1 −2)

@@ -1,4 +1,3 @@
-h5py
 numpy
 scipy
-librosa
+visdom

File renamed without changes.

timit/conf/ctc_config.yaml  (+60, new file)

@@ -0,0 +1,60 @@
+#exp name and save dir
+exp_name: 'ctc_fbank_cnn'
+checkpoint_dir: 'checkpoint/'
+
+#Data
+vocab_file: 'data/units'
+train_scp_path: 'data/train/fbank.scp'
+train_lab_path: 'data/train/phn_text'
+valid_scp_path: 'data/dev/fbank.scp'
+valid_lab_path: 'data/dev/phn_text'
+left_ctx: 0
+right_ctx: 2
+n_skip_frame: 2
+n_downsample: 2
+num_workers: 1
+shuffle_train: True
+feature_dim: 81
+output_class_dim: 39
+mel: False
+feature_type: "fbank"
+
+#Model
+rnn_input_size: 243
+rnn_hidden_size: 384
+rnn_layers: 4
+rnn_type: "nn.LSTM"
+bidirectional: True
+batch_norm: True
+drop_out: 0.2
+
+#CNN
+add_cnn: True
+layers: 2
+channel: "[(1, 32), (32, 32)]"
+kernel_size: "[(3, 3), (3, 3)]"
+stride: "[(1, 2), (2, 2)]"
+padding: "[(1, 1), (1, 1)]"
+pooling: "None"
+batch_norm: True
+activation_function: "relu"
+
+#[Training]
+use_gpu: True
+init_lr: 0.001
+num_epoches: 500
+end_adjust_acc: 2
+lr_decay: 0.5
+batch_size: 8
+weight_decay: 0.0005
+seed: 1
+verbose_step: 50
+
+#[test]
+test_scp_path: 'data/test/fbank.scp'
+test_lab_path: 'data/test/phn_text'
+decode_type: "Greedy"
+beam_width: 10
+lm_alpha: 0.1
+lm_path: 'data/lm_phone_bg.arpa'
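Several fields in this new config (channel, kernel_size, stride, padding, pooling) are strings holding Python literals rather than native YAML lists. A minimal sketch of how such fields might be read, assuming PyYAML is available (the repo's actual loader may differ):

```python
import ast

import yaml

# Inline excerpt of the config above, so the sketch is self-contained.
text = """
rnn_input_size: 243
add_cnn: True
channel: "[(1, 32), (32, 32)]"
pooling: "None"
"""

conf = yaml.safe_load(text)
# The quoted fields arrive as plain strings; ast.literal_eval turns them
# back into real Python objects safely (no arbitrary code execution).
channels = ast.literal_eval(conf['channel'])   # -> list of (in, out) tuples
pooling = ast.literal_eval(conf['pooling'])    # "None" -> None
```

Keeping the tuple lists as quoted strings sidesteps YAML's lack of a tuple type, at the cost of a second parsing step.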

timit/conf/ctc_model_setting_fbank.conf  (−46)

This file was deleted.

timit/conf/ctc_model_setting_mfcc.conf  (−46)

This file was deleted.

timit/conf/fbank.conf  (+2 −1)

@@ -1,3 +1,4 @@
 --window-type=hamming
---num-mel-bins=40
+--num-mel-bins=80
+--use-energy
 
