support ivector training in pytorch model #3969
Conversation
Thanks a lot for reviewing!
On Tue, Mar 3, 2020 at 9:47 AM, Fangjun Kuang commented on this pull request, in egs/aishell/s10/local/run_ivector_common.sh:
> + ${temp_data_root}/${train_set}_sp_hires_max2 \
+ exp/nnet3${nnet3_affix}/extractor $ivectordir
+
+fi
+
+if [[ $stage -le 8 ]]; then
+ # Also extract iVectors for the test data, but in this case we don't need the speed
+ # perturbation (sp) or small-segment concatenation (comb).
+ for data in dev test; do
+ steps/online/nnet2/extract_ivectors_online.sh --cmd "$train_cmd" --nj 10 \
+ data/${data}_hires exp/nnet3${nnet3_affix}/extractor \
+ exp/nnet3${nnet3_affix}/ivectors_${data}_hires
+ done
+fi
+
+exit 0;
this file should end with a newline.
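The missing final newline flagged above is easy to check for mechanically. A minimal sketch in Python (the helper name and the idea of auto-appending are mine, not part of the PR):

```python
def ensure_trailing_newline(path):
    """Return True if a newline was appended, False if the file already ended with one."""
    with open(path, "rb") as f:
        data = f.read()
    if data.endswith(b"\n"):
        return False
    # Append the missing final newline in binary append mode.
    with open(path, "ab") as f:
        f.write(b"\n")
    return True
```

A pre-commit hook running something like this over shell scripts would catch the issue before review.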
Guys, I just want to mention something. I think it would be better if we shifted (not necessarily right now) to exposing the Kaldi egs as a DataLoader instead of as a Dataset. That way we could use the existing command-line tools for things like shuffling and time-shifting, and it would be much more efficient for I/O. The idea is that the dataloader would, on every epoch, create a suitable command line and read from it as a pipe. If it were a distributed data-loader, probably the easiest way to do it would be to make sure there is an appropriately split scp file and give each loader the appropriate one. We could use the scripts in #3765 to generate the scp files. I want to merge this soon; one option is to merge into pybind11 first to test it.
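The "read from a pipe on every epoch" idea could be sketched roughly as below. This is only an illustration of the shape, not code from the PR: the class name is hypothetical, and the command template would in practice invoke real Kaldi tools (e.g. an egs-copy/shuffle pipeline over a per-worker split of the scp file) rather than the placeholder used here.

```python
import subprocess

class PipeEgsLoader:
    """Iterable that, on every epoch, builds a command line and
    streams examples from the command's stdout as a pipe."""

    def __init__(self, cmd_template, epoch=0):
        # cmd_template is re-formatted each epoch, e.g. so the
        # shuffling tool can vary its random seed per epoch.
        self.cmd_template = cmd_template
        self.epoch = epoch

    def __iter__(self):
        cmd = self.cmd_template.format(epoch=self.epoch)
        # Let the existing command-line tools do shuffling and
        # time-shifting; we just consume their output line by line.
        with subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE) as p:
            for line in p.stdout:
                yield line.rstrip(b"\n")
        self.epoch += 1
```

In a distributed setting, each rank would be handed the command for its own split of the scp file, so the pipe naturally shards the data.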
On Tue, Mar 3, 2020 at 10:11 AM, fanlu commented on this pull request, in egs/aishell/s10/chain/feat_dataset.py:
>
with open(feats_scp, 'r') as f:
for line in f:
split = line.split()
assert len(split) == 2
- items.append(split)
-
- self.items = items
+ uttid, rxfilename =split
OK
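For context, the parsing pattern being discussed in the hunk above, unpacking each scp line into an uttid and an rxfilename, can be written as a small standalone helper. This is a sketch in the spirit of feat_dataset.py, not the file's actual code:

```python
def read_feats_scp(feats_scp):
    """Parse a Kaldi feats.scp file into (uttid, rxfilename) pairs.

    Each line has the form '<uttid> <rxfilename>', where rxfilename
    is typically an ark path with a byte offset, e.g. 'foo.ark:12'.
    """
    items = []
    with open(feats_scp, 'r') as f:
        for line in f:
            split = line.split()
            assert len(split) == 2
            uttid, rxfilename = split
            items.append((uttid, rxfilename))
    return items
```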
@csukuangfj I have fixed the code with your suggestion. Please have a look.
OK, I'll run it after it's merged.
I'll take a look at that PR and start to do this.
OK, merging.
Great, thanks! Firstly, just doing the merge and figuring out how to use
those newer scripts to prepare the egs would be a great start.
Updated with the latest results.