
training with multiple machines #3966

Merged (1 commit, Mar 2, 2020)

Conversation

@qindazhu (Contributor) commented Mar 1, 2020

  • Remove run_cleanup_segmentation.sh, as it does not improve the results.
  • Remove duplicate code and support single-GPU, multi-GPU, and multi-machine training in one script (see the sketch after this list).
  • Add online-cmvn, as @fanlu showed before.
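As a rough illustration of the "one script" point above (this is not the actual code in this PR; the function name and the environment-variable convention are assumptions), single-GPU, multi-GPU, and multi-machine runs can share one entry point by initializing torch.distributed only when a launcher has set the usual environment variables:

```python
# Hypothetical sketch, not the script in this PR: one training entry point
# covering single-GPU, multi-GPU, and multi-machine runs.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_model_for_training(model, device_id):
    # A launcher (one process per GPU, possibly spread over several machines)
    # sets WORLD_SIZE/RANK/MASTER_ADDR/MASTER_PORT; if they are absent, this
    # is a plain single-GPU run and no process group is needed.
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    device = torch.device('cuda', device_id)
    model = model.to(device)
    if world_size > 1:
        dist.init_process_group(backend='nccl', init_method='env://')
        model = DDP(model, device_ids=[device_id])
    return model
```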

Result

|          | tdnn_1c_rd_rmc_rng_without_cleanup | tdnn_1c_rd_rmc_rng |
|----------|------------------------------------|--------------------|
| dev_cer  | 5.92                               | 5.99               |
| dev_wer  | 13.71                              | 13.86              |
| test_cer | 7.03                               | 7.08               |
| test_wer | 15.35                              | 15.72              |

|          | TDNN-F (PyTorch, Adam, fanlu's previous result) | TDNN-F (PyTorch, Adam, this pull request with 4 GPUs) |
|----------|-------------------------------------------------|--------------------------------------------------------|
| dev_cer  | 6.16                                            | 6.29                                                   |
| dev_wer  | 14.01                                           | 14.10                                                  |
| test_cer | 7.31                                            | 7.57                                                   |
| test_wer | 15.97                                           | 15.80                                                  |

@danpovey (Contributor) commented Mar 1, 2020

OK, I assume this is OK to merge? Any objections?

```diff
@@ -28,7 +29,11 @@ def main():
         logging.warning('No GPU detected! Use CPU for inference.')
         device = torch.device('cpu')
     else:
-        device = torch.device('cuda', args.device_id)
+        devices = allocate_gpu_devices(1)
```
Contributor:

Should we follow the kaldi-style SelectGpu() and select the GPU that has the largest available memory?

@qindazhu (Contributor Author):

I prefer to do this with --q options in queue.pl, considering that GPUs on a single machine nowadays are commonly of the same type (and memory), and we assume they are in Exclusive_Process mode.
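For reference, kaldi's SelectGpu() picks the device with the most free memory. A rough PyTorch equivalent could look like the sketch below (assuming a PyTorch version that provides torch.cuda.mem_get_info; allocate_gpu_devices in this PR does not necessarily work this way):

```python
# Sketch of a kaldi-style SelectGpu(): choose the CUDA device with the most
# free memory. Assumes torch.cuda.mem_get_info is available; this is an
# illustration, not the implementation of allocate_gpu_devices in this PR.
import torch


def select_freest_gpu():
    free_and_id = []
    for i in range(torch.cuda.device_count()):
        free_bytes, _total_bytes = torch.cuda.mem_get_info(i)
        free_and_id.append((free_bytes, i))
    _, device_id = max(free_and_id)  # device with the largest free memory
    return torch.device('cuda', device_id)
```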

```diff
@@ -28,7 +29,11 @@ def main():
         logging.warning('No GPU detected! Use CPU for inference.')
         device = torch.device('cpu')
     else:
-        device = torch.device('cuda', args.device_id)
+        devices = allocate_gpu_devices(1)
```
Contributor:

What if the user provides device_id on the command line?

@qindazhu (Contributor Author):

Let me do this in another PR; I think what we need is device_ids instead of device_id. Actually, I haven't thought this through completely yet, because:

  • Multiple GPU ids are required when training with multiple GPUs on a single machine; that's why we need device_ids.

  • Assigning GPU ids when training with multiple machines is relatively complex for users.

@danpovey any comments about this?
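For illustration only (this option does not exist in the PR; the flag name and semantics here are hypothetical), a --device-ids flag could look like:

```python
# Hypothetical sketch of the --device-ids option discussed above;
# not part of this PR.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--device-ids', type=str, default=None,
                    help='comma-separated CUDA device ids, e.g. "0,1,2,3"; '
                         'if omitted, devices are allocated automatically')
args = parser.parse_args()

if args.device_ids is not None:
    device_ids = [int(s) for s in args.device_ids.split(',')]
else:
    device_ids = None  # fall back to automatic allocation
```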

@csukuangfj (Contributor) commented Mar 2, 2020

> Remove run_cleanup_segmentation.sh as it does help to the WER/CER

does help OR does NOT help?

"improve CER/WER" may be better than "help to CER/WER".

@qindazhu force-pushed the haowen-ddp-multiple-machine branch from 99ee14b to a0343b6 on March 2, 2020, 03:31
@qindazhu (Contributor Author) commented Mar 2, 2020

It does not help. Thanks, I've updated the text.

We can't say "improve" here; adding or removing the cleanup doesn't make a difference, at least according to my experiment.


@csukuangfj (Contributor):

I just feel that it's unnatural to say help + to + noun

@qindazhu (Contributor Author) commented Mar 2, 2020

OK, updated.

@danpovey merged commit 756f490 into kaldi-asr:pybind11 on Mar 2, 2020
@danpovey (Contributor) commented Mar 2, 2020

Merged, thanks!!

@fanlu commented Mar 2, 2020

I have implemented online-ivector in the PyTorch model, but I get no gain; in fact, the result is worse when using online-ivector. @qindazhu @csukuangfj please review this code.
This code computes nnet_output from the input chunk by chunk, in inference.py:

```python
    for batch_idx, batch in enumerate(dataloader):
        key_list, padded_feat, output_len_list, padded_ivector, ivector_len_list = batch
        padded_feat = padded_feat.to(device)
        padded_ivector = padded_ivector.to(device)
        with torch.no_grad():
            nnet_outputs = []
            input_num_frames = padded_feat.shape[1] + 2 - args.model_left_context - args.model_right_context
            # chunk_len of 17 output frames, same as kaldi
            for i in range(0, output_len_list[0], 17):
                # e.g. input_len 418 -> i in [0, 17, 34, 51, 68, 85, 102, 119, 136]
                first_output = i * 3  # subsampling factor 3
                last_output = min(first_output + (17 - 1) * 3, input_num_frames)
                first_input = first_output
                last_input = last_output + args.model_left_context + args.model_right_context
                input_x = padded_feat[:, first_input:last_input + 1, :]
                # one i-vector per 10 frames; take the one at the chunk's middle frame
                ivector_index = (first_output + last_output) // 2 // 10
                input_ivector = padded_ivector[:, ivector_index, :]
                # repeat the chunk's i-vector across all its frames and append to the features
                feat = torch.cat((input_x, input_ivector.repeat((1, input_x.shape[1], 1))), dim=-1)
                nnet_output_temp, _ = model(feat)
                nnet_outputs.append(nnet_output_temp)
            nnet_output = torch.cat(nnet_outputs, dim=1)
```
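If I read the indexing right: with a subsampling factor of 3, output frame i corresponds to input frame 3 * i, so each 17-frame output chunk spans (17 - 1) * 3 + 1 = 49 input frames plus the model's left and right context, and the single i-vector taken at the chunk's middle input frame (one i-vector per 10 frames, hence the // 10) is repeated across the whole chunk.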

This is the result:

| exp     | ddp     | test cer | test wer | dev cer | dev wer | global objf | validation objf | output-affine |
|---------|---------|----------|----------|---------|---------|-------------|-----------------|---------------|
| ivector | l2 5e-5 | 8.27     | 17.12    | 7.10    | 15.20   | -0.050093   | -0.065751       | 143.3         |

@csukuangfj (Contributor):

Could you please open a pull request so that we can view the whole picture?

@fanlu commented Mar 2, 2020

OK
