
WIP: begin to add CTC training with kaldi pybind and PyTorch. #3947

Open · wants to merge 11 commits into base: pybind11

Conversation

@csukuangfj (Contributor) commented Feb 20, 2020

Work in progress.

@danpovey (Contributor) left a comment

A few comments. Bear in mind that my intention when implementing CTC is to allow the supervision to be generic FSTs, not limited to linear sequences. This may already be what you are doing. This will allow dictionaries with multiple entries for words, for instance, and optional silence. The forward-backward code would do the same as our current numerator forward-backward code, but I want to implement it on the GPU. Meixu was going to look into GPU programming for this. I could help myself as well; I wrote Kaldi's denominator forward-backward code.

import argparse

parser = argparse.ArgumentParser(description='convert text to labels')

parser.add_argument('--lexicon-filename', dest='lexicon_filename', type=str)
parser.add_argument('--tokens-filename', dest='tokens_filename', type=str)
Contributor:

Please use the standard OpenFST symbol-table format for these tokens.
I'm open to other opinions, but since we'll probably have these symbols present in FSTs, I think symbol 0 should be reserved for <eps> and the blank symbol should be 1; we can just apply an offset of 1 when interpreting the nnet outputs.
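(A minimal sketch of that convention, assuming a tokens.txt in OpenFST symbol-table format with <eps> at 0 and <blk> at 1; everything here is illustrative, not code from this PR:)

```python
# Read an OpenFST-style symbol table: one "symbol integer" pair per line.
def read_symbol_table(filename):
    sym2id = {}
    with open(filename) as f:
        for line in f:
            sym, idx = line.split()
            sym2id[sym] = int(idx)
    return sym2id

sym2id = read_symbol_table('tokens.txt')
assert sym2id['<eps>'] == 0 and sym2id['<blk>'] == 1

# With the offset-of-1 convention, nnet output column k corresponds to
# symbol id k + 1, so output column 0 is the blank.
def output_column_to_symbol_id(k):
    return k + 1
```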

Contributor:

... if the format is already the symbol-table format, bear in mind that the order of lines is actually arbitrary; what matters is the integer there.

Contributor (Author):

I reuse the notation from EESEN (https://github.com/srvk/eesen), which refers to
phones.txt as tokens.txt.

tokens.txt is actually a phone symbol table, with

<eps> 0
<blk> 1
(other phones follow)

The code here does not impose any constraint on the order of lines; what
matters is only the integer of each symbol. The first two integers, 0 and 1,
are reserved: 0 is reserved for <eps>, and here I reserve 1 for
the blank symbol.

The script generating tokens.txt already takes this constraint into account.


Since there is a T in TLG.fst, I keep using tokens.txt here instead
of phones.txt. I can switch to phones.txt if you think that is more natural
in Kaldi.

egs/aishell/s10b/local/token_to_fst.py (resolved)
@csukuangfj (Contributor, Author):

I intend to use Baidu's warp-ctc (https://github.com/baidu-research/warp-ctc)
or PyTorch's built-in CTCLoss (https://pytorch.org/docs/stable/nn.html#torch.nn.CTCLoss).

Neither of them supports words with multiple pronunciations. I currently use only
the pronunciation of a word from its first appearance and ignore the alternative pronunciations.

Can we first implement a baseline that considers only one pronunciation?
This approach is the easiest, and we can reuse existing APIs to compute the CTC loss.
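(For reference, a minimal usage sketch of torch.nn.CTCLoss with blank=1, matching the tokens.txt convention above; all shapes and sizes here are made up:)

```python
import torch
import torch.nn as nn

T, N, C = 50, 4, 218   # frames, batch size, nnet output dim (hypothetical)
S = 20                 # max target length in the batch (hypothetical)

ctc_loss = nn.CTCLoss(blank=1)  # blank symbol is 1, per tokens.txt

log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # (T, N, C)
targets = torch.randint(2, C, (N, S), dtype=torch.long)  # exclude <eps>=0, <blk>=1
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, S + 1, (N,), dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```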

@danpovey (Contributor) commented Feb 20, 2020 via email

@csukuangfj (Contributor, Author):

The loss drops from 83 to 4.8 after 100 batches and then stops decreasing. I am trying
to find out why.

@danpovey (Contributor) commented Feb 21, 2020 via email

@csukuangfj (Contributor, Author):

@danpovey thanks

My current network architecture is

layer0: input-batchnorm

layer1: lstm
layer2: projection + tanh

layer3: lstm
layer4: projection + tanh

layer5: lstm
layer6: projection + tanh

layer7: lstm
layer8: projection + tanh

layer9: prefinal-affine
layer10: log-softmax
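(As a concrete reference, a minimal PyTorch sketch of the stack above; all dimensions are hypothetical and this is not the PR's actual code:)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LstmProjModel(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=512, proj_dim=256, num_tokens=218):
        super().__init__()
        self.input_bn = nn.BatchNorm1d(feat_dim)          # layer0
        self.lstms = nn.ModuleList()
        self.projs = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(4):                                # layers 1-8
            self.lstms.append(nn.LSTM(in_dim, hidden_dim, batch_first=True))
            self.projs.append(nn.Linear(hidden_dim, proj_dim))
            in_dim = proj_dim
        self.prefinal = nn.Linear(proj_dim, num_tokens)   # layer9

    def forward(self, x):                                 # x: (N, T, feat_dim)
        x = self.input_bn(x.transpose(1, 2)).transpose(1, 2)
        for lstm, proj in zip(self.lstms, self.projs):
            x, _ = lstm(x)
            x = torch.tanh(proj(x))                       # projection + tanh
        return F.log_softmax(self.prefinal(x), dim=-1)    # layer10
```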

Do you mean I should replace layer9 (prefinal-affine) with an LSTM layer?

@danpovey (Contributor) commented Feb 21, 2020 via email

@csukuangfj (Contributor, Author):

@danpovey
Do we need to normalize the coefficients?

That is, to replace

  • [-1, 0, 1]
  • [1, 0, -2, 0, 1]

with

  • [-0.5, 0, 0.5]
  • [0.25, 0, -0.5, 0, 0.25]
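(If these are finite-difference kernels applied along the time axis, the normalized versions could be applied with conv1d as below; a sketch with made-up shapes. Note that conv1d computes cross-correlation, so the kernels are used unreversed:)

```python
import torch
import torch.nn.functional as F

delta1 = torch.tensor([-0.5, 0.0, 0.5])              # normalized [-1, 0, 1]
delta2 = torch.tensor([0.25, 0.0, -0.5, 0.0, 0.25])  # normalized [1, 0, -2, 0, 1]

feats = torch.randn(1, 1, 100)                       # (batch, channel, time)
d1 = F.conv1d(feats, delta1.view(1, 1, -1), padding=1)  # same length as input
d2 = F.conv1d(feats, delta2.view(1, 1, -1), padding=2)
```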

@danpovey (Contributor) commented Feb 24, 2020 via email

We need to replace the LSTM with a TDNN since it is difficult
to get the LSTM to converge.
@csukuangfj (Contributor, Author):

Still a work in progress.

Now if you use only 8 wave files for training and decoding, you can
get a CER as low as 0.04, which verifies that the pipeline is working.

I find that the LSTM model has difficulty converging. I'm going to replace it with TDNN-F.

@csukuangfj (Contributor, Author):

Decoding results for the current pull request are as follows:

|          | this pull request | haowen's PyTorch (#3925) | fanlu's PyTorch (#3940) with online cmvn | haowen's kaldi (#3925) |
|----------|-------------------|--------------------------|------------------------------------------|------------------------|
| test CER | 12.91             | 7.86                     | 7.31                                     | 7.08                   |
| test WER | 21.90             | 16.56                    | 15.97                                    | 15.72                  |
| dev CER  | 11.81             | 6.47                     | 6.16                                     | 5.99                   |
| dev WER  | 20.46             | 14.45                    | 14.01                                    | 13.86                  |

The first column uses nearly the same TDNN-F model architecture as the remaining columns,
except that it has no xent regularizer. In addition, the first column uses CTC loss
instead of chain loss.

It takes about 59 minutes per epoch, and the decoding results are for the 12th epoch.
The CTC loss value is 0.086.

There is still a big gap in CER/WER between this pull request and the chain model.
I will add an LSTM layer before the output affine layer, add spectral augmentation,
and run the training again.

@danpovey (Contributor):

@csukuangfj If you are implementing CTC-CRF, and if I understand it correctly from this paper
http://oa.ee.tsinghua.edu.cn/~ouzhijian/pdf/ctc-crf.pdf
it is the same as LF-MMI except that there is no context dependency, the self-loop (blank symbol) is shared between all phones, and there is no optional silence. The current tree-building mechanism in Kaldi doesn't allow for one pdf to be shared and the others not; however, I do remember doing experiments with making the blank shared vs. not shared (I don't recall how), and not shared was a bit better.
It should be possible to use the same forward-backward code for both numerator and denominator, for CTC-CRF as for LF-MMI.

@csukuangfj (Contributor, Author):

@danpovey

thanks. I am still learning and have read the implementation of Kaldi's
denominator computation. I find that the denominator part of CTC-CRF is adapted
from Kaldi's code. It takes CTC-CRF more than 3 days on the AIShell dataset
to reach a reasonable CER, which is too slow to be acceptable. I am trying to figure
out the reason and to reuse as much code from Kaldi as possible.

@csukuangfj (Contributor, Author) commented Mar 11, 2020

@danpovey
I've read the denominator implementation of CTC-CRF and find that it is
a re-implementation of the chain denominator computation, with the following differences:

  • It does not perform 100 iterations over the FST to get the initial probability for each state;
    only the start states take their probability from the state weight, all other states have
    initial probability 0, and the initial state probabilities are not normalized.

  • It has no leaky HMM.

  • It performs the computation in log space.

Unlike in chain training, examples in the same batch of CTC training
do not have the same number of frames, so DenominatorComputation in src/chain
cannot be used directly for CTC training.

I would like to implement a class CtcDenominatorComputation that can handle
examples with different sequence lengths in the same batch.

Would the above differences (or tricks in Kaldi) make a significant impact on the final training result?
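(To illustrate the log-space point above: a minimal log-space forward recursion over a dense transition matrix, using logsumexp for numerical stability; real denominator FSTs are sparse, so this is only a sketch:)

```python
import torch

def forward_logspace(log_obs, log_trans, log_init):
    # log_obs: (T, S) per-frame log-likelihoods of each state
    # log_trans: (S, S) log transition weights; log_init: (S,) initial log-probs
    alpha = log_init + log_obs[0]
    for t in range(1, log_obs.size(0)):
        # alpha[i] + log_trans[i, j], summed (in log space) over source states i
        alpha = torch.logsumexp(alpha.unsqueeze(1) + log_trans, dim=0) + log_obs[t]
    return torch.logsumexp(alpha, dim=0)  # total log-likelihood
```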

@danpovey (Contributor):

@csukuangfj I don't think those things would make a significant difference. Also I don't think that name is really appropriate. Firstly it isn't CTC (the special characteristic of CTC is that it is normalized to 1 so needs no denominator computation). Regarding the different sequence lengths... it would be interesting to allow the denominator computation to handle different sequence lengths (the numerator as well). The difficulty is doing this efficiently. GPU programming is only efficient if you can form suitable batches. What I was thinking was, it might be better to still have batches with fixed-size elements, but instead focus on allowing the numerator computation to work with that. If we had to break sentences into pieces, we could either handle it by constraining to an FST like Kaldi's current implementation of chain training, or use the regular CTC-type FST but allow all states to be initial and final, so the sequence can start and end in the middle.
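(For reference, the "regular CTC-type FST" over a linear label sequence corresponds to the standard blank-interleaved state expansion; a tiny sketch, using blank id 1 per the tokens.txt convention above:)

```python
# Linear CTC topology for a label sequence: a blank state between and
# around the labels.  Allowing all of these states to be initial/final
# would let a sequence start and end mid-sentence, as suggested above.
def ctc_states(labels, blank=1):
    states = [blank]
    for label in labels:
        states.extend([label, blank])
    return states

# e.g. ctc_states([7, 9]) -> [1, 7, 1, 9, 1]
```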

@csukuangfj (Contributor, Author):

I am trying to sort examples by their sequence lengths so that there is as little padding
as possible within each batch.
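(A minimal sketch of that idea, sorting utterances by length and batching neighbors; names and shapes are hypothetical:)

```python
def make_batches(utts, batch_size):
    # utts: list of (utt_id, num_frames) pairs
    utts = sorted(utts, key=lambda u: u[1])  # ascending by length
    return [utts[i:i + batch_size] for i in range(0, len(utts), batch_size)]
```

This way each batch contains utterances of similar length, so the padding needed to equalize lengths within a batch stays small.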


What I was thinking was, it might be better to still have batches with fixed-size elements, but instead focus on allowing the numerator computation to work with that.

I'm still learning the internals of chain training. We can come back to this once the handling
of different sequence lengths is finished; at the least, it can serve as a baseline.

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale bot on the loose label Jun 19, 2020
stale bot commented Jul 19, 2020

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

@stale stale bot closed this Jul 19, 2020
@kkm000 kkm000 reopened this Jul 19, 2020
@stale stale bot removed the stale Stale bot on the loose label Jul 19, 2020
@kkm000 kkm000 added the stale-exclude Stale bot ignore this issue label Jul 21, 2020