WIP: begin to add CTC training with kaldi pybind and PyTorch. #3947
base: pybind11
Conversation
A few comments. Bear in mind that my intention when implementing CTC is to allow the supervision to be generic FSTs, not limited to linear sequences. This may already be what you are doing. This will allow dictionaries with multiple entries for words, for instance, and optional silence. The forward-backward code would do the same as our current numerator forward-backward code, but I want to implement it on the GPU. Meixu was going to look into GPU programming for this. I could help with this myself as well; I wrote Kaldi's denominator forward-backward code.
parser = argparse.ArgumentParser(description='convert text to labels')
parser.add_argument('--lexicon-filename', dest='lexicon_filename', type=str)
parser.add_argument('--tokens-filename', dest='tokens_filename', type=str)
Please use the standard OpenFST symbol-table format for these tokens.
I'm open to other opinions, but since we'll probably have these symbols present in FSTs I think symbol 0 should be reserved for <eps> and <blk> should be 1, and we can just apply an offset of 1 when interpreting the nnet outputs.
... if the format is already the symbol-table format, bear in mind that the order of lines is actually arbitrary; what matters is the integer there.
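For illustration, a minimal sketch of reading such a symbol table in the standard OpenFST text format and applying the offset of 1 when interpreting the nnet outputs; the function names and file handling here are assumptions, not code from this pull request:

# Sketch only: read a symbol table in the standard OpenFST text format.
# The order of lines is arbitrary; only the integer id matters.
# '<eps>' is assumed to be 0 and '<blk>' 1, so nnet outputs are offset by 1.

def read_symbol_table(filename):
    """Return a dict mapping integer id -> symbol."""
    id2sym = {}
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip empty or malformed lines
            sym, idx = fields[0], int(fields[1])
            id2sym[idx] = sym
    return id2sym


def nnet_output_to_symbol(output_index, id2sym):
    """Nnet output 0 corresponds to <blk> (symbol id 1), so add 1."""
    return id2sym[output_index + 1]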
I reuse the notation from EESEN (https://github.com/srvk/eesen), which calls phones.txt tokens.txt. tokens.txt is actually a phone symbol table, with

<eps> 0
<blk> 1
other phones

The code here does not pose any constraint on the order of lines. What matters is only the integer of each symbol. The first two integers, 0 and 1, are reserved. I think 0 is reserved for <eps>; here I reserve 1 for the blank symbol. The script generating tokens.txt takes this constraint into account.

Since there is a T in TLG.fst, I keep using tokens.txt here instead of phones.txt. I can switch to phones.txt if you think that is more natural in kaldi.
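For illustration, a minimal sketch of writing tokens.txt under this constraint; the phone list argument is a hypothetical placeholder, not the script in this pull request:

# Sketch only: write tokens.txt with <eps> fixed to 0 and <blk> fixed to 1,
# followed by the remaining phones in any order.

def write_tokens(phones, filename='tokens.txt'):
    with open(filename, 'w', encoding='utf-8') as f:
        f.write('<eps> 0\n')
        f.write('<blk> 1\n')
        for i, phone in enumerate(phones, start=2):
            f.write('{} {}\n'.format(phone, i))

# e.g. write_tokens(['a', 'b', 'c'])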
I am intending to use Baidu's warp-ctc (https://github.com/baidu-research/warp-ctc) or PyTorch's builtin CTCLoss (https://pytorch.org/docs/stable/nn.html#torch.nn.CTCLoss). Neither of them supports words with multiple pronunciations. I currently use only the pronunciation of a word when it first appears and ignore other alternative pronunciations. Can we first implement a baseline that considers only one pronunciation? This approach is the easiest one, and we can reuse existing APIs to compute the CTC loss.
Sure, implementing a baseline is fine.
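For reference, a minimal sketch of such a baseline using PyTorch's builtin CTCLoss; the shapes and dummy tensors here are assumptions for illustration, not the code in this pull request:

import torch
import torch.nn as nn

# Sketch only: nnet output index 0 is treated as the blank symbol here.
ctc_loss = nn.CTCLoss(blank=0, reduction='mean', zero_infinity=True)

T, N, C = 50, 4, 30  # frames, batch size, number of output units
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)  # (T, N, C)
targets = torch.randint(1, C, (N, 20), dtype=torch.long)  # label ids, blank excluded
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, 20, (N,), dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()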
The loss drops from 83 to 4.8 after 100 batches and stops decreasing. I am trying to find out the reason.
For the CTC system you'll normally want to end with an LSTM layer so it
can make the output spiky.
@danpovey thanks. My current network architecture is

layer0: input-batchnorm
layer1: lstm
layer2: projection + tanh
layer3: lstm
layer4: projection + tanh
layer5: lstm
layer6: projection + tanh
layer7: lstm
layer8: projection + tanh
layer9: prefinal-affine
layer10: log-softmax

Do you mean I should replace layer9: prefinal-affine with an LSTM layer?
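For context, a rough PyTorch sketch of the stack described above; the dimensions are made-up placeholders, not the actual configuration used in this pull request:

import torch
import torch.nn as nn

class CtcLstmModel(nn.Module):
    # Sketch only: batchnorm on the input, then 4 x (LSTM -> projection -> tanh),
    # then a final affine layer followed by log-softmax.
    def __init__(self, feat_dim=40, hidden_dim=512, proj_dim=256, num_classes=100):
        super().__init__()
        self.input_batchnorm = nn.BatchNorm1d(feat_dim)
        self.lstms = nn.ModuleList()
        self.projections = nn.ModuleList()
        input_dim = feat_dim
        for _ in range(4):
            self.lstms.append(nn.LSTM(input_dim, hidden_dim, batch_first=True))
            self.projections.append(nn.Linear(hidden_dim, proj_dim))
            input_dim = proj_dim
        self.prefinal_affine = nn.Linear(proj_dim, num_classes)

    def forward(self, x):
        # x: (N, T, feat_dim)
        x = self.input_batchnorm(x.permute(0, 2, 1)).permute(0, 2, 1)
        for lstm, proj in zip(self.lstms, self.projections):
            x, _ = lstm(x)
            x = torch.tanh(proj(x))
        x = self.prefinal_affine(x)
        return nn.functional.log_softmax(x, dim=-1)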
No, it should be OK as long as there are LSTM layers in there. You could try removing the prefinal layer, though.
You may have to play with the l2 and learning rates a bit.
@danpovey do we need to *normalize* the coefficients? That is, to replace

- [-1, 0, 1]
- [1, 0, -2, 0, 1]

with

- [-0.5, 0, 0.5]
- [0.25, 0, -0.5, 0, 0.25]
Normalization is not necessary because this component would typically be
followed by a batchnorm component.
BTW, regarding the Conv1d-type layout, make sure it is documented.
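For illustration, a rough sketch of applying such fixed coefficients with a Conv1d-type layout (a per-feature grouped convolution over time); this is an assumed formulation for clarity, not necessarily how this pull request implements it:

import torch
import torch.nn.functional as F

def add_deltas(feats):
    # Sketch only. feats: (N, T, C) -> returns (N, T, 3 * C) with the original
    # features, delta and delta-delta appended. Coefficients are unnormalized;
    # a following batchnorm component makes the scaling irrelevant.
    N, T, C = feats.shape
    x = feats.permute(0, 2, 1)  # (N, C, T) for Conv1d-style filtering

    delta_kernel = torch.tensor([-1.0, 0.0, 1.0]).view(1, 1, 3).repeat(C, 1, 1)
    delta2_kernel = torch.tensor([1.0, 0.0, -2.0, 0.0, 1.0]).view(1, 1, 5).repeat(C, 1, 1)

    delta = F.conv1d(x, delta_kernel, padding=1, groups=C)
    delta2 = F.conv1d(x, delta2_kernel, padding=2, groups=C)

    out = torch.cat([x, delta, delta2], dim=1)  # (N, 3 * C, T)
    return out.permute(0, 2, 1)                 # back to (N, T, 3 * C)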
We need to replace the LSTM with a TDNN since the LSTM is difficult to get to converge.
Still work in progress. Now, if you use only 8 wave files for training and decoding, you can … I find that the LSTM model is difficult to converge. I'm going to replace it with TDNN-F.
Decode results for the current pull request are as follows.

The first column uses nearly the same TDNN-F model architecture as the remaining columns. It takes about 59 minutes per epoch, and the decode results are for the 12th epoch. There is still a big gap in CER/WER between this pull request and the chain model.
@csukuangfj If you are implementing CTC-CRF, and if I understand it correctly from this paper …
Thanks. I am still in learning mode and have read the implementation of Kaldi's …
@danpovey
Different from the chain training part, examples in the same batch of CTC training … I would like to implement a class … Would the above differences (or tricks in Kaldi) make a significant impact on the final training result?
@csukuangfj I don't think those things would make a significant difference. Also I don't think that name is really appropriate. Firstly it isn't CTC (the special characteristic of CTC is that it is normalized to 1 so needs no denominator computation).

Regarding the different sequence lengths... it would be interesting to allow the denominator computation to handle different sequence lengths (the numerator as well). The difficulty is doing this efficiently. GPU programming is only efficient if you can form suitable batches. What I was thinking was, it might be better to still have batches with fixed-size elements, but instead focus on allowing the numerator computation to work with that. If we had to break sentences into pieces, we could either handle it by constraining to an FST like Kaldi's current implementation of chain training, or use the regular CTC-type FST but allow all states to be initial and final, so the sequence can start and end in the middle.
I am trying to sort examples by their sequence lengths so that there is as little padding as possible.
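A minimal sketch of that idea, sorting utterances by length before padding; the feature list here is a hypothetical placeholder, not the data pipeline in this pull request:

import torch
from torch.nn.utils.rnn import pad_sequence

def make_batches(features, batch_size):
    # Sketch only. features: list of (T_i, feat_dim) tensors.
    # Sort by length so each batch contains utterances of similar length
    # and as little padding as possible is introduced.
    order = sorted(range(len(features)), key=lambda i: features[i].size(0))
    for start in range(0, len(order), batch_size):
        batch = [features[i] for i in order[start:start + batch_size]]
        lengths = torch.tensor([f.size(0) for f in batch])
        padded = pad_sequence(batch, batch_first=True)  # (N, T_max, feat_dim)
        yield padded, lengths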
I'm still learning the internals of chain training. We can come back to this when the handling …
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.
Work in progress.