[scripts] Add layer for attention with bypass #3694

Draft
wants to merge 2 commits into base: master
Conversation

danpovey (Contributor)

No description provided.

@danpovey (Contributor, Author):

Note, this is just a draft, pending experiments.

stale bot commented on Jun 19, 2020:

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale (Stale bot on the loose) label on Jun 19, 2020
@kkm000 added the in progress (Issue has been taken and is being worked on) and stale-exclude (Stale bot ignore this issue) labels on Jul 15, 2020
stale bot removed the stale (Stale bot on the loose) label on Jul 15, 2020
@kkm000 marked this pull request as draft on Jul 15, 2020 09:58
@kkm000 (Contributor) commented on Aug 29, 2021:

@danpovey, I don't want to lose this; I'd like to take it over. What did you mean by "not working well": WER, convergence, or something else? If you can remember, of course... :)


tdnn_opts="l2-regularize=0.02 dropout-proportion=0.0 dropout-per-dim-continuous=true"
tdnnf_opts="l2-regularize=0.02 dropout-proportion=0.0 bypass-scale=0.8"
attention_opts="l2-regularize=0.02 num-heads=2 num-left-inputs=5 num-left-inputs-required=1 num-right-inputs=2 num-right-inputs-required=1 dropout-proportion=0.0 bypass-scale=0.8"
@danpovey (Contributor, Author):

These num-left-inputs and num-right-inputs values were likely way too small; something much larger, like 20 to 40, might be closer to optimal. I don't know how efficient that would be... hopefully OK; I don't remember too much of the internal implementation.
It might also make sense to combine this with residual layers that have no acoustic context, i.e. frame-by-frame residual layers, either instead of or in addition to the tdnnf layers. I think this could likely be accomplished simply by using time-stride=0 in the tdnnf layers (see the sketch below).
You can share the log files with me if you want, especially the progress log and/or the detailed progress log printed every 10 epochs, where it invokes nnet3-info. Now that I have worked with attention setups in PyTorch I may have some better intuitions. I see now that my intuitions at the time were likely wrong: I expected most of the benefit, and most of the attention, to be on very limited/immediate context, which isn't true; it's much more spread out. And I expected the attention map to be much more "peaky"; in reality it tends to be quite spread out.
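A possible concrete version of that suggestion, reusing the option strings from the snippet above. This is only a sketch: the wider 20-to-40-frame context and the time-stride=0 frame-by-frame layer follow this comment, while the specific numbers, the layer name, and the dim/bottleneck-dim values are placeholders, not from this PR.

# wider attention context, e.g. somewhere in the 20-40 frame range on each side:
attention_opts="l2-regularize=0.02 num-heads=2 num-left-inputs=40 num-left-inputs-required=1 num-right-inputs=20 num-right-inputs-required=1 dropout-proportion=0.0 bypass-scale=0.8"
# frame-by-frame residual layer (no acoustic context), added in the xconfig part of the script;
# the name and the dim/bottleneck-dim values below are placeholders:
tdnnf-layer name=tdnnf-nocontext $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=0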

@danpovey (Contributor, Author):

Incidentally, there is another reason, I now realize, why our attempts in Kaldi to use attention were not generally that successful. The Kaldi recipes are in an optimization regime where we optimize very fast, in relative terms, and this is enabled by aggressive l2 regularization. This aggressive l2 works because the models' structure is carefully designed not to have problems in this situation; for example, that's why the tdnnf-layer has one of its projections constrained to be orthogonal (otherwise we can "lose" certain subspaces in the bottleneck dim of the tdnnf-layer; they decay to zero).
[The l2 and learning rate are related; you can actually figure out an "effective" learning rate, applicable only to layers followed by batchnorm, from an equation involving the l2 constant and the learning rate. I think it's the product of the two, or something like that, that matters.]
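For reference, a rough sketch of that relation, under the standard assumption that the weights in question are scale-invariant because they are followed by batchnorm; this derivation is an editorial addition, not something stated in the thread. With learning rate $\eta$ and l2 constant $\lambda$, the gradient $g_t$ is orthogonal to $W_t$ and scales as $\|g_t\| \propto 1/\|W_t\|$, so
$$
\|W_{t+1}\|^2 \approx (1 - 2\eta\lambda)\,\|W_t\|^2 + \eta^2\|g_t\|^2 .
$$
At equilibrium $2\eta\lambda\|W\|^2 \approx \eta^2\|g\|^2 \propto \eta^2/\|W\|^2$, so $\|W\|^2 \propto \sqrt{\eta/\lambda}$ and the effective (angular) learning rate is
$$
\eta_{\text{eff}} = \frac{\eta}{\|W\|^2} \propto \sqrt{\eta\,\lambda},
$$
i.e. what matters is (a function of) the product of the learning rate and the l2 constant.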
The problem with attention layers is that the key and query matrices are effectively multiplied together before a nonlinearity; i.e. if there is some direction in key/query space where they are both close to zero, the derivatives get close to zero too, so they get overwhelmed by the l2 term and can disappear. So we need to be careful with l2 in this case. In Icefall I am actually working with a modified version of l2 that solves this problem, in a spirit similar to the natural gradient implementation in Kaldi, but it would be quite tiresome to implement in Kaldi because the optimizer isn't so easily separable from the layers.
Anyway, it's possible that what might work is to have very little l2, e.g. 1.0e-06 (which might be effectively equivalent to zero), except for the output layer, where we could have, say, half the l2, at 0.0005; and maybe double the final-effective-lrate, because otherwise the parameters will get larger and larger as the model is trained, meaning the effective changes in parameters get smaller and smaller.
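Concretely, that might amount to something like the following changes to the option strings above. Only the 1.0e-06 value, the 0.0005 output-layer value, and the idea of doubling final-effective-lrate come from this comment; the output_opts variable name and the example learning-rate numbers are placeholders.

tdnn_opts="l2-regularize=1.0e-06 dropout-proportion=0.0 dropout-per-dim-continuous=true"
tdnnf_opts="l2-regularize=1.0e-06 dropout-proportion=0.0 bypass-scale=0.8"
attention_opts="l2-regularize=1.0e-06 num-heads=2 num-left-inputs=5 num-left-inputs-required=1 num-right-inputs=2 num-right-inputs-required=1 dropout-proportion=0.0 bypass-scale=0.8"
output_opts="l2-regularize=0.0005"
# and roughly double the final learning rate passed to steps/nnet3/chain/train.py, e.g.
#   --trainer.optimization.final-effective-lrate 0.0002   # if it was 0.0001 before (example values only)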

@kkm000 (Contributor):

Yes, I vaguely remember where the LR vs. L2 equation is. I remember tweaking it, though I don't remember why :( I think it was because I used a high dropout proportion, even above the theoretical best of 0.5, because otherwise it would have been boring.

Do you suggest reducing L2 on all layers below batchnorm, or only the attention layers?

Maybe dropout would be more effective? At the least, it may hide terms from the L2 "attack" so they survive longer on average.

And I totally missed nearly all of the new development; I had to look up what Icefall was. That's wrong too. It looks like this endless project I got stuck in neck-deep is finally coming to a close.

I never paid attention to the implementation of L2 with natural gradient descent in Kaldi (I should!), but the Fisher manifold is generally non-flat and asymmetric. Interestingly, I was wondering just a couple of days ago whether the Mahalanobis distance induces the Fisher metric, but got lost trying to derive it. (I finally got a bargain copy of Misner, Thorne, and Wheeler's Gravitation, and it incites weirdly unrelated thoughts in me :) ).

Come to think of it, I do not even understand anymore why regularization uses a norm: anything convex everywhere, and preferably without large flat hyperplanes or too many sharp "hypercorners", for lack of a better term, should do, given that the lambda is small. Hyperballs are nice and round and easily differentiable, but that's about it. In a high-D space anything has a lot of unexpected symmetries anyway.

I'm setting up a new cloud cluster, basically because the current one has software that is too old, and problems with Slurm that they at least tried to address in newer versions. I'll need to run something on it, and I do not want to just repeat the old stuff; it could be a good opportunity to explore something new.
