SelfAttention misses Linear after attention, wrong for Conformer, Transformer #221
Comments
It should not be part of `nn.SelfAttention`, but it should definitely be part of the `ConformerEncoderLayer` and other related layers.
Why? This might confuse everyone else (everyone who has maybe not used RETURNN).
Now that I have looked at the code, I think the whole naming is flawed right now. In our case, `GenericSelfAttention` is not generic at all, but is fixed to dot attention as far as I can see. Our current …
Yes, but this is highly misleading. Let's say I now implement an MLP-based self-attention instead of a dot-product-based one; how would I call that then? And I would expect this should inherit from …
I think self-attention always uses dot product. I have never seen anything else, so this is pretty standard. You can of course always implement something custom, but then this would be a separate implementation. Why would you expect to inherit from something else if you would not share any code with the standard self-attention? I don't think this is a problem. Look at any other framework: it's always just called …
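For reference, the (scaled) dot-product attention under discussion here is the standard formulation from the original Transformer paper; in the self-attention case, $Q$, $K$ and $V$ are all linear projections of the same input sequence:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Note that this formula says nothing about a final linear projection after the weighted sum; whether that projection lives inside the attention module or in the surrounding layer is exactly what this issue is about.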
I tend to prefer this variant, i.e. having it explicit. The majority of other code seems to have it included, just like the original paper describes it, but there are some exceptions.
I just wanted to clarify that this is a hidden implication. If …
Yes, but if something is standard (as dot-product is for self-attention), then I would leave it out of the name for simplification. So I would not add "dot" to the name of some self-attention function/module. For a similar reason, I also left out "multi-head".
This does not say that the names …
Anyway, the naming is somewhat orthogonal (independent, irrelevant) to this specific issue. What about my suggestion?
Also note, many other frameworks don't actually have self-attention directly as a function/module, but instead have …
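As an illustration of that composition style, here is a hypothetical NumPy sketch (not any particular framework's API; `linear`, `dot_attention`, `self_attention_block`, and `params` are made-up names): the attention itself is a generic function, and the q/k/v projections plus the output projection discussed in this issue are explicit `Linear`-style steps in the calling layer.

```python
import numpy as np

def linear(x, w, b):
    """Plain affine projection; stands in for an nn.Linear-style module."""
    return x @ w + b

def dot_attention(q, k, v):
    """Generic scaled dot-product attention; knows nothing about any projections."""
    energy = q @ k.T / np.sqrt(q.shape[-1])           # (time_q, time_k)
    weights = np.exp(energy - energy.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ v

def self_attention_block(x, params):
    """Self-attention as composed inside an encoder layer (single head for brevity):
    explicit q/k/v projections, generic attention, then an explicit output Linear."""
    q = linear(x, *params["q"])
    k = linear(x, *params["k"])
    v = linear(x, *params["v"])
    att = dot_attention(q, k, v)
    return linear(att, *params["out"])                # the projection after the attention
```

In this layout the output projection cannot be forgotten silently, because the calling layer has to spell it out explicitly.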
This is what I implemented now. I named it …
This fixes the issue here. For other topics (on naming or whatever), if you think there are issues, please open separate issues, maybe referring to the initial discussion from here.
There is a linear projection after the attention.

- ESPNet `MultiHeadedAttention` has it.
- PyTorch `torch.nn.MultiheadAttention` does not have it.
- Keras `tf.keras.layers.MultiHeadAttention` has it.
- `torchaudio.models.wav2vec2.components.SelfAttention` has it.
- Fairseq `MultiheadAttention` has it.
- Our `nn.GenericSelfAttention` (and thus `nn.SelfAttention`) does not have it.
- The RETURNN `SelfAttentionLayer` also does not have it.

But then we also don't have it in `ConformerEncoderLayer`, so it's clearly missing. Also, we don't have it in our `Transformer`, so it is missing there as well.

So, should we change `nn.GenericSelfAttention`? Always include it? Or optionally include it? Make it a required argument so that there is no confusion about it, like `out_dim: Optional[nn.Dim]` (without default): if the user sets `None`, there is no linear transformation at the end; otherwise there is.

If we don't change `nn.GenericSelfAttention`, we must fix the Transformer and Conformer.
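A rough sketch of what that required `out_dim: Optional[nn.Dim]` variant could look like (hypothetical NumPy code, not the actual `nn.GenericSelfAttention`; `SelfAttentionSketch` is a made-up name, with plain `int` dims instead of `nn.Dim` and a single head to keep it short):

```python
from typing import Optional
import numpy as np

class SelfAttentionSketch:
    """Illustrative only: self-attention where the caller must decide about the
    final linear projection via a required out_dim argument (no default)."""

    def __init__(self, in_dim: int, key_dim: int, value_dim: int, *, out_dim: Optional[int]):
        rng = np.random.default_rng(0)
        # Combined q/k/v projection, as in the usual fused formulation.
        self.w_qkv = rng.normal(size=(in_dim, 2 * key_dim + value_dim)) / np.sqrt(in_dim)
        self.key_dim = key_dim
        # out_dim=None -> no linear transformation at the end; otherwise there is one.
        self.w_out = None if out_dim is None else rng.normal(size=(value_dim, out_dim)) / np.sqrt(value_dim)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        qkv = x @ self.w_qkv
        q, k, v = np.split(qkv, [self.key_dim, 2 * self.key_dim], axis=-1)
        energy = q @ k.T / np.sqrt(self.key_dim)                  # (time, time)
        weights = np.exp(energy - energy.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)            # softmax over keys
        att = weights @ v
        return att if self.w_out is None else att @ self.w_out

# With the projection (as in Transformer/Conformer blocks) vs. explicitly without it:
with_proj = SelfAttentionSketch(in_dim=64, key_dim=64, value_dim=64, out_dim=64)
bare      = SelfAttentionSketch(in_dim=64, key_dim=64, value_dim=64, out_dim=None)
```

Making `out_dim` required (no default) is what would force every caller, including `ConformerEncoderLayer` and `Transformer`, to state explicitly whether the projection is there.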