Description
Standard multi-head attention applies a final linear projection (the output projection) after the attention itself.
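This is the $W^O$ term in the original Transformer formulation (Vaswani et al., 2017):

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)
$$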
A quick survey:

- ESPNet `MultiHeadedAttention` has it.
- PyTorch `torch.nn.MultiheadAttention` has it (as `out_proj`; see the quick check right after this list).
- Keras `tf.keras.layers.MultiHeadAttention` has it.
- `torchaudio.models.wav2vec2.components.SelfAttention` has it.
- Fairseq `MultiheadAttention` has it.
- Our `nn.GenericSelfAttention` (and thus `nn.SelfAttention`) does not have it.
- The RETURNN `SelfAttentionLayer` also does not have it.
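For PyTorch, for example, this projection is exposed as the `out_proj` submodule and applied after the attention to the concatenated heads:

```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2)
# The final linear projection after the attention itself
# (a Linear subclass mapping embed_dim -> embed_dim).
print(mha.out_proj)
```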
However, we also do not add such a projection separately in `ConformerEncoderLayer`, so it is clearly missing in the Conformer, and the same holds for our `Transformer`, so it is missing there as well.
So, should we change `nn.GenericSelfAttention`? Always include the projection, or include it optionally? One option is to make it a required argument so there is no confusion about it, e.g. `out_dim: Optional[nn.Dim]` without a default: if the user passes `None`, there is no final linear transformation at the end, otherwise there is one (sketched below).
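A minimal sketch of these semantics in plain PyTorch, not the actual returnn_common API (the class name, the `int` dims, and the argument names here are illustrative only; in `nn.GenericSelfAttention` the dims would be `nn.Dim`s):

```python
from typing import Optional

import torch


class SelfAttentionSketch(torch.nn.Module):
    """Self-attention where a final output projection is controlled by ``out_dim``."""

    def __init__(self, in_dim: int, num_heads: int, *, out_dim: Optional[int]):
        super().__init__()
        assert in_dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = torch.nn.Linear(in_dim, 3 * in_dim)  # joint Q/K/V projection
        # ``out_dim`` is required (no default): ``None`` means no linear transformation
        # after the attention, otherwise a final ``in_dim -> out_dim`` linear is applied.
        self.proj = torch.nn.Linear(in_dim, out_dim) if out_dim is not None else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, time, in_dim) -> (batch, time, in_dim or out_dim)."""
        batch, time, in_dim = x.shape
        head_dim = in_dim // self.num_heads
        # Compute Q/K/V and split into heads: (batch, heads, time, head_dim).
        q, k, v = (
            t.view(batch, time, self.num_heads, head_dim).transpose(1, 2)
            for t in self.qkv(x).chunk(3, dim=-1)
        )
        scores = torch.matmul(q, k.transpose(-2, -1)) / head_dim ** 0.5
        att = torch.matmul(torch.softmax(scores, dim=-1), v)
        att = att.transpose(1, 2).reshape(batch, time, in_dim)  # merge heads again
        # The step under discussion: the linear projection after the attention.
        return self.proj(att) if self.proj is not None else att


x = torch.randn(2, 5, 16)
print(SelfAttentionSketch(16, 4, out_dim=None)(x).shape)  # (2, 5, 16), no projection
print(SelfAttentionSketch(16, 4, out_dim=16)(x).shape)    # (2, 5, 16), with projection
```

With `out_dim=None` this would match the current `nn.GenericSelfAttention` behavior; with `out_dim=in_dim` it would match what ESPNet, Fairseq, and the PyTorch/Keras layers do.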
If we don't change `nn.GenericSelfAttention`, we must fix the Transformer and the Conformer.