Question
Here again 😄! I'm finding it hard to work out how to apply an attention mask to fast attention. Could you please shed some light on that?
My guess is that I should set the entries of Q' and K' at the masked positions to 0 according to the attention_mask, since Q' @ K'.T equals the matrix A. Is that correct?
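To make the idea concrete, here is a minimal NumPy sketch, not the library's actual implementation. It assumes that attention_mask is a padding mask over key positions (True = keep), that Q' and K' are non-negative feature maps of shape (n, m), and that the output is normalized by the row sums of A. The function names `masked_linear_attention` and `masked_quadratic_attention` are hypothetical. Under those assumptions, zeroing the masked rows of K' alone gives the same result as zeroing the corresponding columns of the explicit matrix A. Zeroing rows of Q' as well would make the normalizer 0 for those queries, so the sketch leaves Q' untouched.

```python
import numpy as np


def masked_linear_attention(q_prime, k_prime, v, key_mask):
    """Linear-time attention with a key padding mask (illustrative sketch).

    q_prime, k_prime: (n, m) non-negative feature maps (assumed given).
    v: (n, d) values. key_mask: (n,) bool, True means "attend to this key".
    """
    # Zero the feature rows of masked keys so they contribute nothing.
    k_masked = k_prime * key_mask[:, None]
    kv = k_masked.T @ v                          # (m, d), never forms the n x n matrix
    normalizer = q_prime @ k_masked.sum(axis=0)  # (n,), row sums of the implicit A
    return (q_prime @ kv) / normalizer[:, None]


def masked_quadratic_attention(q_prime, k_prime, v, key_mask):
    """Reference version that builds A = Q' @ K'.T explicitly, then masks it."""
    a = q_prime @ k_prime.T                      # (n, n)
    a = a * key_mask[None, :]                    # zero the columns of masked keys
    return (a @ v) / a.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n, m, d = 6, 8, 4
q_prime = rng.random((n, m)) + 1e-3              # strictly positive stand-in features
k_prime = rng.random((n, m)) + 1e-3
v = rng.standard_normal((n, d))
key_mask = np.array([True, True, True, True, False, False])

fast = masked_linear_attention(q_prime, k_prime, v, key_mask)
slow = masked_quadratic_attention(q_prime, k_prime, v, key_mask)
```

Note that this only covers masks that drop whole key positions. An arbitrary n x n mask cannot in general be expressed by zeroing rows of Q' or K', because that operation can only remove entire rows or columns of A.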