Thanks for your effort in making this great project.
In standard attention, the input to the softmax is matmul(Q, K_T), with shape (batch, num_heads, q_len, k_len).
The attention mask is lower-triangular (its full shape is q_len x k_len),
so matmul(Q, K_T) is masked with that attention mask before the softmax.
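For concreteness, here is a minimal JAX sketch of the standard flow I am describing (function and variable names are just illustrative, not from this repo):

```python
import jax
import jax.numpy as jnp

def standard_causal_attention(q, k, v):
    # q, k, v: (batch, num_heads, seq_len, dim_head)
    scale = q.shape[-1] ** -0.5
    sim = jnp.einsum('b h i d, b h j d -> b h i j', q, k) * scale  # (batch, heads, q_len, k_len)

    # lower-triangular causal mask of shape (q_len, k_len)
    q_len, k_len = sim.shape[-2], sim.shape[-1]
    causal_mask = jnp.tril(jnp.ones((q_len, k_len), dtype=bool))
    sim = jnp.where(causal_mask, sim, -jnp.inf)  # mask before the softmax

    attn = jax.nn.softmax(sim, axis=-1)
    return jnp.einsum('b h i j, b h j d -> b h i d', attn, v)
```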
However, I don't understand how matmul(q_chunk, transposed k_chunk) ends up producing the correctly masked softmax input, compared with the original attention flow, at the code lines below.
flash-attention-jax/flash_attention_jax/flash_attention.py
Lines 34 to 37 in 5727815
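My current guess is that each block of scores is masked using the absolute offsets of the query and key chunks, roughly like the sketch below. This is purely my own guess for illustration, not the code from this repo, and the names (chunk_causal_mask, q_chunk_idx, k_chunk_idx, chunk sizes) are hypothetical:

```python
import jax.numpy as jnp

def chunk_causal_mask(q_chunk_idx, k_chunk_idx, q_chunk_size, k_chunk_size):
    # absolute positions of the queries / keys covered by this block
    q_pos = q_chunk_idx * q_chunk_size + jnp.arange(q_chunk_size)  # (q_chunk_size,)
    k_pos = k_chunk_idx * k_chunk_size + jnp.arange(k_chunk_size)  # (k_chunk_size,)
    # query i may attend to key j only if j <= i (causal)
    return q_pos[:, None] >= k_pos[None, :]                        # (q_chunk_size, k_chunk_size)

def masked_chunk_scores(q_chunk, k_chunk, q_chunk_idx, k_chunk_idx):
    # q_chunk: (..., q_chunk_size, dim), k_chunk: (..., k_chunk_size, dim)
    scale = q_chunk.shape[-1] ** -0.5
    sim = jnp.einsum('... i d, ... j d -> ... i j', q_chunk, k_chunk) * scale
    mask = chunk_causal_mask(q_chunk_idx, k_chunk_idx,
                             q_chunk.shape[-2], k_chunk.shape[-2])
    # use a large finite negative value, since a fully masked block would
    # otherwise produce NaNs when the chunked softmax normalizes it
    return jnp.where(mask, sim, -1e10)
```

Is this the right mental model for how the chunked matmul stays consistent with the full q_len x k_len causal mask?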
Can you explain this in detail?