Is It Necessary to Transpose Dimensions in Multi-Head Attention or Can We Reshape Directly? #399
-
Is it necessary for the implementation to first reshape and then transpose the tensors, rather than reshaping them directly into the dimensions that we want? I got the code from the book as well as from here.
-
Hey there, this is a really good question. At first glance, it looks like this should work because the dimensions would be the same. But note that reshaping and transposing are slightly different in terms of how the matrix elements get arranged for the matrix multiplication that follows. So, no, those are not interchangeable. The same question actually came up in #167, where the answer may give a bit more concrete insight. Anyways, thanks for asking!
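To make this concrete, here is a minimal PyTorch sketch (the toy shapes and variable names are made up for illustration) contrasting a direct reshape with the split-then-transpose pattern. Both produce the same output shape, but the elements end up grouped differently, so the per-head matrix multiplications that follow would see different data:

```python
import torch

# Toy dimensions (hypothetical, just for illustration)
b, num_tokens, num_heads, head_dim = 1, 3, 2, 4
d_out = num_heads * head_dim

# Stand-in for a projected query/key/value tensor of shape (b, num_tokens, d_out)
x = torch.arange(b * num_tokens * d_out, dtype=torch.float32).view(b, num_tokens, d_out)

# Split d_out into (num_heads, head_dim), then swap the token and head axes
via_transpose = x.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)

# Reshape straight to the target shape instead
via_reshape = x.view(b, num_heads, num_tokens, head_dim)

print(via_transpose.shape == via_reshape.shape)  # True: both are (1, 2, 3, 4)
print(torch.equal(via_transpose, via_reshape))   # False: elements are arranged differently
```

In the transpose version, each head gets its own contiguous head_dim slice of every token's embedding, as intended; in the direct-reshape version, tokens and head slices get mixed together, so the attention scores would be computed over scrambled data even though the shapes match.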
-
Thanks a lot for your answer. I'll refer to that discussion.