I am trying to integrate my experimental model architecture, which uses both ALiBi embeddings and multi_query_attention. Comparing outputs from the vLLM-integrated model against my underlying model, I notice that the generations do not match when both ALiBi and multi_query_attention are turned on. They do match if I turn off multi_query_attention. With a larger model, I actually run into NaNs and generation fails.
I initialize the attention module as follows:
```python
self.attention = PagedAttentionWithALiBi(num_heads, head_dim, scale=scaling, slopes=slopes, num_kv_heads=1)
```
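For context, the slopes I pass in follow the usual ALiBi recipe over the query heads (sketch below; the helper name `get_alibi_slopes` is just illustrative, not part of my model code):

```python
import math
import torch

def get_alibi_slopes(total_num_heads: int) -> torch.Tensor:
    # Standard ALiBi slopes: a geometric sequence starting at 2^(-8/n) for the
    # closest power of two <= total_num_heads, with interpolated extra slopes
    # if the head count is not a power of two.
    closest_power_of_2 = 2 ** math.floor(math.log2(total_num_heads))
    base = 2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3)))
    powers = torch.arange(1, 1 + closest_power_of_2)
    slopes = torch.pow(base, powers)
    if closest_power_of_2 != total_num_heads:
        extra_base = 2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3)))
        num_extra = min(closest_power_of_2, total_num_heads - closest_power_of_2)
        extra_powers = torch.arange(1, 1 + 2 * num_extra, 2)
        slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)])
    return slopes
```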
I have narrowed the discrepancy down to the single_query_cached_kv_attention call. I don't see any test cases for PagedAttentionWithALiBi, so debugging is a little difficult. Any help would be appreciated!
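In case it helps, this is the kind of naive reference I am comparing the kernel output against (a pure-PyTorch sketch, not vLLM code: it broadcasts the single KV head across all query heads and adds the ALiBi bias explicitly for the last query position):

```python
import torch

def ref_single_query_mqa_alibi(
    q: torch.Tensor,          # [num_heads, head_dim] query for the current token
    k_cache: torch.Tensor,    # [context_len, num_kv_heads, head_dim]
    v_cache: torch.Tensor,    # [context_len, num_kv_heads, head_dim]
    slopes: torch.Tensor,     # [num_heads] ALiBi slopes
    scale: float,
) -> torch.Tensor:
    num_heads, head_dim = q.shape
    context_len, num_kv_heads, _ = k_cache.shape
    # Multi-query attention: broadcast the shared KV head(s) across all query heads.
    k = k_cache.repeat_interleave(num_heads // num_kv_heads, dim=1)  # [ctx, num_heads, dim]
    v = v_cache.repeat_interleave(num_heads // num_kv_heads, dim=1)
    # Scaled dot-product scores: [num_heads, context_len]
    scores = torch.einsum("hd,chd->hc", q, k) * scale
    # ALiBi bias for the current (last) token: slope * (key_pos - query_pos),
    # with query_pos == context_len - 1, so the bias is always <= 0.
    positions = torch.arange(context_len, dtype=scores.dtype, device=scores.device)
    bias = slopes[:, None] * (positions[None, :] - (context_len - 1))
    probs = torch.softmax(scores + bias, dim=-1)
    return torch.einsum("hc,chd->hd", probs, v)  # [num_heads, head_dim]
```

If the kernel matches this reference when num_kv_heads == num_heads but diverges with num_kv_heads=1, that would suggest the issue is in how the ALiBi bias is indexed per query head when KV heads are shared.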