I am trying to integrate my experimental model architecture, which uses both ALiBi embeddings and multi_query_attention. Comparing outputs from the vLLM-integrated model against my underlying model, I notice that the generations do not match when both ALiBi and multi_query_attention are turned on. They do match if I turn off multi_query_attention. With a larger model, I actually run into NaNs and generation fails.
I initialize the attention module as follows:
```python
self.attention = PagedAttentionWithALiBi(num_heads, head_dim, scale=scaling, slopes=slopes, num_kv_heads=1)
```
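For context, the slopes I pass in follow the usual ALiBi recipe over the query heads (sketch below; the helper name `get_alibi_slopes` is just illustrative, not part of my model code):

```python
import math
import torch

def get_alibi_slopes(total_num_heads: int) -> torch.Tensor:
    # Standard ALiBi slopes: a geometric sequence starting at 2^(-8/n) for the
    # closest power of two <= total_num_heads, with interpolated extra slopes
    # if the head count is not a power of two.
    closest_power_of_2 = 2 ** math.floor(math.log2(total_num_heads))
    base = 2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3)))
    powers = torch.arange(1, 1 + closest_power_of_2)
    slopes = torch.pow(base, powers)
    if closest_power_of_2 != total_num_heads:
        extra_base = 2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3)))
        num_extra = min(closest_power_of_2, total_num_heads - closest_power_of_2)
        extra_powers = torch.arange(1, 1 + 2 * num_extra, 2)
        slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)])
    return slopes
```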
I have narrowed the discrepancy down to the single_query_cached_kv_attention call. I don't see any test cases for PagedAttentionWithALiBi, so debugging is a little difficult. Any help would be appreciated!
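In case it helps, this is the kind of naive reference I am comparing the kernel output against (a pure-PyTorch sketch, not vLLM code: it broadcasts the single KV head across all query heads and adds the ALiBi bias explicitly for the last query position):

```python
import torch

def ref_single_query_mqa_alibi(
    q: torch.Tensor,          # [num_heads, head_dim] query for the current token
    k_cache: torch.Tensor,    # [context_len, num_kv_heads, head_dim]
    v_cache: torch.Tensor,    # [context_len, num_kv_heads, head_dim]
    slopes: torch.Tensor,     # [num_heads] ALiBi slopes
    scale: float,
) -> torch.Tensor:
    num_heads, head_dim = q.shape
    context_len, num_kv_heads, _ = k_cache.shape
    # Multi-query attention: broadcast the shared KV head(s) across all query heads.
    k = k_cache.repeat_interleave(num_heads // num_kv_heads, dim=1)  # [ctx, num_heads, dim]
    v = v_cache.repeat_interleave(num_heads // num_kv_heads, dim=1)
    # Scaled dot-product scores: [num_heads, context_len]
    scores = torch.einsum("hd,chd->hc", q, k) * scale
    # ALiBi bias for the current (last) token: slope * (key_pos - query_pos),
    # with query_pos == context_len - 1, so the bias is always <= 0.
    positions = torch.arange(context_len, dtype=scores.dtype, device=scores.device)
    bias = slopes[:, None] * (positions[None, :] - (context_len - 1))
    probs = torch.softmax(scores + bias, dim=-1)
    return torch.einsum("hc,chd->hd", probs, v)  # [num_heads, head_dim]
```

If the kernel matches this reference when num_kv_heads == num_heads but diverges with num_kv_heads=1, that would suggest the issue is in how the ALiBi bias is indexed per query head when KV heads are shared.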