Add support for BLOOM #331
Closes #61

This PR adds the BLOOM model and modifies the paged attention kernel to support ALiBi bias.
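For context on the ALiBi bias referenced throughout this PR: each attention head is assigned a fixed slope, and the attention logits receive a penalty proportional to the query-key distance scaled by that slope. Below is a minimal Python sketch of the standard slope schedule from the ALiBi paper (which BLOOM follows); it is illustrative only and not the code added in this PR.

```python
import math
from typing import List


def get_alibi_slopes(num_heads: int) -> List[float]:
    """Per-head ALiBi slopes: a geometric sequence starting at 2^(-8/n).

    For a power-of-two number of heads n, the slopes are 2^(-8*i/n) for
    i = 1..n; otherwise the remaining heads are filled in by interpolating
    with the schedule of the next power of two, as in the ALiBi paper.
    """
    closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
    base = 2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3)))
    slopes = [base ** i for i in range(1, closest_power_of_2 + 1)]
    if closest_power_of_2 != num_heads:
        extra_base = 2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3)))
        num_extra = num_heads - closest_power_of_2
        slopes += [extra_base ** i for i in range(1, 2 * num_extra, 2)]
    return slopes


# Example: with 8 heads the slopes are 1/2, 1/4, ..., 1/256.
print(get_alibi_slopes(8))
```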
Conversation
@zhuohan123 I've fixed the PR to follow our new formatter. It should be ready for review now. Please take a look!
LGTM! Thank you for your hard work! I left some questions about the design choices. In addition, what's the speed difference between similar-sized LLaMA and BLOOM?
    float qk = scale * Qk_dot<scalar_t, THREAD_GROUP_SIZE>::dot(q_vecs, k_vecs);
    // Add the ALiBi bias if slopes are given.
    qk += (alibi_slope != 0) ? alibi_slope * (token_idx - context_len) : 0;
This condition seems unnecessary. If `alibi_slope == 0`, then `alibi_slope * (token_idx - context_len)` will be 0 as well.
Yes. It's to avoid the redundant computation of `0 * (token_idx - context_len)`.
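For readers following along, here is a small PyTorch sketch (not the kernel itself) of the bias term that line adds for a single decoding token; `decode_alibi_bias` is a hypothetical helper, and `alibi_slopes` is assumed to be a `[num_heads]` tensor of per-head slopes.

```python
import torch


def decode_alibi_bias(alibi_slopes: torch.Tensor, context_len: int) -> torch.Tensor:
    """Bias added to the attention logits of the current (last) token.

    alibi_slopes: [num_heads] per-head ALiBi slopes.
    Returns a [num_heads, context_len] tensor whose entry (h, t) is
    alibi_slopes[h] * (t - context_len), mirroring the kernel expression above.
    """
    token_idx = torch.arange(context_len, dtype=alibi_slopes.dtype)
    return alibi_slopes[:, None] * (token_idx - context_len)[None, :]
```

Keys closer to the current token receive a smaller penalty, which is the core idea of ALiBi.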
@@ -53,13 +55,21 @@ def __init__(self, num_heads: int, head_size: int, scale: float) -> None:
            raise ValueError(f"head_size ({self.head_size}) is not supported. "
                             f"Supported head sizes: {_SUPPORTED_HEAD_SIZES}.")

    def set_attn_bias(self, input_metadata: InputMetadata) -> None:
        if input_metadata.attn_bias:
            # Already set by a previous layer.
Why do you choose this design, instead of explicitly initializing `attn_bias` in advance, say at the beginning of the `forward` function of BLOOM?
Good question. It's because `alibi_slopes` is stored in the attention layer. If we want to create the attention bias in `BloomForCausalLM`, then we have to store `alibi_slopes` in both places (because `alibi_slopes` is also used for the decoding attention). I somewhat agree that this design is not ideal, but I couldn't find a better way to do it.
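To make the trade-off concrete, here is a rough sketch of the lazy-initialization pattern described above, where the attention layer that owns `alibi_slopes` fills in the shared bias on first use. This is only illustrative; `InputMetadata` below is a simplified stand-in rather than the real vLLM class.

```python
from dataclasses import dataclass, field
from typing import List

import torch


@dataclass
class InputMetadata:
    # Simplified stand-in for vLLM's InputMetadata.
    prompt_lens: List[int]
    attn_bias: List[torch.Tensor] = field(default_factory=list)


class PagedAttentionWithALiBi:
    def __init__(self, alibi_slopes: torch.Tensor) -> None:
        # The attention layer is the only place that keeps the slopes.
        self.alibi_slopes = alibi_slopes  # [num_heads]

    def set_attn_bias(self, input_metadata: InputMetadata) -> None:
        if input_metadata.attn_bias:
            # Already set by a previous layer; all layers share the same bias.
            return
        for prompt_len in input_metadata.prompt_lens:
            pos = torch.arange(prompt_len)
            # bias[h, q, k] = slope[h] * (k - q); entries with k > q are assumed
            # to be removed later by the causal mask.
            rel = pos[None, :] - pos[:, None]                          # [L, L]
            bias = self.alibi_slopes[:, None, None] * rel[None, :, :]  # [num_heads, L, L]
            input_metadata.attn_bias.append(bias)
```

With this layout, only the first attention layer pays the cost of building the bias, and `BloomForCausalLM` never needs to see `alibi_slopes`.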
@WoosukKwon do you have some benchmarks about speed and memory with BLOOM?