ggml : full ALiBi support #7192
Conversation
// TODO: is there a better way to handle -INFINITY?
dst_data[i00] = src[0] == -INFINITY ? -MAXHALF : src[0];
Not sure I fully understand the problem, but when we cast the KQ_mask from F32 to F16 in this kernel, the F32 -INFINITY values are converted to some F16 value that, when multiplied with the ALiBi slope, results in garbage. Even if I force the slope to be 1.0h it still produces garbage. I expected that it would still be -INFINITY, but that's not the case. Since there is no way to print these values in Metal, this is the workaround that I found to work, but it feels a bit poor.
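For illustration, a minimal CPU-side sketch of the same workaround (65504 is the largest finite F16 magnitude; the helper name is made up and this is not the actual Metal kernel):

#include <math.h>

// Largest finite magnitude representable in IEEE-754 half precision.
#define MAXHALF 65504.0f

// Clamp -INFINITY to the most negative finite half value while converting the
// F32 mask, so that the later multiplication by the ALiBi slope stays well-defined.
static inline float mask_to_f16_range(float m) {
    return m == -INFINITY ? -MAXHALF : m;
}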
for (int i1 = ir0; i1 < ir1; i1++) {
    // ALiBi
    const uint32_t h = (i1/ne01)%ne02; // head

    const float slope = (max_bias > 0.0f) ? h < n_head_log2 ? powf(m0, h + 1) : powf(m1, 2*(h - n_head_log2) + 1) : 1.0f;
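For context, a sketch of how the m0, m1 and n_head_log2 constants used in this line can be derived from max_bias, following the geometric sequence of slopes from the ALiBi paper (illustrative, not the exact surrounding code):

#include <math.h>
#include <stdint.h>

// max_bias is the model hyperparameter (e.g. 8.0f for ALiBi models); when n_head
// is not a power of two, a second sequence (m1) covers the remaining heads.
static float alibi_slope(uint32_t h, uint32_t n_head, float max_bias) {
    if (max_bias <= 0.0f) {
        return 1.0f;
    }
    const uint32_t n_head_log2 = 1u << (uint32_t) floorf(log2f((float) n_head));
    const float m0 = powf(2.0f, -(max_bias       ) / n_head_log2);
    const float m1 = powf(2.0f, -(max_bias / 2.0f) / n_head_log2);
    return h < n_head_log2 ? powf(m0, h + 1) : powf(m1, 2*(h - n_head_log2) + 1);
}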
I do not quite see how this works directly. How does it implement this logic:
[0, 1, 2, 3], [1, 0, 1, 2]
and the negative slope, or the negativity of this matrix?
We compute the integer matrix here:
Lines 10966 to 10987 in f7055d3
// For causal attention, use only the previous KV cells
// of the correct sequence for each token of the batch.
// It's assumed that if a token in the batch has multiple sequences, they are equivalent.
for (int h = 0; h < 1; ++h) {
    for (int j = 0; j < n_tokens; ++j) {
        const llama_pos    pos    = batch.pos[j];
        const llama_seq_id seq_id = batch.seq_id[j][0];

        for (int i = 0; i < n_kv; ++i) {
            float f;
            if (!lctx.kv_self.cells[i].has_seq_id(seq_id) || lctx.kv_self.cells[i].pos > pos) {
                f = -INFINITY;
            } else {
                if (hparams.use_alibi) {
                    f = -fabs(lctx.kv_self.cells[i].pos - pos);
                } else {
                    f = 0.0f;
                }
            }

            data[h*(n_kv*n_tokens) + j*n_kv + i] = f;
        }
    }
We store it in KQ_mask. The slope is just the head-specific hyperparameter m from the link.
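To connect this back to the [0, 1, 2, 3] question above: for a single causal sequence, the loop fills KQ_mask with -|pos_i - pos_j| on visible entries and -INFINITY on masked ones, so multiplying by the per-head slope reproduces the ALiBi bias. A toy illustration (standalone, not library code):

#include <math.h>
#include <stdio.h>

int main(void) {
    const int   n     = 4;      // 4 tokens at positions 0..3, single sequence
    const float slope = 0.5f;   // example per-head slope m

    for (int j = 0; j < n; ++j) {       // token (query) position
        for (int i = 0; i < n; ++i) {   // KV cell position
            const float f = i > j ? -INFINITY : -fabsf((float) (i - j));
            printf("%9.1f", slope*f);   // what the softmax sees for this head
        }
        printf("\n");
    }
    return 0;
}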
ah ok, good thanks!
I think this is ready. @JoanFM, I will now rebase the Jina branch on top of this branch and will adapt to the changes. Will ping you when ready so we can do some tests and verify that this works. If it is good, will mark this PR ready for review and proceed.

Updated the branch in #6826 and my embedding tests using Jina worked correctly, so it is ready for further tests. Let me know if you spot something that is not right.

I have done some tests on my end and it seems to work fine.
float scale    = 1.0f;
float max_bias = 0.0f;

memcpy(&scale, (float *) dst->op_params + 0, sizeof(float));
Just to confirm, this is needed because to set the params we also do a memcpy instead of making op_params a float[], for alignment reasons, right?
Yes, correct
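For context, a rough sketch of the pattern being confirmed here: op_params is an int32_t-backed buffer, so float parameters are copied byte-wise in both directions instead of casting the buffer (helper names are made up, this is not the exact ggml code):

#include <stdint.h>
#include <string.h>

// Write side, e.g. when building the soft_max op:
static void set_soft_max_params(int32_t * op_params, float scale, float max_bias) {
    const float params[2] = { scale, max_bias };
    memcpy(op_params, params, sizeof(params));
}

// Read side, e.g. in the compute kernel:
static void get_soft_max_params(const int32_t * op_params, float * scale, float * max_bias) {
    memcpy(scale,    (const float *) op_params + 0, sizeof(float));
    memcpy(max_bias, (const float *) op_params + 1, sizeof(float));
}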
I tested the CI of SYCL on an Intel GPU. The quality is OK! soft_max is the same as the base.
I think the total amount of work needed for conflict resolution would have been lower if #7188 had been merged first, but what's done is done.

Implementing ALiBi as explained here: https://github.com/ofirpress/attention_with_linear_biases
If I understand correctly, the ALiBi bias can become part of the KQ_mask. Therefore there is no need to create the KQ_pos tensor as I initially thought. If this is correct, then we can simplify the ggml_soft_max_ext() operator and no longer pass the positions tensor. Extending Flash Attention support should also be possible and simple.

This PR is needed to properly support Jina embedding models: #6826
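As a rough sketch of the resulting call pattern (names follow the usual llama.cpp graph-building code; treat it as illustrative rather than the exact diff): the bias -|pos_i - pos_j| lives in KQ_mask, and ggml_soft_max_ext() applies softmax(KQ*scale + slope*KQ_mask) with the head-specific slope derived from max_bias.

// KQ_mask holds -|pos_i - pos_j| for visible cells and -INFINITY for masked ones.
// With max_bias > 0 each head's slope scales the mask (ALiBi); with max_bias == 0
// the slope is 1 and the mask acts as a plain causal mask.
struct ggml_tensor * KQ = ggml_mul_mat(ctx, k, q);

struct ggml_tensor * KQ_soft_max = ggml_soft_max_ext(ctx, KQ, KQ_mask, kq_scale, hparams.f_max_alibi_bias);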
Workflow

- Remove ggml_alibi()
- Update ggml_soft_max_ext() to no longer accept the pos tensor
- Update ggml_flash_attn_ext() to support the new ALiBi KQ_mask

Tests