metal: template for mat-vec multiplication kernels #2891
Conversation
(branch force-pushed from a404dc3 to 507ba74)
(branch force-pushed from 507ba74 to bca5d0c)
M2 Ultra
Strange that I don't observe the big … Here are the max threads per kernel:

Not sure how these max threads can be utilized. I think they somehow indicate how "parallel" the kernel is. Or maybe how many registers it uses.
@ggerganov The maximum number of threads indicates the register pressure. We should avoid using too many of them, but this depends on the workload. I am bothered by the fact that we don't achieve double the inference speed on Ultra chips. For llama 34B Q4_0, the M1 Max inference speed is ~16 tok/s. One would expect the M2 Ultra to reach ~32 tok/s, but in reality we only get ~27 tok/s. Would you mind profiling this on the M2 Ultra? We can directly profile a built binary, so there's no need to configure llama.cpp as an Xcode project.
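(For context, the ~2x expectation comes from single-token generation being memory-bandwidth bound: each generated token has to stream roughly the whole model from memory. Below is a minimal sketch of that arithmetic; the bandwidth and model-size figures are rough assumptions for illustration, not numbers from this thread.)

```c
#include <stdio.h>

int main(void) {
    /* all figures below are rough assumptions for illustration */
    const double model_bytes = 18.0e9;  /* ~18 GB for a 34B Q4_0 model (assumed)     */
    const double bw_m1_max   = 400e9;   /* nominal M1 Max memory bandwidth, bytes/s  */
    const double bw_m2_ultra = 800e9;   /* nominal M2 Ultra memory bandwidth         */

    /* mat-vec generation is bandwidth bound: tok/s is capped by bandwidth / model size */
    printf("M1 Max   upper bound: ~%.0f tok/s\n", bw_m1_max   / model_bytes);
    printf("M2 Ultra upper bound: ~%.0f tok/s\n", bw_m2_ultra / model_bytes);

    /* doubling the bandwidth should roughly double tok/s, hence the ~16 -> ~32 tok/s expectation */
    return 0;
}
```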
I can try. How do I profile the binary?
(branch force-pushed from 5edbba1 to aa4b7d2)
(branch force-pushed from 927b13c to 35cd10c)
With the new commits, we should expect to see approximately zero performance loss on the M2 series for Q4_K and other non-k quantizations. If …
Have you noticed how matrix multiplication performance, after reaching its peak at about pp = 64, goes down with increasing context size? This is the case on this branch and on master. Normally we would expect matrix multiplication performance to increase initially and then saturate once the compute capability limit has been reached. Here are some sample results on a 30-core M2 Max:
@ikawrakow A quick check you can perform is simply adding a …

@ggerganov …
@lshzh-ww Not sure I understand your reasoning. The model matrices are of a fixed size. The intermediate result matrices (or activations, as they call them) grow with prompt size, and we observe performance degradation. I.e., it is exactly the other way around: performance is good for relatively small matrices (…
These are the calculations we need to conduct for one layer when evaluating 512 tokens. Currently, we use the optimized … When I mentioned 'small matrices', I meant matrices smaller than the model matrices. Regarding the …
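(The figure with the per-layer calculations referenced above is not reproduced here. As a rough stand-in, the sketch below lists the matrix multiplications in one layer when evaluating 512 tokens, assuming LLaMA-7B-like dimensions — n_embd = 4096, n_head = 32, head_dim = 128, n_ff = 11008. These dimensions are assumptions for illustration, not numbers taken from the thread. The weight ("model") matrices are fixed, while the attention-score and attention-output products involve matrices whose sizes depend on the number of tokens.)

```c
#include <stdio.h>

int main(void) {
    /* assumed LLaMA-7B-like dimensions, for illustration only */
    const int n_tokens = 512, n_embd = 4096, n_head = 32, head_dim = 128, n_ff = 11008;

    /* fixed-size weight matrices times activation matrices that grow with the batch */
    printf("Q/K/V/O projections: 4 x ([%d x %d] * [%d x %d])\n",
           n_embd, n_embd, n_embd, n_tokens);
    printf("FFN gate/up:         2 x ([%d x %d] * [%d x %d])\n",
           n_ff, n_embd, n_embd, n_tokens);
    printf("FFN down:            1 x ([%d x %d] * [%d x %d])\n",
           n_embd, n_ff, n_ff, n_tokens);

    /* attention products: both operands depend on the number of tokens */
    printf("scores K^T*Q:       %d x ([%d x %d] * [%d x %d])\n",
           n_head, n_tokens, head_dim, head_dim, n_tokens);
    printf("output V*scores:    %d x ([%d x %d] * [%d x %d])\n",
           n_head, head_dim, n_tokens, n_tokens, n_tokens);
    return 0;
}
```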
With the following patch applied on this branch, I get:
diff --git a/ggml-metal.m b/ggml-metal.m
index fc656fb..47427f1 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -672,6 +672,8 @@ void ggml_metal_graph_compute(
                 //        dst->name);
                 //}
+                if (dst->op != GGML_OP_MUL_MAT) continue;
+
                 switch (dst->op) {
                     case GGML_OP_NONE:
                     case GGML_OP_RESHAPE:
@@ -810,6 +812,7 @@ void ggml_metal_graph_compute(
                     [ctx->device supportsFamily:MTLGPUFamilyApple7] &&
                     ne00%32 == 0 &&
                     ne11 > 1) {
+                    continue;
                     switch (src0->type) {
                         case GGML_TYPE_F16:  [encoder setComputePipelineState:ctx->pipeline_mul_mm_f16_f32];  break;
                         case GGML_TYPE_Q4_0: [encoder setComputePipelineState:ctx->pipeline_mul_mm_q4_0_f32]; break;
@@ -870,6 +873,7 @@ void ggml_metal_graph_compute(
                     }
                     [encoder dispatchThreadgroups:MTLSizeMake((ne01 + 7)/8, ne11, ne12) threadsPerThreadgroup:MTLSizeMake(64, 1, 1)];
                 } else {
+                    continue;
                     switch (src0->type) {
                         case GGML_TYPE_F16: [encoder setComputePipelineState:ctx->pipeline_mul_mat_f16_f32]; break;
                         default: GGML_ASSERT(false && "not implemented");

If I disable all ops in Metal to measure just the overhead of non-Metal stuff, I get:

diff --git a/ggml-metal.m b/ggml-metal.m
index fc656fb..ee2da39 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -672,6 +672,8 @@ void ggml_metal_graph_compute(
                 //        dst->name);
                 //}
+                continue;

If I remove this overhead from the previous results, I get the following BW estimate:
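(The BW estimate table itself is not reproduced here. For reference, below is a minimal sketch of how such an estimate can be derived from the timings; the numbers are placeholders, not the measurements from this thread.)

```c
#include <stdio.h>

int main(void) {
    /* placeholders for illustration, not the measurements from this thread */
    const double model_bytes = 17.0e9;  /* quantized weights streamed once per token  */
    const double t_total     = 0.030;   /* seconds per token with all ops enabled      */
    const double t_overhead  = 0.004;   /* seconds per token of non-Metal overhead     */

    /* effective bandwidth ~ bytes streamed / time actually spent in the mat-vec kernels */
    const double bw = model_bytes / (t_total - t_overhead);
    printf("estimated effective bandwidth: ~%.0f GB/s\n", bw / 1e9);
    return 0;
}
```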
New numbers after merge:
@ggerganov Thank you for conducting the test! It appears that the M2 Ultra's bandwidth utilization increases as the model size grows. Perhaps the 17 GB model is simply too 'tiny' to fully engage it. Good to know this.
Do you have any idea how to make my Instruments also show the measured bandwidth like in your Xcode?
I am sorry, I really have no idea why it didn't work. I'm currently using the Sonoma Beta in conjunction with Xcode 15, although I'm uncertain if this might be the cause of the issue.
I compared this branch (PR) with #2959 (IK) on M2 Ultra after rebasing both on latest master.
However, I'm looking at the new implementation and I am not really happy with the heavy templateization of the code. I would like to take some more time to review the code and post some more specific comments about what I find concerning.
These are some general comments about what I like and don't like in this implementation. I know they may sound opinionated and it's fine to disagree.
I'm just worried that going down the "template" road will increase the technical debt in the long run and make the code very hard to understand and modify. For example, when I try to work on the CUDA implementation nowadays (where this type of change has already happened), it is very difficult for me to make a change without breaking everything else. Abstracting things like this and putting them into common patterns will always lead to those kinds of problems. We have to be very confident that these kernels are the best possible way to compute before we "bind" their implementations together. Otherwise, we won't be able to iterate and improve the performance, or at the very least it will be exceedingly difficult.
At first look, having fewer LOCs might seem like a good result. But I think it's more important to consider the cognitive load when reading the code. 1000 lines of straightforward, repetitive and easily parsable code are better than 100 lines of abstraction with 2 levels of nested indirection in it.
Regarding the upcoming experiments related to quantized matrix multiplication that I believe will further improve the performance:
This commit provides one template for all mat-vec kernels, making it easier for us to implement multiplication kernels between two quantized types in future.
I don't think it will help. It's likely to make it more difficult - we will be trying to fit the new things into the old pattern, instead of being able to freely hack and experiment with new types of implementations that could turn out to be more effective for that purpose.
Hope this does not come across as too harsh a critique and disappoint you - your contributions have been extremely valuable and I will be happy to see more of them.
But this particular change likely won't make it to master.
GGML_METAL_DECL_KERNEL(mul_mv_q3_K_f32);
GGML_METAL_DECL_KERNEL(mul_mv_q4_K_f32);
GGML_METAL_DECL_KERNEL(mul_mv_q5_K_f32);
GGML_METAL_DECL_KERNEL(mul_mv_q6_K_f32);
This rename is OK
typedef struct {
    half   d;          // delta
    int8_t qs[QK8_0];  // quants
} block_q8_0;
Moving this here from the top of the file is OK
}

//====================================== dot products =========================

kernel void kernel_mul_mat_q2_K_f32(
I wouldn't change these kernels. Just rename them to reflect that it is a matrix x vector multiplication
template<typename block_q, short nl, void (*dequantize_func)(device const block_q *, short, thread float4x4 &)>
This type of templateization is not OK
It does reduce the code, but that's the only good thing about it.
The signature is complicated and we can no longer search for quantization strings - i.e. searching for "4_0" will skip this template. This is a problem, because very often we want to go through all "4_0"-related source code and make some change or compare with other quantizations. This being a template makes the process more difficult by introducing template argument indirection.
My preference is to have a much longer and easily parsable code with few or none template argument indirections.
template<typename block_q_type, int nr, int nsg, int nl, int n_shift, template<typename, typename, typename> class quant_dri>
kernel void kernel_mat_mv(device const void  * src0,
                          device const float * src1,
This signature is not OK - I need a few minutes of thinking every time I see it to understand the meaning of the template args
template <typename addr_uint16_p, typename addr_block_q_p, typename type4x4>
class q4_0_driver {
public:
    uint16_t mask1, mask2, q_offset;
I don't think we need high-level constructs for this type of computation code.
Having a struct with some data is fine, but all the functionality that operates on this data has to be implemented as plain functions with descriptive names.
Adding a class with methods here introduces another level of indirection - when I see a call to inner_product(), I have to go back and look at the type of the class. Instead, when I see for example ggml_vec_dot_q4_0_q8_0(), I immediately know what is going on.
inline float block_q_n_dot_y(device const block_q4_1 * qb_curr, float sumy, thread float * yl, int il) {
    float d = qb_curr->d;
    float m = qb_curr->m;
    device const uint16_t * qs = ((device const uint16_t *)qb_curr + 2 + il/2);
It's good that these kinds of hardcoded pointer offsets are avoided in the new version:

-device const uint16_t * qs = ((device const uint16_t *)qb_curr + 2 + il/2);
+device const uint16_t * qs = ((device const uint16_t *)qb_curr->qs + il/2);
template <typename type4x4>
void dequantize_f16(device const half4x4 * src, short il, thread type4x4 & reg) {
    half4x4 temp = *(((device half4x4 *)src));
    for (int i = 0; i < 16; i++){
        reg[i/4][i%4] = temp[i/4][i%4];
    }
}
This kind of template is OK - single argument, reduces code duplication, does not hide the quantization type, easy to parse
@@ -1921,7 +1411,7 @@ kernel void kernel_get_rows(
 #define SG_MAT_ROW 8

 // each block_q contains 16*nl weights
-template<typename block_q, short nl, void (*dequantize_func)(device const block_q *, short, thread half4x4 &)>
+template<typename block_q_type, short nl, template<typename, typename, typename> class quant_dri>
 kernel void kernel_mul_mm(device const uchar * src0,
                           device const float * src1,
                           device       float * dst,
The body of this template has to be simplified by factoring out the code that does not depend on the quantization type into separate function calls. We should try to avoid the template by having separate inline implementations of the kernels for each quantization, reusing the common functions and calling the respective dequantize_ functions.
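(A miniature, self-contained C illustration of that structure - not the actual Metal kernels, and all names and block layouts here are simplified stand-ins: each quantization keeps its own explicit, searchable entry point, while the quantization-independent work and the per-type dequantize helpers are plain functions.)

```c
#include <stdio.h>
#include <stdint.h>

#define QK_TOY 4  /* toy block size, not the real QK4_0 */

typedef struct { float d;          uint8_t qs[QK_TOY]; } block_q4_0_toy;
typedef struct { float d; float m; uint8_t qs[QK_TOY]; } block_q4_1_toy;

/* quantization-specific helpers, analogous to the dequantize_* functions */
static void dequantize_q4_0_toy(const block_q4_0_toy * b, float * out) {
    for (int i = 0; i < QK_TOY; ++i) out[i] = (b->qs[i] - 8) * b->d;
}
static void dequantize_q4_1_toy(const block_q4_1_toy * b, float * out) {
    for (int i = 0; i < QK_TOY; ++i) out[i] = b->qs[i] * b->d + b->m;
}

/* quantization-independent work factored out as a plain function */
static float dot_toy(const float * x, const float * y) {
    float acc = 0.0f;
    for (int i = 0; i < QK_TOY; ++i) acc += x[i] * y[i];
    return acc;
}

/* explicit per-quantization "kernels": grepping for "4_0" or "4_1" finds them directly */
static float mul_mv_q4_0_toy(const block_q4_0_toy * b, const float * y) {
    float tmp[QK_TOY];
    dequantize_q4_0_toy(b, tmp);
    return dot_toy(tmp, y);
}
static float mul_mv_q4_1_toy(const block_q4_1_toy * b, const float * y) {
    float tmp[QK_TOY];
    dequantize_q4_1_toy(b, tmp);
    return dot_toy(tmp, y);
}

int main(void) {
    const block_q4_0_toy b0 = { 0.5f,       { 8, 9, 10, 7 } };
    const block_q4_1_toy b1 = { 0.5f, 1.0f, { 0, 1,  2, 3 } };
    const float y[QK_TOY] = { 1, 1, 1, 1 };
    printf("%f %f\n", mul_mv_q4_0_toy(&b0, y), mul_mv_q4_1_toy(&b1, y));
    return 0;
}
```

In the real kernels the shared pieces would be the threadgroup indexing and reduction code rather than a plain dot product, but the layout - one explicit kernel per quantization type reusing common helpers - is the same.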
I realized that readability is a top concern for an open-source project, where collective collaboration is key. I guess I may have been a bit too fixated on reducing LOCs when I initially opened this PR. I also learned that refactoring existing code should not be undertaken without substantial justification and conversation with other people. The primary objective of this PR is to introduce a new function, …
Yes, I think your insight about optimizing the F16 ops in the red rectangles in the plot above was instrumental for the PP speedup.
Yes, no problem with that. I plan to merge #2959 for now, but feel free to update this implementation as you wish.
I think your previous benchmarks are sufficient. Thank you for your help again!
This commit provides one template for all mat-vec kernels, making it easier for us to implement multiplication kernels between two quantized types in future.
Also, this speeds up Q_K and F16 inference a little bit.

M1 Max 32c:

QK_K=256 was tested on llama models, and QK_K=64 was tested on openllama models. People are welcome to test this pull request.