
metal: template for mat-vec multiplication kernels #2891


Open · wants to merge 6 commits into base: master

Conversation

lshzh-ww (Contributor) commented Aug 30, 2023

This commit provides one template for all mat-vec kernels, making it easier for us to implement multiplication kernels between two quantized types in the future.
It also speeds up Q_K and F16 inference a little.
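
Roughly, the idea is that each quantization type only provides its block struct and a dequantize function, while a single templated kernel supplies the shared mat-vec loop. Below is a heavily simplified sketch of the pattern (illustrative only — the block layout, names, and plain per-row loop are not the PR's actual code, which also uses simdgroup-level parallelism):

```metal
#include <metal_stdlib>
using namespace metal;

// hypothetical 4-bit block layout, for illustration only
typedef struct {
    half    d;      // scale
    uint8_t qs[16]; // 32 packed 4-bit quants
} block_q4_0_sketch;

// expand 16 weights (sub-block il = 0 or 1) of one block into a 4x4 register tile
void dequantize_q4_0_sketch(device const block_q4_0_sketch * xb, short il, thread float4x4 & reg) {
    const float d = xb->d;
    for (int i = 0; i < 16; i++) {
        const int q = (il == 0) ? (xb->qs[i] & 0x0F) : (xb->qs[i] >> 4);
        reg[i/4][i%4] = d * (q - 8);
    }
}

// one generic mat-vec kernel: dst[row] = dot(src0[row,:], src1),
// with one thread per row to keep the sketch short
template<typename block_q, short nl, void (*dequantize_func)(device const block_q *, short, thread float4x4 &)>
kernel void kernel_mul_mv_sketch(
        device const block_q * src0,
        device const float   * src1,
        device       float   * dst,
        constant     int64_t & ne00,            // row length in weights
        uint row [[thread_position_in_grid]]) {
    const int nb = ne00/(16*nl);                // blocks per row
    device const block_q * x = src0 + row*nb;

    float sumf = 0.0f;
    for (int ib = 0; ib < nb; ib++) {
        for (short il = 0; il < nl; il++) {
            float4x4 w;
            dequantize_func(x + ib, il, w);
            for (int i = 0; i < 16; i++) {
                sumf += w[i/4][i%4] * src1[(ib*nl + il)*16 + i];
            }
        }
    }
    dst[row] = sumf;
}

// each quantization type then instantiates the template under a host-visible name,
// e.g. a kernel_mul_mv_q4_0_f32 built from <block_q4_0, 2, dequantize_q4_0>
```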

M1 Max 32c

| model | backend | threads | test | master t/s | PR t/s | diff |
|---|---|---|---|---|---|---|
| llama 30B Q4_0 | Metal | 4 | pp 512 | 68.46 ± 0.02 | 67.85 ± 0.07 | -0.9% |
| llama 30B Q2_K | Metal | 4 | pp 512 | 63.85 ± 0.01 | 64.20 ± 0.01 | 0.5% |
| llama 30B Q3_K_S | Metal | 4 | pp 512 | 61.02 ± 0.01 | 60.53 ± 0.01 | -0.8% |
| llama 30B Q4_K_S | Metal | 4 | pp 512 | 61.58 ± 0.03 | 62.46 ± 0.01 | 1.4% |
| LLaMA2 34B Q5_K_S | Metal | 4 | pp 512 | 45.97 ± 0.04 | 47.92 ± 0.01 | 4.2% |
| LLaMA2 34B Q6_K | Metal | 4 | pp 512 | 43.52 ± 0.03 | 47.74 ± 0.00 | 9.7% |
| llama 30B Q4_0 | Metal | 4 | tg 128 | 16.10 ± 0.01 | 16.06 ± 0.01 | -0.2% |
| llama 30B Q2_K | Metal | 4 | tg 128 | 15.41 ± 0.00 | 17.46 ± 0.00 | 13.3% |
| llama 30B Q3_K_S | Metal | 4 | tg 128 | 11.36 ± 0.00 | 13.17 ± 0.00 | 16% |
| llama 30B Q4_K_S | Metal | 4 | tg 128 | 14.48 ± 0.01 | 15.73 ± 0.01 | 8.6% |
| LLaMA2 34B Q5_K_S | Metal | 4 | tg 128 | 9.87 ± 0.00 | 11.89 ± 0.01 | 20.4% |
| LLaMA2 34B Q6_K | Metal | 4 | tg 128 | 7.95 ± 0.00 | 10.91 ± 0.01 | 37.2% |
| falcon 7B F16 | Metal | 4 | pp 512 | 154.59 ± 0.53 | 154.69 ± 0.31 | 0.0% |
| falcon 7B F16 | Metal | 4 | tg 128 | 17.45 ± 0.01 | 22.83 ± 0.02 | 30.8% |

QK_K=256 was tested on llama models, and QK_K=64 was tested on openllama models. People are welcome to test this pull request.

@lshzh-ww lshzh-ww requested a review from ggerganov August 30, 2023 03:04
@ggerganov ggerganov added high priority Very important issue performance Speed related topics labels Aug 30, 2023
ggerganov (Member) commented Aug 30, 2023

M2 Ultra

| model | size | backend | test | master t/s | PR t/s | diff |
|---|---|---|---|---|---|---|
| llama2 7B F16 | 12.55 GiB | Metal | pp 512 | 666.13 ± 0.17 | 665.74 ± 0.20 | -0.1% |
| llama2 7B Q8_0 | 6.67 GiB | Metal | pp 512 | 631.59 ± 0.25 | 631.45 ± 0.15 | 0.0% |
| llama2 7B Q6_K | 5.15 GiB | Metal | pp 512 | 561.74 ± 0.18 | 576.44 ± 0.12 | 2.6% |
| llama2 7B Q5_K | 4.45 GiB | Metal | pp 512 | 560.83 ± 0.15 | 580.31 ± 0.11 | 3.5% |
| llama2 7B Q4_K | 3.80 GiB | Metal | pp 512 | 586.86 ± 0.14 | 599.41 ± 0.06 | 2.1% |
| llama2 7B Q4_1 | 3.95 GiB | Metal | pp 512 | 634.11 ± 0.12 | 638.90 ± 0.20 | 0.8% |
| llama2 7B Q4_0 | 3.56 GiB | Metal | pp 512 | 632.20 ± 0.10 | 634.55 ± 0.12 | 0.4% |
| llama2 7B Q3_K | 3.07 GiB | Metal | pp 512 | 580.48 ± 0.23 | 586.09 ± 0.30 | 1.0% |
| llama2 7B Q2_K | 2.63 GiB | Metal | pp 512 | 580.36 ± 0.07 | 581.80 ± 0.18 | 0.2% |
| llama2 7B F16 | 12.55 GiB | Metal | tg 64 | 29.59 ± 0.01 | 40.53 ± 0.03 | 37.0% |
| llama2 7B Q8_0 | 6.67 GiB | Metal | tg 64 | 61.41 ± 0.02 | 61.10 ± 0.04 | -0.5% |
| llama2 7B Q6_K | 5.15 GiB | Metal | tg 64 | 68.64 ± 0.03 | 67.36 ± 0.07 | -1.9% |
| llama2 7B Q5_K | 4.45 GiB | Metal | tg 64 | 69.29 ± 0.03 | 69.61 ± 0.10 | 0.5% |
| llama2 7B Q4_K | 3.80 GiB | Metal | tg 64 | 80.36 ± 0.06 | 77.33 ± 0.10 | -3.8% |
| llama2 7B Q4_1 | 3.95 GiB | Metal | tg 64 | 81.91 ± 0.08 | 81.58 ± 0.08 | -0.4% |
| llama2 7B Q4_0 | 3.56 GiB | Metal | tg 64 | 86.91 ± 0.16 | 85.58 ± 0.06 | -1.5% |
| llama2 7B Q3_K | 3.07 GiB | Metal | tg 64 | 76.98 ± 0.10 | 76.93 ± 0.10 | -0.1% |
| llama2 7B Q2_K | 2.63 GiB | Metal | tg 64 | 75.58 ± 0.03 | 77.57 ± 0.04 | 2.6% |
| model | size | backend | test | master t/s | PR t/s | diff |
|---|---|---|---|---|---|---|
| llama2 13B F16 | 24.24 GiB | Metal | pp 512 | 389.75 ± 0.09 | 389.75 ± 0.09 | 0.0% |
| llama2 13B Q8_0 | 12.88 GiB | Metal | pp 512 | 368.67 ± 0.12 | 368.90 ± 0.14 | 0.1% |
| llama2 13B Q6_K | 9.95 GiB | Metal | pp 512 | 323.95 ± 0.08 | 333.15 ± 0.07 | 2.8% |
| llama2 13B Q5_K | 8.60 GiB | Metal | pp 512 | 320.93 ± 0.05 | 334.44 ± 0.07 | 4.2% |
| llama2 13B Q4_K | 7.33 GiB | Metal | pp 512 | 340.08 ± 0.08 | 347.71 ± 0.10 | 2.2% |
| llama2 13B Q4_1 | 7.61 GiB | Metal | pp 512 | 370.19 ± 0.04 | 373.28 ± 0.07 | 0.8% |
| llama2 13B Q4_0 | 6.86 GiB | Metal | pp 512 | 369.04 ± 0.11 | 370.50 ± 0.12 | 0.4% |
| llama2 13B Q3_K | 5.90 GiB | Metal | pp 512 | 335.71 ± 0.08 | 339.16 ± 0.07 | 1.0% |
| llama2 13B Q2_K | 5.06 GiB | Metal | pp 512 | 336.00 ± 0.07 | 336.59 ± 0.07 | 0.2% |
| llama2 13B F16 | 24.24 GiB | Metal | tg 64 | 16.50 ± 0.02 | 22.66 ± 0.03 | 37.3% |
| llama2 13B Q8_0 | 12.88 GiB | Metal | tg 64 | 36.59 ± 0.02 | 37.11 ± 0.02 | 1.4% |
| llama2 13B Q6_K | 9.95 GiB | Metal | tg 64 | 41.73 ± 0.01 | 42.24 ± 0.03 | 1.2% |
| llama2 13B Q5_K | 8.60 GiB | Metal | tg 64 | 42.74 ± 0.01 | 44.10 ± 0.01 | 3.2% |
| llama2 13B Q4_K | 7.33 GiB | Metal | tg 64 | 49.54 ± 0.06 | 49.43 ± 0.01 | -0.2% |
| llama2 13B Q4_1 | 7.61 GiB | Metal | tg 64 | 50.99 ± 0.02 | 52.13 ± 0.09 | 2.2% |
| llama2 13B Q4_0 | 6.86 GiB | Metal | tg 64 | 54.77 ± 0.03 | 54.87 ± 0.02 | 0.2% |
| llama2 13B Q3_K | 5.90 GiB | Metal | tg 64 | 46.67 ± 0.05 | 49.38 ± 0.05 | 5.8% |
| llama2 13B Q2_K | 5.06 GiB | Metal | tg 64 | 47.99 ± 0.05 | 50.10 ± 0.02 | 4.4% |
| model | size | backend | test | master t/s | PR t/s | diff |
|---|---|---|---|---|---|---|
| Falcon 7B F16 | 13.44 GiB | Metal | pp 512 | 402.47 ± 2.15 | 402.60 ± 2.03 | 0.0% |
| Falcon 7B Q8_0 | 7.14 GiB | Metal | pp 512 | 390.22 ± 2.33 | 390.41 ± 1.90 | 0.0% |
| Falcon 7B Q4_0 | 3.92 GiB | Metal | pp 512 | 390.37 ± 2.21 | 391.00 ± 2.45 | 0.2% |
| Falcon 7B F16 | 13.44 GiB | Metal | tg 64 | 29.43 ± 0.01 | 38.25 ± 0.05 | 30.0% |
| Falcon 7B Q8_0 | 7.14 GiB | Metal | tg 64 | 59.90 ± 0.02 | 60.91 ± 0.03 | 1.7% |
| Falcon 7B Q4_0 | 3.92 GiB | Metal | tg 64 | 86.01 ± 0.06 | 85.60 ± 0.04 | -0.5% |
| model | size | backend | test | master t/s | PR t/s | diff |
|---|---|---|---|---|---|---|
| codellama 34B F16 | 62.85 GiB | Metal | pp 512 | 149.33 ± 0.03 | 149.36 ± 0.03 | 0.0% |
| codellama 34B Q8_0 | 33.39 GiB | Metal | pp 512 | 141.07 ± 0.02 | 141.13 ± 0.02 | 0.0% |
| codellama 34B Q6_K | 25.78 GiB | Metal | pp 512 | 123.82 ± 0.02 | 127.67 ± 0.02 | 3.1% |
| codellama 34B Q5_K_M | 22.20 GiB | Metal | pp 512 | 123.62 ± 0.02 | 128.60 ± 0.02 | 4.0% |
| codellama 34B Q5_K_S | 21.64 GiB | Metal | pp 512 | 123.50 ± 0.01 | 128.74 ± 0.01 | 4.2% |
| codellama 34B Q4_K_M | 18.83 GiB | Metal | pp 512 | 130.63 ± 0.01 | 133.59 ± 0.03 | 2.3% |
| codellama 34B Q4_K_S | 17.83 GiB | Metal | pp 512 | 131.58 ± 0.02 | 134.42 ± 0.02 | 2.2% |
| codellama 34B Q4_1 | 19.69 GiB | Metal | pp 512 | 142.03 ± 0.01 | 143.23 ± 0.02 | 0.8% |
| codellama 34B Q4_0 | 17.74 GiB | Metal | pp 512 | 141.57 ± 0.03 | 142.10 ± 0.03 | 0.4% |
| codellama 34B Q3_K_M | 15.16 GiB | Metal | pp 512 | 128.39 ± 0.01 | 129.75 ± 0.02 | 1.1% |
| codellama 34B Q3_K_S | 13.60 GiB | Metal | pp 512 | 126.79 ± 0.02 | 127.25 ± 0.02 | 0.4% |
| codellama 34B Q2_K | 13.23 GiB | Metal | pp 512 | 128.11 ± 0.02 | 128.46 ± 0.01 | 0.3% |
| codellama 34B F16 | 62.85 GiB | Metal | tg 64 | 7.32 ± 0.00 | 9.88 ± 0.00 | 35.0% |
| codellama 34B Q8_0 | 33.39 GiB | Metal | tg 64 | 16.78 ± 0.01 | 16.75 ± 0.00 | -0.2% |
| codellama 34B Q6_K | 25.78 GiB | Metal | tg 64 | 19.09 ± 0.01 | 20.01 ± 0.00 | 4.8% |
| codellama 34B Q5_K_M | 22.20 GiB | Metal | tg 64 | 20.87 ± 0.01 | 21.22 ± 0.00 | 1.7% |
| codellama 34B Q5_K_S | 21.64 GiB | Metal | tg 64 | 21.21 ± 0.01 | 21.68 ± 0.00 | 2.2% |
| codellama 34B Q4_K_M | 18.83 GiB | Metal | tg 64 | 25.24 ± 0.01 | 24.45 ± 0.01 | -3.1% |
| codellama 34B Q4_K_S | 17.83 GiB | Metal | tg 64 | 26.37 ± 0.01 | 25.51 ± 0.00 | -3.3% |
| codellama 34B Q4_1 | 19.69 GiB | Metal | tg 64 | 25.69 ± 0.01 | 25.36 ± 0.01 | -1.3% |
| codellama 34B Q4_0 | 17.74 GiB | Metal | tg 64 | 27.76 ± 0.01 | 27.23 ± 0.01 | -1.9% |
| codellama 34B Q3_K_M | 15.16 GiB | Metal | tg 64 | 23.74 ± 0.02 | 24.50 ± 0.01 | 3.2% |
| codellama 34B Q3_K_S | 13.60 GiB | Metal | tg 64 | 22.84 ± 0.01 | 24.20 ± 0.02 | 6.0% |
| codellama 34B Q2_K | 13.23 GiB | Metal | tg 64 | 23.42 ± 0.01 | 24.87 ± 0.01 | 6.2% |

Strange that I don't observe the big Q5_K_S and Q6_K TG speedups for 34B

Here are the max threads per kernel:

| kernel | max threads (master) | max threads (PR) |
|---|---|---|
| kernel_mul_mat_f16_f32 | 1024 | 1024 |
| kernel_mul_mv_f16_f32 | - | 1024 |
| kernel_mul_mv_q4_0_f32 | 896 | 1024 |
| kernel_mul_mv_q4_1_f32 | 896 | 1024 |
| kernel_mul_mv_q8_0_f32 | 768 | 832 |
| kernel_mul_mv_q2_K_f32 | 640 | 832 |
| kernel_mul_mv_q3_K_f32 | 704 | 640 |
| kernel_mul_mv_q4_K_f32 | 576 | 576 |
| kernel_mul_mv_q5_K_f32 | 576 | 576 |
| kernel_mul_mv_q6_K_f32 | 1024 | 640 |
| kernel_mul_mm_f16_f32 | 768 | 768 |
| kernel_mul_mm_q4_0_f32 | 768 | 768 |
| kernel_mul_mm_q8_0_f32 | 768 | 768 |
| kernel_mul_mm_q4_1_f32 | 768 | 768 |
| kernel_mul_mm_q2_K_f32 | 768 | 768 |
| kernel_mul_mm_q3_K_f32 | 768 | 768 |
| kernel_mul_mm_q4_K_f32 | 768 | 768 |
| kernel_mul_mm_q5_K_f32 | 704 | 768 |
| kernel_mul_mm_q6_K_f32 | 704 | 768 |

Not sure how these max threads can be utilized. I think they somehow indicate how "parallel" the kernel is, or maybe how many registers it uses.

lshzh-ww (Contributor, Author)

@ggerganov
This template reduces the pressure on the memory controller by first loading whole blocks into threadgroup memory and then letting threads read from threadgroup memory, instead of having each thread directly issue multiple loads to device memory. It appears that the M1 series has a weaker memory controller than the M2 series, so the M1 series benefits more from this commit.
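
Schematically, the access pattern inside one threadgroup looks something like this (a minimal sketch of the idea only — the buffer names, tile size, and the elided dequantize/reduce steps are placeholders, not the PR's actual kernel):

```metal
#include <metal_stdlib>
using namespace metal;

#define TILE_BYTES 4096 // bytes of quantized data staged per pass (illustrative)

kernel void kernel_mul_mv_staged_sketch(
        device const uchar * src0,       // quantized rows, row_bytes bytes each
        device const float * src1,       // f32 input vector
        device       float * dst,
        constant   int64_t & row_bytes,
        threadgroup  uchar * staged [[threadgroup(0)]], // sized via setThreadgroupMemoryLength on the host
        uint tgpig [[threadgroup_position_in_grid]],
        uint tpitg [[thread_position_in_threadgroup]],
        uint ntg   [[threads_per_threadgroup]]) {
    device const uchar * row = src0 + tgpig*row_bytes;

    float sumf = 0.0f;
    for (int64_t off = 0; off < row_bytes; off += TILE_BYTES) {
        // cooperative, coalesced copy: consecutive threads fetch consecutive bytes
        for (int64_t i = tpitg; i < TILE_BYTES && off + i < row_bytes; i += ntg) {
            staged[i] = row[off + i];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);

        // ... each thread dequantizes its share of `staged`, multiplies by the
        //     corresponding src1 values, and accumulates into sumf — reading
        //     threadgroup memory instead of device memory ...

        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    // ... reduce the per-thread partial sums across the threadgroup ...
    if (tpitg == 0) {
        dst[tgpig] = sumf; // placeholder for the reduced result
    }
}
```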

The maximum number of threads reflects register pressure: the more registers a kernel uses, the fewer threads can be resident at once. We should avoid using too many registers, but the right balance depends on the workload.

I am bothered by the fact that we don't achieve doubled inference speed on Ultra chips. For llama 34B Q4_0, the M1 Max inference speed is ~16 tok/s. One would expect the M2 Ultra to reach ~32 tok/s, but in reality, we only have ~27 tok/s. Would you mind profiling this on the M2 Ultra? We can directly profile a built binary, so there's no need to configure llama.cpp to be an Xcode project.

ggerganov (Member)

I can try. How do I profile the binary?

lshzh-ww (Contributor, Author)

You need to have Xcode installed. Open Xcode, and from the Xcode menu click Open Developer Tool -> Instruments. Choose the Game Performance template.

[screenshot]

Then you can click Your Mac and Choose target....

[screenshot]

Here I also set arguments for llama.cpp.

[screenshot]

Finally, open the File menu and click Recording options. Select Performance Limiters as the Counter Set and set Performance State to Maximum. Then hit Record.

[screenshot]

Once the run finishes, you can see performance stats for the GPU. Currently, for a 33B Q4_0 model, the peak memory read bandwidth is ~385 GB/s.
[screenshot]

ggerganov (Member)

[screenshot]

It's empty:

[screenshot]

I see the GPU utilized in Activity Monitor. Not sure why Xcode does not show the read bandwidth.

Here is the "Target" view:

[screenshot]

Need to log off - will continue tomorrow.

lshzh-ww (Contributor, Author) commented Aug 31, 2023

@ggerganov

With the new commits, we should expect to see approximately 0 performance loss on the M2 series for Q4_K and other non-k quantizations.

If Instruments doesn't work on the M2 Ultra, we can alternatively keep only the MUL_MAT operations in the graph and estimate memory read bandwidth from the token generation speed. Please make sure to also skip the pipeline_mul_mat_f16_f32 kernel. I achieved around 53 ms/token on a 33B Q4_0 model when running only the quantized matrix-vector multiplications.

ikawrakow (Contributor)

Have you noticed how matrix multiplication performance, after peaking at about pp = 64, goes down with increasing context size? This is the case on this branch and on master. Normally we would expect matrix multiplication performance to increase initially and then saturate once the compute capability limit has been reached. Here are some sample results on a 30-core M2 Max:

| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 100 | pp 32 | 373.07 ± 1.29 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 100 | pp 64 | 430.14 ± 0.78 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 100 | pp 128 | 410.06 ± 0.31 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 100 | pp 256 | 354.64 ± 0.13 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 100 | pp 512 | 267.40 ± 0.07 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 100 | pp 1024 | 196.92 ± 0.06 |

lshzh-ww (Contributor, Author) commented Aug 31, 2023

@ikawrakow
I suspect this is due to not having a good implementation for small matrix multiplications. Currently, we have kernel_mul_mv_* for large matrix-vector multiplication and kernel_mul_mm_* for large matrix-matrix multiplication. We should rewrite kernel_mul_mat_f16_f32 to optimize it for relatively small matrices (when calculating KQ and KQV).

A quick check you can perform is simply adding a break; statement at line 872 of ggml-metal.m in this branch to skip all the kernel_mul_mat_f16_f32 kernels. You will observe a much, much improved prompt processing speed.

@ggerganov
I think this also answers #2850. EDIT: Partly. Profiling shows that kernel_mul_mm_* runs with nearly the same efficiency on Falcon as on Llama, so apart from kernel_mul_mat_f16_f32 there might be other kernels slowing down prompt evaluation on Falcon.

ikawrakow (Contributor)

@lshzh-ww Not sure I understand your reasoning. The model matrices are of a fixed size. The intermediate result matrices (or activations as they call them) grow with prompt size, and we observe performance degradation. I.e., it is exactly the other way around: performance is good for relatively small matrices (pp = 64) and bad for larger matrices (pp = 512).

lshzh-ww (Contributor, Author)

[image: compute graph created by ggerganov]

These are the calculations we need to conduct for one layer when evaluating 512 tokens. Currently, we use the optimized kernel_mul_mm_* for nodes in the blue box, and we use the unoptimized kernel_mul_mat_f16_f32 for nodes in the red box. It is this part of the calculations that slows down the entire prompt evaluation. Because this kernel calculates in a straightforward way, the larger the matrices (which are still much smaller than those kernel_mul_mm_* is designed for), the less efficient it becomes.

When I mentioned 'small matrices,' I meant matrices smaller than the model matrices. Regarding the kernel_mul_mat_f16_f32, you are correct: the larger the matrices, the slower it performs. I apologize for any confusion.

ggerganov (Member) commented Sep 1, 2023

@lshzh-ww

With the following patch applied on this branch, I get:

| model | size | backend | th | test | t/s | BW |
|---|---|---|---|---|---|---|
| LLaMA 30B F16 | 60.59 GiB | Metal | 4 | tg 128 | 10.74 ± 0.01 | 698.7 GB/s |
| LLaMA 30B Q8_0 | 32.19 GiB | Metal | 4 | tg 128 | 18.84 ± 0.01 | 651.2 GB/s |
| LLaMA 30B Q4_0 | 17.09 GiB | Metal | 4 | tg 128 | 32.37 ± 0.02 | 594.0 GB/s |
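
(The BW column here works out to model size × TG speed — during token generation every weight is read once per token: 17.09 GiB ≈ 18.35 GB, and 18.35 GB/token × 32.37 token/s ≈ 594.0 GB/s.)
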
diff --git a/ggml-metal.m b/ggml-metal.m
index fc656fb..47427f1 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -672,6 +672,8 @@ void ggml_metal_graph_compute(
                 //            dst->name);
                 //}
 
+                if (dst->op != GGML_OP_MUL_MAT) continue;
+
                 switch (dst->op) {
                     case GGML_OP_NONE:
                     case GGML_OP_RESHAPE:
@@ -810,6 +812,7 @@ void ggml_metal_graph_compute(
                                 [ctx->device supportsFamily:MTLGPUFamilyApple7] &&
                                 ne00%32 == 0 &&
                                 ne11 > 1) {
+                                continue;
                                 switch (src0->type) {
                                     case GGML_TYPE_F16:  [encoder setComputePipelineState:ctx->pipeline_mul_mm_f16_f32];  break;
                                     case GGML_TYPE_Q4_0: [encoder setComputePipelineState:ctx->pipeline_mul_mm_q4_0_f32]; break;
@@ -870,6 +873,7 @@ void ggml_metal_graph_compute(
                                 }
                                 [encoder dispatchThreadgroups:MTLSizeMake((ne01 + 7)/8, ne11, ne12) threadsPerThreadgroup:MTLSizeMake(64, 1, 1)];
                             } else {
+                                continue;
                                 switch (src0->type) {
                                     case GGML_TYPE_F16:  [encoder setComputePipelineState:ctx->pipeline_mul_mat_f16_f32];  break;
                                     default: GGML_ASSERT(false && " not implemented");

If I disable all ops in Metal to measure just the overhead of non-Metal stuff, I get: 2.05 ms/t

diff --git a/ggml-metal.m b/ggml-metal.m
index fc656fb..ee2da39 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -672,6 +672,8 @@ void ggml_metal_graph_compute(
                 //            dst->name);
                 //}
 
+                continue;

If I remove this overhead from the previous results, I get the following BW estimate:

| model | size | backend | th | test | mul_mat_mv t/s | BW |
|---|---|---|---|---|---|---|
| LLaMA 30B F16 | 60.59 GiB | Metal | 4 | tg 128 | 10.98 | 714.3 GB/s |
| LLaMA 30B Q8_0 | 32.19 GiB | Metal | 4 | tg 128 | 19.60 | 677.5 GB/s |
| LLaMA 30B Q4_0 | 17.09 GiB | Metal | 4 | tg 128 | 34.67 | 636.2 GB/s |
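
(These are the previous numbers with the 2.05 ms/t overhead subtracted: e.g. for Q4_0, 1/32.37 t/s ≈ 30.89 ms/t, minus 2.05 ms ≈ 28.84 ms/t, i.e. ≈ 34.67 t/s, which at ≈ 18.35 GB read per token gives ≈ 636 GB/s.)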

ggerganov (Member) commented Sep 1, 2023

New numbers after merge:

| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| llama2 7B F16 | 12.55 GiB | 6.74 B | Metal | pp 512 | 766.33 ± 0.16 |
| llama2 7B Q8_0 | 6.67 GiB | 6.74 B | Metal | pp 512 | 721.86 ± 0.30 |
| llama2 7B Q6_K | 5.15 GiB | 6.74 B | Metal | pp 512 | 650.67 ± 0.11 |
| llama2 7B Q5_K | 4.45 GiB | 6.74 B | Metal | pp 512 | 655.52 ± 0.19 |
| llama2 7B Q4_K | 3.80 GiB | 6.74 B | Metal | pp 512 | 679.89 ± 0.34 |
| llama2 7B Q4_1 | 3.95 GiB | 6.74 B | Metal | pp 512 | 730.89 ± 0.20 |
| llama2 7B Q4_0 | 3.56 GiB | 6.74 B | Metal | pp 512 | 725.72 ± 0.13 |
| llama2 7B Q3_K | 3.07 GiB | 6.74 B | Metal | pp 512 | 662.73 ± 0.12 |
| llama2 7B Q2_K | 2.63 GiB | 6.74 B | Metal | pp 512 | 657.82 ± 0.09 |
| llama2 7B F16 | 12.55 GiB | 6.74 B | Metal | tg 64 | 41.30 ± 0.01 |
| llama2 7B Q8_0 | 6.67 GiB | 6.74 B | Metal | tg 64 | 61.41 ± 0.02 |
| llama2 7B Q6_K | 5.15 GiB | 6.74 B | Metal | tg 64 | 67.33 ± 0.08 |
| llama2 7B Q5_K | 4.45 GiB | 6.74 B | Metal | tg 64 | 69.71 ± 0.04 |
| llama2 7B Q4_K | 3.80 GiB | 6.74 B | Metal | tg 64 | 78.63 ± 0.07 |
| llama2 7B Q4_1 | 3.95 GiB | 6.74 B | Metal | tg 64 | 82.99 ± 0.07 |
| llama2 7B Q4_0 | 3.56 GiB | 6.74 B | Metal | tg 64 | 87.00 ± 0.05 |
| llama2 7B Q3_K | 3.07 GiB | 6.74 B | Metal | tg 64 | 77.33 ± 0.04 |
| llama2 7B Q2_K | 2.63 GiB | 6.74 B | Metal | tg 64 | 77.50 ± 0.05 |
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| codellama 34B mostly F16 | 62.85 GiB | 33.74 B | Metal | 999 | pp 512 | 188.12 ± 0.03 |
| codellama 34B mostly Q8_0 | 33.39 GiB | 33.74 B | Metal | 999 | pp 512 | 175.48 ± 0.04 |
| codellama 34B mostly Q6_K | 25.78 GiB | 33.74 B | Metal | 999 | pp 512 | 154.96 ± 0.04 |
| codellama 34B mostly Q5_K - Medium | 22.20 GiB | 33.74 B | Metal | 999 | pp 512 | 156.34 ± 0.03 |
| codellama 34B mostly Q4_K - Medium | 18.83 GiB | 33.74 B | Metal | 999 | pp 512 | 163.70 ± 0.02 |
| codellama 34B mostly Q4_1 | 19.69 GiB | 33.74 B | Metal | 999 | pp 512 | 178.50 ± 0.04 |
| codellama 34B mostly Q4_0 | 17.74 GiB | 33.74 B | Metal | 999 | pp 512 | 176.81 ± 0.03 |
| codellama 34B mostly Q3_K - Medium | 15.16 GiB | 33.74 B | Metal | 999 | pp 512 | 158.06 ± 0.02 |
| codellama 34B mostly Q2_K | 13.23 GiB | 33.74 B | Metal | 999 | pp 512 | 156.17 ± 0.02 |
| codellama 34B mostly F16 | 62.85 GiB | 33.74 B | Metal | 999 | tg 64 | 9.98 ± 0.01 |
| codellama 34B mostly Q8_0 | 33.39 GiB | 33.74 B | Metal | 999 | tg 64 | 16.84 ± 0.01 |
| codellama 34B mostly Q6_K | 25.78 GiB | 33.74 B | Metal | 999 | tg 64 | 20.10 ± 0.00 |
| codellama 34B mostly Q5_K - Medium | 22.20 GiB | 33.74 B | Metal | 999 | tg 64 | 21.35 ± 0.01 |
| codellama 34B mostly Q4_K - Medium | 18.83 GiB | 33.74 B | Metal | 999 | tg 64 | 25.18 ± 0.01 |
| codellama 34B mostly Q4_1 | 19.69 GiB | 33.74 B | Metal | 999 | tg 64 | 26.00 ± 0.01 |
| codellama 34B mostly Q4_0 | 17.74 GiB | 33.74 B | Metal | 999 | tg 64 | 28.02 ± 0.01 |
| codellama 34B mostly Q3_K - Medium | 15.16 GiB | 33.74 B | Metal | 999 | tg 64 | 24.64 ± 0.01 |
| codellama 34B mostly Q2_K | 13.23 GiB | 33.74 B | Metal | 999 | tg 64 | 25.03 ± 0.01 |

lshzh-ww (Contributor, Author) commented Sep 1, 2023

@ggerganov Thank you for conducting the test! It appears that the M2 Ultra's bandwidth utilization increases as the model size grows larger. Perhaps the 17 GB model is simply too 'tiny' to fully engage it. Good to know this.

ggerganov (Member)

Do you have any idea how to get my Instruments to also show the measured bandwidth like yours does?

lshzh-ww (Contributor, Author) commented Sep 1, 2023

I am sorry, I really have no idea why it didn't work. I'm currently using Sonoma Beta in conjunction with Xcode 15, although I'm uncertain if this might be the cause of the issue.

ggerganov (Member) commented Sep 2, 2023

I compared this branch (PR) with #2959 (IK) on M2 Ultra after rebasing both on latest master:

| model | size | test | master t/s | IK t/s | PR t/s | speedup (PR/IK) |
|---|---|---|---|---|---|---|
| llama2 7B F16 | 12.55 GiB | pp 512 | 762.69 ± 0.21 | 1097.17 ± 4.26 | 763.22 ± 0.36 | 0.696 |
| llama2 7B Q8_0 | 6.67 GiB | pp 512 | 718.36 ± 0.22 | 1010.57 ± 0.35 | 718.62 ± 0.52 | 0.711 |
| llama2 7B Q6_K | 5.15 GiB | pp 512 | 629.40 ± 0.39 | 843.50 ± 0.32 | 647.82 ± 0.25 | 0.768 |
| llama2 7B Q5_K_M | 4.45 GiB | pp 512 | 629.98 ± 1.00 | 841.56 ± 0.56 | 653.89 ± 1.19 | 0.777 |
| llama2 7B Q5_K_S | 4.33 GiB | pp 512 | 630.31 ± 0.17 | 841.35 ± 0.36 | 656.29 ± 0.16 | 0.780 |
| llama2 7B Q4_K_M | 3.80 GiB | pp 512 | 663.84 ± 0.30 | 902.61 ± 0.51 | 680.03 ± 0.51 | 0.753 |
| llama2 7B Q4_K_S | 3.59 GiB | pp 512 | 669.04 ± 0.28 | 911.13 ± 0.61 | 684.99 ± 0.14 | 0.752 |
| llama2 7B Q4_1 | 3.95 GiB | pp 512 | 724.63 ± 0.25 | 1018.29 ± 0.63 | 730.77 ± 0.34 | 0.718 |
| llama2 7B Q4_0 | 3.56 GiB | pp 512 | 722.54 ± 0.12 | 1013.09 ± 1.23 | 725.25 ± 0.40 | 0.716 |
| llama2 7B Q3_K_M | 3.07 GiB | pp 512 | 656.13 ± 0.19 | 886.12 ± 0.48 | 663.30 ± 0.17 | 0.749 |
| llama2 7B Q3_K_S | 2.75 GiB | pp 512 | 646.84 ± 0.14 | 870.25 ± 0.76 | 649.29 ± 0.34 | 0.746 |
| llama2 7B Q2_K | 2.63 GiB | pp 512 | 655.79 ± 0.13 | 886.24 ± 0.14 | 657.04 ± 0.45 | 0.741 |
| llama2 7B F16 | 12.55 GiB | tg 64 | 18.64 ± 0.01 | 20.08 ± 0.02 | 41.40 ± 0.01 | 2.062 |
| llama2 7B Q8_0 | 6.67 GiB | tg 64 | 61.47 ± 0.02 | 61.80 ± 0.03 | 61.62 ± 0.02 | 0.997 |
| llama2 7B Q6_K | 5.15 GiB | tg 64 | 68.93 ± 0.03 | 69.29 ± 0.01 | 67.71 ± 0.03 | 0.977 |
| llama2 7B Q5_K_M | 4.45 GiB | tg 64 | 69.52 ± 0.06 | 69.97 ± 0.02 | 69.66 ± 0.03 | 0.996 |
| llama2 7B Q5_K_S | 4.33 GiB | tg 64 | 70.48 ± 0.05 | 70.99 ± 0.01 | 71.66 ± 0.07 | 1.009 |
| llama2 7B Q4_K_M | 3.80 GiB | tg 64 | 80.61 ± 0.06 | 81.08 ± 0.04 | 78.97 ± 0.06 | 0.974 |
| llama2 7B Q4_K_S | 3.59 GiB | tg 64 | 83.15 ± 0.04 | 84.08 ± 0.05 | 82.82 ± 0.11 | 0.985 |
| llama2 7B Q4_1 | 3.95 GiB | tg 64 | 82.30 ± 0.05 | 82.70 ± 0.05 | 83.25 ± 0.04 | 1.007 |
| llama2 7B Q4_0 | 3.56 GiB | tg 64 | 86.98 ± 0.06 | 87.59 ± 0.07 | 87.18 ± 0.07 | 0.995 |
| llama2 7B Q3_K_M | 3.07 GiB | tg 64 | 77.31 ± 0.05 | 78.10 ± 0.04 | 77.20 ± 0.11 | 0.988 |
| llama2 7B Q3_K_S | 2.75 GiB | tg 64 | 74.87 ± 0.02 | 75.21 ± 0.03 | 76.26 ± 0.05 | 1.014 |
| llama2 7B Q2_K | 2.63 GiB | tg 64 | 75.78 ± 0.07 | 76.43 ± 0.12 | 77.57 ± 0.11 | 1.015 |
| model | size | test | master t/s | IK t/s | PR t/s | speedup (PR/IK) |
|---|---|---|---|---|---|---|
| llama2 13B F16 | 24.24 GiB | pp 512 | 441.69 ± 0.18 | 609.11 ± 0.44 | 441.45 ± 0.46 | 0.725 |
| llama2 13B Q8_0 | 12.88 GiB | pp 512 | 414.46 ± 0.26 | 559.76 ± 1.37 | 416.57 ± 0.12 | 0.744 |
| llama2 13B Q6_K | 9.95 GiB | pp 512 | 359.83 ± 0.19 | 463.36 ± 0.18 | 371.47 ± 0.06 | 0.802 |
| llama2 13B Q5_K | 8.60 GiB | pp 512 | 356.78 ± 0.13 | 457.70 ± 0.12 | 373.40 ± 0.13 | 0.816 |
| llama2 13B Q4_K | 7.33 GiB | pp 512 | 380.05 ± 0.19 | 497.48 ± 0.25 | 389.86 ± 0.14 | 0.784 |
| llama2 13B Q4_1 | 7.61 GiB | pp 512 | 418.45 ± 0.14 | 564.94 ± 0.16 | 422.28 ± 0.26 | 0.747 |
| llama2 13B Q4_0 | 6.86 GiB | pp 512 | 416.72 ± 0.19 | 561.99 ± 0.29 | 418.70 ± 0.13 | 0.745 |
| llama2 13B Q3_K | 5.90 GiB | pp 512 | 374.65 ± 0.12 | 488.21 ± 0.14 | 378.74 ± 0.13 | 0.776 |
| llama2 13B Q2_K | 5.06 GiB | pp 512 | 374.98 ± 0.05 | 488.60 ± 0.22 | 375.74 ± 0.06 | 0.769 |
| llama2 13B F16 | 24.24 GiB | tg 64 | 10.62 ± 0.01 | 11.39 ± 0.01 | 23.39 ± 0.00 | 2.054 |
| llama2 13B Q8_0 | 12.88 GiB | tg 64 | 36.72 ± 0.01 | 36.74 ± 0.01 | 37.03 ± 0.01 | 1.008 |
| llama2 13B Q6_K | 9.95 GiB | tg 64 | 41.82 ± 0.03 | 41.97 ± 0.02 | 42.37 ± 0.02 | 1.010 |
| llama2 13B Q5_K | 8.60 GiB | tg 64 | 42.87 ± 0.01 | 43.14 ± 0.01 | 44.15 ± 0.02 | 1.023 |
| llama2 13B Q4_K | 7.33 GiB | tg 64 | 49.70 ± 0.02 | 50.16 ± 0.02 | 51.05 ± 0.03 | 1.018 |
| llama2 13B Q4_1 | 7.61 GiB | tg 64 | 51.21 ± 0.02 | 51.46 ± 0.02 | 51.95 ± 0.04 | 1.010 |
| llama2 13B Q4_0 | 6.86 GiB | tg 64 | 55.08 ± 0.01 | 55.21 ± 0.03 | 55.19 ± 0.03 | 1.000 |
| llama2 13B Q3_K | 5.90 GiB | tg 64 | 46.82 ± 0.03 | 47.50 ± 0.03 | 50.08 ± 0.03 | 1.054 |
| llama2 13B Q2_K | 5.06 GiB | tg 64 | 48.17 ± 0.02 | 48.36 ± 0.03 | 50.28 ± 0.03 | 1.040 |
| model | size | test | master t/s | IK t/s | PR t/s | speedup (PR/IK) |
|---|---|---|---|---|---|---|
| LLaMA 7B Q4_0 | 3.56 GiB | pp 32 | 701.75 ± 4.36 | 712.61 ± 2.79 | 704.11 ± 2.12 | 0.988 |
| LLaMA 7B Q4_0 | 3.56 GiB | pp 64 | 879.19 ± 3.16 | 919.36 ± 5.79 | 883.43 ± 1.79 | 0.961 |
| LLaMA 7B Q4_0 | 3.56 GiB | pp 128 | 964.09 ± 1.39 | 1060.88 ± 1.97 | 971.69 ± 1.38 | 0.916 |
| LLaMA 7B Q4_0 | 3.56 GiB | pp 256 | 888.98 ± 1.02 | 1073.01 ± 1.69 | 895.68 ± 0.71 | 0.835 |
| LLaMA 7B Q4_0 | 3.56 GiB | pp 512 | 719.63 ± 0.23 | 1009.82 ± 0.96 | 725.90 ± 0.19 | 0.719 |
| LLaMA 7B Q4_0 | 3.56 GiB | pp 1024 | 549.62 ± 0.86 | 844.49 ± 1.62 | 552.02 ± 0.27 | 0.654 |
| LLaMA 7B Q4_0 | 3.56 GiB | tg 128 | 87.45 ± 0.10 | 87.65 ± 0.17 | 87.16 ± 0.12 | 0.994 |

Excluding the very nice F16 TG speedup of more than 2x, there is a modest improvement across the board.

However, looking at the new implementation, I am not really happy with the heavy templatization of the code.
I find it much more difficult to read and understand, and I have strong doubts about merging it.

I would like to take some more time to review the code and post more specific comments about what I find concerning.

ggerganov (Member) left a comment

These are some general comments about what I like and don't like in this implementation. I know they may sound opinionated and it's fine to disagree.

I'm just worried that going down the "template" road will increase the technical debt in the long run and make the code very hard to understand and modify. For example, when I try to work on the CUDA implementation nowadays (where this type of change has already happened), it is very difficult for me to make a change without breaking everything else. Abstracting things like this and putting them into common patterns will always lead to those kinds of problems. We have to be very confident that these kernels are the best possible way to compute before we "bind" their implementations together. Otherwise, we won't be able to iterate and improve the performance, or at the very least it will be exceedingly difficult.

At first look, having fewer LOCs might seem like a good result. But I think it's more important to consider the cognitive load when reading the code: 1000 lines of straightforward, repetitive, and easily parsable code are better than 100 lines of abstraction with 2 levels of nested indirection in them.

Regarding the upcoming experiments related to quantized matrix multiplication that I believe will further improve the performance:

This commit provides one template for all mat-vec kernels, making it easier for us to implement multiplication kernels between two quantized types in the future.

I don't think it will help. It's likely to make things more difficult - we will be trying to fit the new things into the old pattern, instead of being able to freely hack and experiment with new types of implementations that may turn out to be more effective for that purpose.

Hope this does not come across as too harsh a critique and disappoint you - your contributions have been extremely valuable and I will be happy to see more of them.
But this particular change likely won't make it to master.

GGML_METAL_DECL_KERNEL(mul_mv_q3_K_f32);
GGML_METAL_DECL_KERNEL(mul_mv_q4_K_f32);
GGML_METAL_DECL_KERNEL(mul_mv_q5_K_f32);
GGML_METAL_DECL_KERNEL(mul_mv_q6_K_f32);

This rename is OK

typedef struct {
half d; // delta
int8_t qs[QK8_0]; // quants
} block_q8_0;

Moving this here from the top of the file is OK

}

//====================================== dot products =========================

kernel void kernel_mul_mat_q2_K_f32(

I wouldn't change these kernels. Just rename them to reflect that it is a matrix x vector multiplication


template<typename block_q, short nl, void (*dequantize_func)(device const block_q *, short, thread float4x4 &)>

This type of templatization is not OK.

It does reduce the code, but that's the only good thing about it.
The signature is complicated and we can no longer search quantization strings - i.e. searching for "4_0" will skip this template. This is a problem, because very often we want to go through all "4_0" related source code and do some change or compare with other quantizations. This being a template makes the process more difficult by introducing template argument indirection.

My preference is to have a much longer and easily parsable code with few or none template argument indirections.

Comment on lines +1266 to +1268
template<typename block_q_type, int nr, int nsg, int nl, int n_shift, template<typename, typename, typename> class quant_dri>
kernel void kernel_mat_mv(device const void * src0,
device const float * src1,

This signature is not OK - I need a few minutes of thinking every time I see it to understand the meaning of the template args.

Comment on lines +727 to +730
template <typename addr_uint16_p,typename addr_block_q_p, typename type4x4>
class q4_0_driver {
public:
uint16_t mask1, mask2, q_offset;

I don't think we need high-level constructs for this type of computation code.

Having a struct with some data is fine, but all the functionality that operates on this data has to be implemented as plain functions with descriptive names.
Adding a class with methods here introduces another level of indirection - when I see a call to inner_product(), I have to go back and look up the class type. Instead, when I see for example ggml_vec_dot_q4_0_q8_0(), I immediately know what is going on.

inline float block_q_n_dot_y(device const block_q4_1 * qb_curr, float sumy, thread float * yl, int il) {
float d = qb_curr->d;
float m = qb_curr->m;
device const uint16_t * qs = ((device const uint16_t *)qb_curr + 2 + il/2);

It's good that these kinds of hardcoded pointer offsets are avoided in the new version:

Suggested change
device const uint16_t * qs = ((device const uint16_t *)qb_curr + 2 + il/2);
device const uint16_t * qs = ((device const uint16_t *)qb_curr->qs + il/2);

Comment on lines -1697 to -1703
template <typename type4x4>
void dequantize_f16(device const half4x4 * src, short il, thread type4x4 & reg) {
half4x4 temp = *(((device half4x4 *)src));
for (int i = 0; i < 16; i++){
reg[i/4][i%4] = temp[i/4][i%4];
}
}

These kinds of templates are OK - single argument, reduces code duplication, does not hide the quantization type, easy to parse.

@@ -1921,7 +1411,7 @@ kernel void kernel_get_rows(
#define SG_MAT_ROW 8

// each block_q contains 16*nl weights
template<typename block_q, short nl, void (*dequantize_func)(device const block_q *, short, thread half4x4 &)>
template<typename block_q_type, short nl, template<typename, typename, typename> class quant_dri>
kernel void kernel_mul_mm(device const uchar * src0,
device const float * src1,
device float * dst,

The body of this template has to be simplified by factoring out the code that does not depend on the quantization type into separate function calls. We should try to avoid the template by having separate inline implementations of the kernels for each quantization, reusing the common functions and calling the respective dequantize_ functions.

lshzh-ww (Contributor, Author) commented Sep 3, 2023

@ggerganov

I realize that readability is a top concern for an open source project, where collective collaboration is key. I guess I may have been a bit too fixated on reducing LOCs when I initially opened this PR. I also learned that refactoring existing code should not be undertaken without substantial justification and discussion with other people.

The primary objective of this PR is to introduce a new function, kernel_mul_mv_f16_f32, which will enable us to specialize the kernel_mul_mat_f16_f32 function for performing permuted matrix multiplications. Given that this has already been implemented in PR #2959, I don't see an immediate urgency in merging this PR. If it's acceptable, my plan is to mark this PR as a draft and later submit a series of PRs for Q_K optimizations, with the aim of minimizing dramatic changes to the entire codebase. The current Q_K kernels are already performing well, as shown by the benchmark results on your M2 Ultra; it's just that the M1's memory controller needs some extra help.

ggerganov (Member)

The primary objective of this PR is to introduce a new function, kernel_mul_mv_f16_f32, which will enable us to specialize the kernel_mul_mat_f16_f32 function for performing permuted matrix multiplications.

Yes, I think your insight about optimizing the F16 ops in the red rectangles in the plot above was instrumental for the PP speedup.

If it's acceptable, my plan is to mark this PR as draft and later submit a series of PRs for Q_K optimizations, with the aim of minimizing dramatic changes to the entire codebase.

Yes, no problem with that. I plan to merge #2959 for now, but feel free to update this implementation as you wish.
Let me know if you need any performance stats for M2 as you go

lshzh-ww (Contributor, Author) commented Sep 4, 2023

Let me know if you need any performance stats for M2 as you go

I think your previous benchmarks are sufficient. Thank you for your help again!

@ggerganov ggerganov removed the high priority Very important issue label Sep 14, 2023
Labels: performance (Speed related topics)
3 participants