metal: template for mat-vec multiplication kernels #2891
Conversation
(branch force-pushed from a404dc3 to 507ba74)
(branch force-pushed from 507ba74 to bca5d0c)
M2 Ultra
Strange that I don't observe the big … Here are the max threads per kernel:

Not sure how these max threads can be utilized. I think they somehow indicate how "parallel" the kernel is. Or maybe how many registers it uses.
@ggerganov The maximum number of threads indicates the register pressure. We should avoid using too many of them, but this depends on the workload. I am bothered by the fact that we don't achieve double the inference speed on Ultra chips. For llama 34B Q4_0, the M1 Max inference speed is ~16 tok/s. One would expect the M2 Ultra to reach ~32 tok/s, but in reality we only get ~27 tok/s. Would you mind profiling this on the M2 Ultra? We can directly profile a built binary, so there's no need to configure llama.cpp as an Xcode project.
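(For context, the ~2x expectation comes from single-token generation being memory-bandwidth bound: each generated token has to stream roughly the whole model from memory. Below is a minimal sketch of that arithmetic; the bandwidth and model-size figures are rough assumptions for illustration, not numbers from this thread.)

```c
#include <stdio.h>

int main(void) {
    /* all figures below are rough assumptions for illustration */
    const double model_bytes = 18.0e9;  /* ~18 GB for a 34B Q4_0 model (assumed)     */
    const double bw_m1_max   = 400e9;   /* nominal M1 Max memory bandwidth, bytes/s  */
    const double bw_m2_ultra = 800e9;   /* nominal M2 Ultra memory bandwidth         */

    /* mat-vec generation is bandwidth bound: tok/s is capped by bandwidth / model size */
    printf("M1 Max   upper bound: ~%.0f tok/s\n", bw_m1_max   / model_bytes);
    printf("M2 Ultra upper bound: ~%.0f tok/s\n", bw_m2_ultra / model_bytes);

    /* doubling the bandwidth should roughly double tok/s, hence the ~16 -> ~32 tok/s expectation */
    return 0;
}
```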
I can try. How do I profile the binary?
(branch force-pushed from 5edbba1 to aa4b7d2)
(branch force-pushed from 927b13c to 35cd10c)
With the new commits, we should expect to see approximately zero performance loss on the M2 series for Q4_K and other non-k quantizations. If …
Have you noticed how matrix multiplication performance, after reaching its peak at about pp = 64, goes down with increasing context size? This is the case on this branch and on master. Normally we would expect matrix multiplication performance to increase initially and then saturate once the compute capability limit has been reached. Here are some sample results on a 30-core M2 Max:
@ikawrakow A quick check you can perform is simply adding a …

@ggerganov …
@lshzh-ww Not sure I understand your reasoning. The model matrices are of a fixed size. The intermediate result matrices (or activations, as they call them) grow with prompt size, and we observe performance degradation. I.e., it is exactly the other way around: performance is good for relatively small matrices (…
These are the calculations we need to conduct for one layer when evaluating 512 tokens. Currently, we use the optimized … When I mentioned 'small matrices', I meant matrices smaller than the model matrices. Regarding the …
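(The figure with the per-layer calculations referenced above is not reproduced here. As a rough stand-in, the sketch below lists the matrix multiplications in one layer when evaluating 512 tokens, assuming LLaMA-7B-like dimensions — n_embd = 4096, n_head = 32, head_dim = 128, n_ff = 11008. These dimensions are assumptions for illustration, not numbers taken from the thread. The weight ("model") matrices are fixed, while the attention-score and attention-output products involve matrices whose sizes depend on the number of tokens.)

```c
#include <stdio.h>

int main(void) {
    /* assumed LLaMA-7B-like dimensions, for illustration only */
    const int n_tokens = 512, n_embd = 4096, n_head = 32, head_dim = 128, n_ff = 11008;

    /* fixed-size weight matrices times activation matrices that grow with the batch */
    printf("Q/K/V/O projections: 4 x ([%d x %d] * [%d x %d])\n",
           n_embd, n_embd, n_embd, n_tokens);
    printf("FFN gate/up:         2 x ([%d x %d] * [%d x %d])\n",
           n_ff, n_embd, n_embd, n_tokens);
    printf("FFN down:            1 x ([%d x %d] * [%d x %d])\n",
           n_embd, n_ff, n_ff, n_tokens);

    /* attention products: both operands depend on the number of tokens */
    printf("scores K^T*Q:       %d x ([%d x %d] * [%d x %d])\n",
           n_head, n_tokens, head_dim, head_dim, n_tokens);
    printf("output V*scores:    %d x ([%d x %d] * [%d x %d])\n",
           n_head, head_dim, n_tokens, n_tokens, n_tokens);
    return 0;
}
```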
With the following patch applied on this branch, I get:
diff --git a/ggml-metal.m b/ggml-metal.m
index fc656fb..47427f1 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -672,6 +672,8 @@ void ggml_metal_graph_compute(
                 //        dst->name);
                 //}
+                if (dst->op != GGML_OP_MUL_MAT) continue;
+
                 switch (dst->op) {
                     case GGML_OP_NONE:
                     case GGML_OP_RESHAPE:
@@ -810,6 +812,7 @@ void ggml_metal_graph_compute(
                     [ctx->device supportsFamily:MTLGPUFamilyApple7] &&
                     ne00%32 == 0 &&
                     ne11 > 1) {
+                    continue;
                     switch (src0->type) {
                         case GGML_TYPE_F16:  [encoder setComputePipelineState:ctx->pipeline_mul_mm_f16_f32];  break;
                         case GGML_TYPE_Q4_0: [encoder setComputePipelineState:ctx->pipeline_mul_mm_q4_0_f32]; break;
@@ -870,6 +873,7 @@ void ggml_metal_graph_compute(
                     }
                     [encoder dispatchThreadgroups:MTLSizeMake((ne01 + 7)/8, ne11, ne12) threadsPerThreadgroup:MTLSizeMake(64, 1, 1)];
                 } else {
+                    continue;
                     switch (src0->type) {
                         case GGML_TYPE_F16: [encoder setComputePipelineState:ctx->pipeline_mul_mat_f16_f32]; break;
                         default: GGML_ASSERT(false && "not implemented");

If I disable all ops in Metal to measure just the overhead of non-Metal stuff, I get:

diff --git a/ggml-metal.m b/ggml-metal.m
index fc656fb..ee2da39 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -672,6 +672,8 @@ void ggml_metal_graph_compute(
                 //        dst->name);
                 //}
+                continue;

If I remove this overhead from the previous results, I get the following BW estimate:
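(The BW estimate table itself is not reproduced here. For reference, below is a minimal sketch of how such an estimate can be derived from the timings; the numbers are placeholders, not the measurements from this thread.)

```c
#include <stdio.h>

int main(void) {
    /* placeholders for illustration, not the measurements from this thread */
    const double model_bytes = 17.0e9;  /* quantized weights streamed once per token  */
    const double t_total     = 0.030;   /* seconds per token with all ops enabled      */
    const double t_overhead  = 0.004;   /* seconds per token of non-Metal overhead     */

    /* effective bandwidth ~ bytes streamed / time actually spent in the mat-vec kernels */
    const double bw = model_bytes / (t_total - t_overhead);
    printf("estimated effective bandwidth: ~%.0f GB/s\n", bw / 1e9);
    return 0;
}
```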
New numbers after merge:
@ggerganov Thank you for conducting the test! It appears that the M2 Ultra's bandwidth utilization increases as the model size grows. Perhaps the 17 GB model is simply too 'tiny' to fully engage it. Good to know this.
Do you have any idea how to make my Instruments also show the measured bandwidth like in your Xcode?
I am sorry, I really have no idea why it didn't work. I'm currently using the Sonoma Beta in conjunction with Xcode 15, although I'm uncertain if this might be the cause of the issue.
I compared this branch (PR) with #2959 (IK) on M2 Ultra after rebasing both on latest master.
However, I'm looking at the new implementation and I am not really happy with the heavy templateization of the code. I would like to take some more time to review the code and post some more specific comments about what I find concerning.
These are some general comments about what I like and don't like in this implementation. I know they may sound opinionated and it's fine to disagree.
I'm just worried that going down the "template" road will increase the technical debt in the long run and make the code very hard to understand and modify. For example, when I try to work on the CUDA implementation nowadays (where this type of change has already happened), it is very difficult for me to make a change without breaking everything else. Abstracting things like this and putting them into common patterns will always lead to those kinds of problems. We have to be very confident that these kernels are the best possible way to compute before we "bind" their implementations together. Otherwise, we won't be able to iterate and improve the performance, or at the very least it will be exceedingly difficult.
At first look, having fewer LOCs might seem like a good result. But I think it's more important to consider the cognitive load when reading the code. 1000 lines of straightforward, repetitive and easily parsable code are better than 100 lines of abstraction with 2 levels of nested indirection in it.
Regarding the upcoming experiments related to quantized matrix multiplication that I believe will further improve the performance:
This commit provides one template for all mat-vec kernels, making it easier for us to implement multiplication kernels between two quantized types in future.
I don't think it will help. It's likely to make it more difficult - we will be trying to fit the new things into the old pattern, instead of being able to freely hack and experiment with new types of implementations that could turn out to be more effective for that purpose.
Hope this does not come across as too harsh a critique and disappoint you - your contributions have been extremely valuable and I will be happy to see more of them.
But this particular change likely won't make it to master.
GGML_METAL_DECL_KERNEL(mul_mv_q3_K_f32);
GGML_METAL_DECL_KERNEL(mul_mv_q4_K_f32);
GGML_METAL_DECL_KERNEL(mul_mv_q5_K_f32);
GGML_METAL_DECL_KERNEL(mul_mv_q6_K_f32);
This rename is OK
typedef struct {
    half   d;          // delta
    int8_t qs[QK8_0];  // quants
} block_q8_0;
Moving this here from the top of the file is OK
}

//====================================== dot products =========================

kernel void kernel_mul_mat_q2_K_f32(
I wouldn't change these kernels. Just rename them to reflect that it is a matrix x vector multiplication
template<typename block_q, short nl, void (*dequantize_func)(device const block_q *, short, thread float4x4 &)>
This type of templateization is not OK
It does reduce the code, but that's the only good thing about it.
The signature is complicated and we can no longer search for quantization strings - i.e. searching for "4_0" will skip this template. This is a problem, because very often we want to go through all "4_0"-related source code and make some change or compare with other quantizations. This being a template makes the process more difficult by introducing template argument indirection.
My preference is to have a much longer and easily parsable code with few or none template argument indirections.
template<typename block_q_type, int nr, int nsg, int nl, int n_shift, template<typename, typename, typename> class quant_dri>
kernel void kernel_mat_mv(device const void  * src0,
                          device const float * src1,
This signature is not OK - I need a few minutes of thinking every time I see it to understand the meaning of the template args
template <typename addr_uint16_p, typename addr_block_q_p, typename type4x4>
class q4_0_driver {
public:
    uint16_t mask1, mask2, q_offset;
I don't think we need high-level constructs for this type of computation code.
Having a struct with some data is fine, but all the functionality that operates on this data has to be implemented as plain functions with descriptive names.
Adding a class with methods here introduces another level of indirection - when I see a call to inner_product(), I have to go back and look at the type of the class. Instead, when I see for example ggml_vec_dot_q4_0_q8_0(), I immediately know what is going on.
inline float block_q_n_dot_y(device const block_q4_1 * qb_curr, float sumy, thread float * yl, int il) {
    float d = qb_curr->d;
    float m = qb_curr->m;
    device const uint16_t * qs = ((device const uint16_t *)qb_curr + 2 + il/2);
It's good that these kinds of hardcoded pointer offsets are avoided in the new version:

-device const uint16_t * qs = ((device const uint16_t *)qb_curr + 2 + il/2);
+device const uint16_t * qs = ((device const uint16_t *)qb_curr->qs + il/2);
template <typename type4x4>
void dequantize_f16(device const half4x4 * src, short il, thread type4x4 & reg) {
    half4x4 temp = *(((device half4x4 *)src));
    for (int i = 0; i < 16; i++){
        reg[i/4][i%4] = temp[i/4][i%4];
    }
}
This kind of template is OK - single argument, reduces code duplication, does not hide the quantization type, easy to parse
@@ -1921,7 +1411,7 @@ kernel void kernel_get_rows(
 #define SG_MAT_ROW 8

 // each block_q contains 16*nl weights
-template<typename block_q, short nl, void (*dequantize_func)(device const block_q *, short, thread half4x4 &)>
+template<typename block_q_type, short nl, template<typename, typename, typename> class quant_dri>
 kernel void kernel_mul_mm(device const uchar * src0,
                           device const float * src1,
                           device       float * dst,
The body of this template has to be simplified by factoring out the code that does not depend on the quantization type into separate function calls. We should try to avoid the template by having separate inline implementations of the kernels for each quantization, reusing the common functions and calling the respective dequantize_ functions.
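(A miniature, self-contained C illustration of that structure - not the actual Metal kernels, and all names and block layouts here are simplified stand-ins: each quantization keeps its own explicit, searchable entry point, while the quantization-independent work and the per-type dequantize helpers are plain functions.)

```c
#include <stdio.h>
#include <stdint.h>

#define QK_TOY 4  /* toy block size, not the real QK4_0 */

typedef struct { float d;          uint8_t qs[QK_TOY]; } block_q4_0_toy;
typedef struct { float d; float m; uint8_t qs[QK_TOY]; } block_q4_1_toy;

/* quantization-specific helpers, analogous to the dequantize_* functions */
static void dequantize_q4_0_toy(const block_q4_0_toy * b, float * out) {
    for (int i = 0; i < QK_TOY; ++i) out[i] = (b->qs[i] - 8) * b->d;
}
static void dequantize_q4_1_toy(const block_q4_1_toy * b, float * out) {
    for (int i = 0; i < QK_TOY; ++i) out[i] = b->qs[i] * b->d + b->m;
}

/* quantization-independent work factored out as a plain function */
static float dot_toy(const float * x, const float * y) {
    float acc = 0.0f;
    for (int i = 0; i < QK_TOY; ++i) acc += x[i] * y[i];
    return acc;
}

/* explicit per-quantization "kernels": grepping for "4_0" or "4_1" finds them directly */
static float mul_mv_q4_0_toy(const block_q4_0_toy * b, const float * y) {
    float tmp[QK_TOY];
    dequantize_q4_0_toy(b, tmp);
    return dot_toy(tmp, y);
}
static float mul_mv_q4_1_toy(const block_q4_1_toy * b, const float * y) {
    float tmp[QK_TOY];
    dequantize_q4_1_toy(b, tmp);
    return dot_toy(tmp, y);
}

int main(void) {
    const block_q4_0_toy b0 = { 0.5f,       { 8, 9, 10, 7 } };
    const block_q4_1_toy b1 = { 0.5f, 1.0f, { 0, 1,  2, 3 } };
    const float y[QK_TOY] = { 1, 1, 1, 1 };
    printf("%f %f\n", mul_mv_q4_0_toy(&b0, y), mul_mv_q4_1_toy(&b1, y));
    return 0;
}
```

In the real kernels the shared pieces would be the threadgroup indexing and reduction code rather than a plain dot product, but the layout - one explicit kernel per quantization type reusing common helpers - is the same.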
I realized that readability is a top concern for an open-source project, where collective collaboration is key. I guess I may have been a bit too fixated on reducing LOCs when I initially opened this PR. I also learned that refactoring existing code should not be undertaken without substantial justification and conversation with other people. The primary objective of this PR is to introduce a new function, …
Yes, I think your insight about optimizing the F16 ops in the red rectangles in the plot above was instrumental for the PP speedup.
Yes, no problem with that. I plan to merge #2959 for now, but feel free to update this implementation as you wish.
I think your previous benchmarks are sufficient. Thank you for your help again!
This commit provides one template for all mat-vec kernels, making it easier for us to implement multiplication kernels between two quantized types in future.
Also, this speeds up Q_K and F16 inference a little bit.

M1 Max 32c:

QK_K=256 was tested on llama models, and QK_K=64 was tested on openllama models. People are welcome to test this pull request.