@GermanAizek (Contributor) commented on Dec 1, 2025

This simple change has greatly affected vec_dot_q. In my tests the result fluctuates strongly between runs: sometimes it is almost 2-4 times higher, and sometimes 2-3 times worse.

Full Benchmark

devuan@devuan:~/GIT/llama.cpp/cmake-build-release/bin$ ./test-quantize-perf > opt.txt
devuan@devuan:~/GIT/llama.cpp/cmake-build-release/bin$ ./test-quantize-perf > master.txt
devuan@devuan:~/GIT/llama.cpp/cmake-build-release/bin$ diff -u master.txt opt.txt | colordiff
--- master.txt  2025-12-01 12:14:36.491798486 +0300
+++ opt.txt     2025-12-01 12:13:58.491799972 +0300
@@ -1,107 +1,107 @@
 f16
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :    294.47
-      avg cycles/32 vals   :    319.96
-      float32 throughput   :      0.85 GB/s
-      quantized throughput :      0.43 GB/s
+      min cycles/32 vals   :    188.25
+      avg cycles/32 vals   :    188.48
+      float32 throughput   :      1.44 GB/s
+      quantized throughput :      0.72 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      4.67
-      avg cycles/32 vals   :     12.89
-      float32 throughput   :      8.03 GB/s
-      quantized throughput :      4.02 GB/s
+      min cycles/32 vals   :      4.55
+      avg cycles/32 vals   :     25.29
+      float32 throughput   :     10.17 GB/s
+      quantized throughput :      5.09 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     85.10
-      avg cycles/32 vals   :     86.52
-      float32 throughput   :      3.11 GB/s
-      quantized throughput :      1.56 GB/s
+      min cycles/32 vals   :     84.80
+      avg cycles/32 vals   :     85.95
+      float32 throughput   :      3.18 GB/s
+      quantized throughput :      1.59 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :      4.55
-      avg cycles/32 vals   :      4.70
+      min cycles/32 vals   :      4.54
+      avg cycles/32 vals   :      4.67
       float32 throughput   :     50.86 GB/s
       quantized throughput :     25.43 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
       min cycles/32 vals   :      5.30
-      avg cycles/32 vals   :      5.40
-      float32 throughput   :     50.86 GB/s
-      quantized throughput :     25.43 GB/s
+      avg cycles/32 vals   :      5.38
+      float32 throughput   :     38.15 GB/s
+      quantized throughput :     19.07 GB/s
 
 q4_0
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :    128.62
-      avg cycles/32 vals   :    128.80
+      min cycles/32 vals   :    128.58
+      avg cycles/32 vals   :    128.76
       float32 throughput   :      2.12 GB/s
       quantized throughput :      0.30 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :    128.67
-      avg cycles/32 vals   :    128.78
+      min cycles/32 vals   :    128.66
+      avg cycles/32 vals   :    128.79
       float32 throughput   :      2.12 GB/s
       quantized throughput :      0.30 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     15.22
+      min cycles/32 vals   :     15.20
       avg cycles/32 vals   :     15.32
       float32 throughput   :     16.95 GB/s
       quantized throughput :      2.38 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :     29.12
-      avg cycles/32 vals   :     29.27
+      min cycles/32 vals   :     29.11
+      avg cycles/32 vals   :     29.33
       float32 throughput   :      8.98 GB/s
       quantized throughput :      1.26 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      4.52
-      avg cycles/32 vals   :      4.72
+      min cycles/32 vals   :      4.54
+      avg cycles/32 vals   :      4.67
       float32 throughput   :     50.86 GB/s
       quantized throughput :      7.15 GB/s
 
 q4_1
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :    109.82
-      avg cycles/32 vals   :    109.99
+      min cycles/32 vals   :    110.03
+      avg cycles/32 vals   :    110.14
       float32 throughput   :      2.46 GB/s
       quantized throughput :      0.38 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :    109.97
-      avg cycles/32 vals   :    110.22
+      min cycles/32 vals   :    110.03
+      avg cycles/32 vals   :    110.16
       float32 throughput   :      2.46 GB/s
       quantized throughput :      0.38 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     16.70
-      avg cycles/32 vals   :     17.95
-      float32 throughput   :     15.26 GB/s
-      quantized throughput :      2.38 GB/s
+      min cycles/32 vals   :     16.71
+      avg cycles/32 vals   :     18.11
+      float32 throughput   :     13.87 GB/s
+      quantized throughput :      2.17 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :     51.41
-      avg cycles/32 vals   :     51.86
+      min cycles/32 vals   :     51.55
+      avg cycles/32 vals   :     51.92
       float32 throughput   :      5.09 GB/s
       quantized throughput :      0.79 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      4.69
+      min cycles/32 vals   :      4.67
       avg cycles/32 vals   :      4.80
       float32 throughput   :     50.86 GB/s
       quantized throughput :      7.95 GB/s
@@ -109,103 +109,103 @@
 q5_0
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :    240.55
-      avg cycles/32 vals   :    240.88
+      min cycles/32 vals   :    240.52
+      avg cycles/32 vals   :    240.76
       float32 throughput   :      1.13 GB/s
       quantized throughput :      0.19 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :    240.45
-      avg cycles/32 vals   :    240.78
+      min cycles/32 vals   :    240.65
+      avg cycles/32 vals   :    240.89
       float32 throughput   :      1.13 GB/s
       quantized throughput :      0.19 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     96.15
-      avg cycles/32 vals   :     96.28
+      min cycles/32 vals   :     96.21
+      avg cycles/32 vals   :     96.32
       float32 throughput   :      2.83 GB/s
       quantized throughput :      0.49 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
       min cycles/32 vals   :     29.11
-      avg cycles/32 vals   :     29.27
-      float32 throughput   :      8.98 GB/s
-      quantized throughput :      1.54 GB/s
+      avg cycles/32 vals   :     29.30
+      float32 throughput   :     10.17 GB/s
+      quantized throughput :      1.75 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      6.12
-      avg cycles/32 vals   :      6.27
-      float32 throughput   :     38.15 GB/s
-      quantized throughput :      6.56 GB/s
+      min cycles/32 vals   :      6.14
+      avg cycles/32 vals   :      6.21
+      float32 throughput   :     50.86 GB/s
+      quantized throughput :      8.74 GB/s
 
 q5_1
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :    205.94
-      avg cycles/32 vals   :    206.23
-      float32 throughput   :      1.32 GB/s
+      min cycles/32 vals   :    205.76
+      avg cycles/32 vals   :    206.57
+      float32 throughput   :      1.33 GB/s
       quantized throughput :      0.25 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :    206.43
-      avg cycles/32 vals   :    206.90
+      min cycles/32 vals   :    206.59
+      avg cycles/32 vals   :    206.78
       float32 throughput   :      1.32 GB/s
       quantized throughput :      0.25 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     99.19
-      avg cycles/32 vals   :     99.37
+      min cycles/32 vals   :     98.55
+      avg cycles/32 vals   :     98.70
       float32 throughput   :      2.72 GB/s
       quantized throughput :      0.51 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :     51.09
-      avg cycles/32 vals   :     51.68
-      float32 throughput   :      5.26 GB/s
-      quantized throughput :      0.99 GB/s
+      min cycles/32 vals   :     51.59
+      avg cycles/32 vals   :     52.24
+      float32 throughput   :      5.09 GB/s
+      quantized throughput :      0.95 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      6.03
-      avg cycles/32 vals   :      6.09
+      min cycles/32 vals   :      6.02
+      avg cycles/32 vals   :      6.20
       float32 throughput   :     38.15 GB/s
       quantized throughput :      7.15 GB/s
 
 q8_0
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :    303.78
-      avg cycles/32 vals   :    304.45
-      float32 throughput   :      0.90 GB/s
+      min cycles/32 vals   :    303.77
+      avg cycles/32 vals   :    304.66
+      float32 throughput   :      0.89 GB/s
       quantized throughput :      0.24 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     29.29
-      avg cycles/32 vals   :     29.34
-      float32 throughput   :      9.54 GB/s
-      quantized throughput :      2.53 GB/s
+      min cycles/32 vals   :     29.27
+      avg cycles/32 vals   :     29.39
+      float32 throughput   :      8.98 GB/s
+      quantized throughput :      2.38 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     14.29
-      avg cycles/32 vals   :     14.37
+      min cycles/32 vals   :     14.24
+      avg cycles/32 vals   :     14.32
       float32 throughput   :     16.95 GB/s
       quantized throughput :      4.50 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :     29.14
-      avg cycles/32 vals   :     29.30
-      float32 throughput   :      9.54 GB/s
-      quantized throughput :      2.53 GB/s
+      min cycles/32 vals   :     29.11
+      avg cycles/32 vals   :     29.36
+      float32 throughput   :      8.98 GB/s
+      quantized throughput :      2.38 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
@@ -217,396 +217,396 @@
 q2_K
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :   4242.51
-      avg cycles/32 vals   :   4257.87
+      min cycles/32 vals   :   4241.52
+      avg cycles/32 vals   :   4291.42
       float32 throughput   :      0.06 GB/s
       quantized throughput :      0.01 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :   4240.97
-      avg cycles/32 vals   :   4320.46
+      min cycles/32 vals   :   4239.26
+      avg cycles/32 vals   :   4364.00
       float32 throughput   :      0.06 GB/s
       quantized throughput :      0.01 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
       min cycles/32 vals   :     80.86
-      avg cycles/32 vals   :     80.97
+      avg cycles/32 vals   :     81.07
       float32 throughput   :      3.32 GB/s
       quantized throughput :      0.27 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :    126.68
-      avg cycles/32 vals   :    127.06
+      min cycles/32 vals   :    126.38
+      avg cycles/32 vals   :    126.94
       float32 throughput   :      2.15 GB/s
       quantized throughput :      0.18 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      2.70
-      avg cycles/32 vals   :      2.85
-      float32 throughput   :     76.29 GB/s
-      quantized throughput :      6.26 GB/s
+      min cycles/32 vals   :      8.34
+      avg cycles/32 vals   :      8.63
+      float32 throughput   :     25.43 GB/s
+      quantized throughput :      2.09 GB/s
 
 q3_K
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :    463.23
-      avg cycles/32 vals   :    465.04
-      float32 throughput   :      0.59 GB/s
+      min cycles/32 vals   :    464.61
+      avg cycles/32 vals   :    466.17
+      float32 throughput   :      0.58 GB/s
       quantized throughput :      0.06 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :    463.86
-      avg cycles/32 vals   :    464.66
+      min cycles/32 vals   :    464.12
+      avg cycles/32 vals   :    465.35
       float32 throughput   :      0.59 GB/s
       quantized throughput :      0.06 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     23.77
-      avg cycles/32 vals   :     23.91
+      min cycles/32 vals   :     23.75
+      avg cycles/32 vals   :     23.86
       float32 throughput   :     11.74 GB/s
       quantized throughput :      1.26 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :    126.68
+      min cycles/32 vals   :    126.65
       avg cycles/32 vals   :    126.96
-      float32 throughput   :      2.18 GB/s
+      float32 throughput   :      2.15 GB/s
       quantized throughput :      0.23 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      8.07
+      min cycles/32 vals   :      7.97
       avg cycles/32 vals   :      8.13
-      float32 throughput   :     30.52 GB/s
-      quantized throughput :      3.28 GB/s
+      float32 throughput   :     76.29 GB/s
+      quantized throughput :      8.20 GB/s
 
 q4_K
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :   5065.16
-      avg cycles/32 vals   :   5100.72
+      min cycles/32 vals   :   5060.34
+      avg cycles/32 vals   :   5105.36
       float32 throughput   :      0.05 GB/s
       quantized throughput :      0.01 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :   5065.95
-      avg cycles/32 vals   :   5067.42
+      min cycles/32 vals   :   5058.95
+      avg cycles/32 vals   :   5071.13
       float32 throughput   :      0.05 GB/s
       quantized throughput :      0.01 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     15.34
-      avg cycles/32 vals   :     15.45
+      min cycles/32 vals   :     15.32
+      avg cycles/32 vals   :     15.41
       float32 throughput   :     16.95 GB/s
       quantized throughput :      2.38 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :    126.81
-      avg cycles/32 vals   :    127.04
+      min cycles/32 vals   :    126.63
+      avg cycles/32 vals   :    127.05
       float32 throughput   :      2.15 GB/s
       quantized throughput :      0.30 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      4.13
-      avg cycles/32 vals   :     25.95
-      float32 throughput   :     10.17 GB/s
-      quantized throughput :      1.43 GB/s
+      min cycles/32 vals   :      4.30
+      avg cycles/32 vals   :     24.27
+      float32 throughput   :     10.90 GB/s
+      quantized throughput :      1.53 GB/s
 
 q5_K
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :   3977.85
-      avg cycles/32 vals   :   3989.26
+      min cycles/32 vals   :   3974.84
+      avg cycles/32 vals   :   4029.74
       float32 throughput   :      0.07 GB/s
       quantized throughput :      0.01 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :   3979.48
-      avg cycles/32 vals   :   4016.21
+      min cycles/32 vals   :   3972.98
+      avg cycles/32 vals   :   4145.78
       float32 throughput   :      0.07 GB/s
       quantized throughput :      0.01 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     20.53
-      avg cycles/32 vals   :     20.78
+      min cycles/32 vals   :     20.66
+      avg cycles/32 vals   :     20.86
       float32 throughput   :     12.72 GB/s
       quantized throughput :      2.19 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :    126.57
-      avg cycles/32 vals   :    185.54
-      float32 throughput   :      1.47 GB/s
-      quantized throughput :      0.25 GB/s
+      min cycles/32 vals   :    126.77
+      avg cycles/32 vals   :    127.22
+      float32 throughput   :      2.18 GB/s
+      quantized throughput :      0.37 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      7.17
-      avg cycles/32 vals   :     26.32
+      min cycles/32 vals   :      7.12
+      avg cycles/32 vals   :     26.74
       float32 throughput   :     10.17 GB/s
       quantized throughput :      1.75 GB/s
 
 q6_K
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :   2758.61
-      avg cycles/32 vals   :   2784.79
+      min cycles/32 vals   :   2762.16
+      avg cycles/32 vals   :   2792.98
       float32 throughput   :      0.10 GB/s
       quantized throughput :      0.02 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :   2763.22
-      avg cycles/32 vals   :   2774.24
+      min cycles/32 vals   :   2763.62
+      avg cycles/32 vals   :   2799.36
       float32 throughput   :      0.10 GB/s
       quantized throughput :      0.02 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :    103.45
-      avg cycles/32 vals   :    103.57
-      float32 throughput   :      2.63 GB/s
-      quantized throughput :      0.54 GB/s
+      min cycles/32 vals   :    103.27
+      avg cycles/32 vals   :    103.51
+      float32 throughput   :      2.59 GB/s
+      quantized throughput :      0.53 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :    126.73
-      avg cycles/32 vals   :    126.87
+      min cycles/32 vals   :    126.66
+      avg cycles/32 vals   :    127.07
       float32 throughput   :      2.15 GB/s
       quantized throughput :      0.44 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      7.79
-      avg cycles/32 vals   :     27.25
-      float32 throughput   :     10.17 GB/s
-      quantized throughput :      2.09 GB/s
+      min cycles/32 vals   :      7.74
+      avg cycles/32 vals   :      7.93
+      float32 throughput   :     38.15 GB/s
+      quantized throughput :      7.82 GB/s
 
 iq4_nl
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :   1559.49
-      avg cycles/32 vals   :   1579.94
+      min cycles/32 vals   :   1566.73
+      avg cycles/32 vals   :   1568.58
       float32 throughput   :      0.17 GB/s
       quantized throughput :      0.02 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :   1559.31
-      avg cycles/32 vals   :   1569.49
-      float32 throughput   :      0.17 GB/s
+      min cycles/32 vals   :   1566.66
+      avg cycles/32 vals   :   1704.78
+      float32 throughput   :      0.16 GB/s
       quantized throughput :      0.02 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :    114.20
-      avg cycles/32 vals   :    114.29
+      min cycles/32 vals   :    114.15
+      avg cycles/32 vals   :    114.23
       float32 throughput   :      2.38 GB/s
       quantized throughput :      0.34 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :     29.24
-      avg cycles/32 vals   :     29.32
-      float32 throughput   :      8.98 GB/s
-      quantized throughput :      1.26 GB/s
+      min cycles/32 vals   :     29.11
+      avg cycles/32 vals   :     29.30
+      float32 throughput   :      9.54 GB/s
+      quantized throughput :      1.34 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
       min cycles/32 vals   :      4.90
-      avg cycles/32 vals   :      4.99
-      float32 throughput   :     50.86 GB/s
-      quantized throughput :      7.15 GB/s
+      avg cycles/32 vals   :      5.00
+      float32 throughput   :     76.29 GB/s
+      quantized throughput :     10.73 GB/s
 
 iq4_xs
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :  23011.23
-      avg cycles/32 vals   :  23112.32
+      min cycles/32 vals   :  23020.66
+      avg cycles/32 vals   :  23417.59
       float32 throughput   :      0.01 GB/s
       quantized throughput :      0.00 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :  22991.27
-      avg cycles/32 vals   :  23075.28
+      min cycles/32 vals   :  23020.19
+      avg cycles/32 vals   :  23368.66
       float32 throughput   :      0.01 GB/s
       quantized throughput :      0.00 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     87.52
-      avg cycles/32 vals   :     87.80
-      float32 throughput   :      3.11 GB/s
+      min cycles/32 vals   :     87.66
+      avg cycles/32 vals   :     87.82
+      float32 throughput   :      3.05 GB/s
       quantized throughput :      0.41 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :    126.62
-      avg cycles/32 vals   :    126.98
-      float32 throughput   :      2.15 GB/s
-      quantized throughput :      0.29 GB/s
+      min cycles/32 vals   :    126.68
+      avg cycles/32 vals   :    127.04
+      float32 throughput   :      2.12 GB/s
+      quantized throughput :      0.28 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
       min cycles/32 vals   :      9.73
-      avg cycles/32 vals   :      9.81
+      avg cycles/32 vals   :      9.79
       float32 throughput   :     25.43 GB/s
       quantized throughput :      3.38 GB/s
 
 bf16
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :     38.62
-      avg cycles/32 vals   :     38.65
-      float32 throughput   :      6.94 GB/s
-      quantized throughput :      3.47 GB/s
+      min cycles/32 vals   :     38.68
+      avg cycles/32 vals   :     38.74
+      float32 throughput   :      7.27 GB/s
+      quantized throughput :      3.63 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     20.81
-      avg cycles/32 vals   :     20.91
+      min cycles/32 vals   :     20.88
+      avg cycles/32 vals   :     20.98
       float32 throughput   :     12.72 GB/s
       quantized throughput :      6.36 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      7.09
-      avg cycles/32 vals   :      7.19
+      min cycles/32 vals   :      7.10
+      avg cycles/32 vals   :      7.22
       float32 throughput   :     38.15 GB/s
       quantized throughput :     19.07 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :     20.81
-      avg cycles/32 vals   :     20.93
+      min cycles/32 vals   :     20.88
+      avg cycles/32 vals   :     21.04
       float32 throughput   :     12.72 GB/s
       quantized throughput :      6.36 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
       min cycles/32 vals   :      5.30
-      avg cycles/32 vals   :      5.37
-      float32 throughput   :     76.29 GB/s
-      quantized throughput :     38.15 GB/s
+      avg cycles/32 vals   :      5.46
+      float32 throughput   :     38.15 GB/s
+      quantized throughput :     19.07 GB/s
 
 tq1_0
   quantize_row_q_reference
     4096 values (0.02 MB)
       min cycles/32 vals   :    262.05
-      avg cycles/32 vals   :    262.32
+      avg cycles/32 vals   :    262.42
       float32 throughput   :      1.04 GB/s
       quantized throughput :      0.05 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :    262.02
-      avg cycles/32 vals   :    262.33
-      float32 throughput   :      1.04 GB/s
-      quantized throughput :      0.05 GB/s
+      min cycles/32 vals   :    262.05
+      avg cycles/32 vals   :    262.40
+      float32 throughput   :      1.05 GB/s
+      quantized throughput :      0.06 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     18.09
-      avg cycles/32 vals   :     18.29
+      min cycles/32 vals   :     17.98
+      avg cycles/32 vals   :     18.31
       float32 throughput   :     15.26 GB/s
       quantized throughput :      0.80 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :    126.68
-      avg cycles/32 vals   :    126.96
-      float32 throughput   :      2.18 GB/s
+      min cycles/32 vals   :    126.83
+      avg cycles/32 vals   :    126.98
+      float32 throughput   :      2.15 GB/s
       quantized throughput :      0.11 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      3.35
-      avg cycles/32 vals   :      3.43
+      min cycles/32 vals   :      3.30
+      avg cycles/32 vals   :      3.41
       float32 throughput   :     76.29 GB/s
       quantized throughput :      4.02 GB/s
 
 tq2_0
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :    264.61
-      avg cycles/32 vals   :    265.22
-      float32 throughput   :      1.04 GB/s
+      min cycles/32 vals   :    264.71
+      avg cycles/32 vals   :    265.10
+      float32 throughput   :      1.02 GB/s
       quantized throughput :      0.07 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :    264.83
-      avg cycles/32 vals   :    265.20
-      float32 throughput   :      1.02 GB/s
+      min cycles/32 vals   :    264.58
+      avg cycles/32 vals   :    265.10
+      float32 throughput   :      1.04 GB/s
       quantized throughput :      0.07 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     13.64
-      avg cycles/32 vals   :     13.89
+      min cycles/32 vals   :     13.66
+      avg cycles/32 vals   :     13.87
       float32 throughput   :     19.07 GB/s
       quantized throughput :      1.23 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :    126.66
-      avg cycles/32 vals   :    126.84
-      float32 throughput   :      2.15 GB/s
+      min cycles/32 vals   :    126.86
+      avg cycles/32 vals   :    127.12
+      float32 throughput   :      2.18 GB/s
       quantized throughput :      0.14 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      1.60
-      avg cycles/32 vals   :      1.72
+      min cycles/32 vals   :      1.56
+      avg cycles/32 vals   :      1.71
       float32 throughput   :     76.29 GB/s
       quantized throughput :      4.92 GB/s
 
 mxfp4
   quantize_row_q_reference
     4096 values (0.02 MB)
-      min cycles/32 vals   :    702.07
-      avg cycles/32 vals   :    702.59
+      min cycles/32 vals   :    701.70
+      avg cycles/32 vals   :    702.39
       float32 throughput   :      0.39 GB/s
       quantized throughput :      0.05 GB/s
 
   quantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :    702.22
-      avg cycles/32 vals   :    721.19
+      min cycles/32 vals   :    701.59
+      avg cycles/32 vals   :    720.28
       float32 throughput   :      0.38 GB/s
       quantized throughput :      0.05 GB/s
 
   dequantize_row_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :     94.14
-      avg cycles/32 vals   :     94.28
+      min cycles/32 vals   :     94.19
+      avg cycles/32 vals   :     94.39
       float32 throughput   :      2.88 GB/s
       quantized throughput :      0.38 GB/s
 
   quantize_row_q_dot
     4096 values (0.02 MB)
-      min cycles/32 vals   :     29.06
-      avg cycles/32 vals   :     29.27
-      float32 throughput   :     10.90 GB/s
-      quantized throughput :      1.45 GB/s
+      min cycles/32 vals   :     29.04
+      avg cycles/32 vals   :     29.31
+      float32 throughput   :      9.54 GB/s
+      quantized throughput :      1.27 GB/s
 
   vec_dot_q
     4096 values (0.02 MB)
-      min cycles/32 vals   :      6.05
-      avg cycles/32 vals   :      6.15
+      min cycles/32 vals   :      6.09
+      avg cycles/32 vals   :      6.18
       float32 throughput   :     38.15 GB/s
       quantized throughput :      5.07 GB/s

Description

  • Loop Unrolling: The loop step is increased from 8 to 32 elements (four 8-float __m256 vectors per iteration), reducing loop overhead and exposing more instruction-level parallelism.
  • Parallel Accumulators: Instead of accumulating into a single scalar sum, four __m256 vector registers are used to accumulate partial sums in parallel. This breaks the dependency chain on a single accumulator, allowing for better pipeline utilization.
  • Fused Multiply-Add (FMA): The _mm256_fmadd_ps intrinsic is explicitly used for val * val + sum_vec, combining the multiplication and addition into a single instruction with a single rounding step, which can improve both throughput and numerical precision.
  • Broadcast mean once: The mean value is broadcast to a __m256 vector once outside the loop (mean_vec), avoiding repeated broadcasts within the loop.
  • Post-loop Reduction: The four parallel accumulators are combined and horizontally summed once, after the main loop.
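
The points above can be sketched roughly as follows. This is only an illustration of the technique, not the code from this PR: the function name `sum_sq_dev` and its signature are hypothetical, and it assumes the routine stores the centered values `x[i] - mean` into `y` and returns the sum of their squares. The AVX2/FMA path is compiled only when the compiler defines `__AVX2__` and `__FMA__` (for example with `-mavx2 -mfma`); otherwise the scalar loop handles everything.

```c
#include <assert.h>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* Hypothetical helper, not the PR's actual function:
 * writes y[i] = x[i] - mean and returns sum of y[i]^2. */
float sum_sq_dev(int n, float *y, const float *x, float mean) {
    assert(n >= 0);
    int i = 0;
    float sum = 0.0f;
#if defined(__AVX2__) && defined(__FMA__)
    /* Broadcast the mean once, outside the loop. */
    const __m256 mean_vec = _mm256_set1_ps(mean);
    /* Four independent accumulators break the dependency chain. */
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    /* Unrolled main loop: 32 floats (4 x 8) per iteration. */
    for (; i + 31 < n; i += 32) {
        __m256 v0 = _mm256_sub_ps(_mm256_loadu_ps(x + i),      mean_vec);
        __m256 v1 = _mm256_sub_ps(_mm256_loadu_ps(x + i + 8),  mean_vec);
        __m256 v2 = _mm256_sub_ps(_mm256_loadu_ps(x + i + 16), mean_vec);
        __m256 v3 = _mm256_sub_ps(_mm256_loadu_ps(x + i + 24), mean_vec);
        _mm256_storeu_ps(y + i,      v0);
        _mm256_storeu_ps(y + i + 8,  v1);
        _mm256_storeu_ps(y + i + 16, v2);
        _mm256_storeu_ps(y + i + 24, v3);
        /* acc = v * v + acc, fused into one instruction each. */
        acc0 = _mm256_fmadd_ps(v0, v0, acc0);
        acc1 = _mm256_fmadd_ps(v1, v1, acc1);
        acc2 = _mm256_fmadd_ps(v2, v2, acc2);
        acc3 = _mm256_fmadd_ps(v3, v3, acc3);
    }
    /* Post-loop reduction: combine accumulators, then sum the 8 lanes. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_movehdup_ps(s));
    sum = _mm_cvtss_f32(s);
#endif
    /* Scalar tail (and full fallback when AVX2/FMA is unavailable). */
    for (; i < n; ++i) {
        float v = x[i] - mean;
        y[i] = v;
        sum += v * v;
    }
    return sum;
}
```

Because the partial sums are added in a different order than in a single-accumulator loop, the result can differ from the previous implementation in the last few bits for general inputs, so a change like this is worth checking against the existing tests as well as the benchmark.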

Co-Authored-By: Gemini 2.5 Pro (References and description commit changes)

@GermanAizek changed the title from "vec: optimize AVX2/FMA sum-of-squares with loop unrolling" to "vec: optimize AVX2/FMA sum-of-squares with loop unrolling and FMA" on Dec 1, 2025
@github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Dec 1, 2025
@pwilkin added the "vibe-coded" label (created with heavy use of LLM assistants, requires human verification) on Dec 1, 2025