CUDA: refactor mmq, dmmv, mmvq #7716

Merged

Conversation

JohannesGaessler (Collaborator) commented Jun 3, 2024

This PR refactors the mul_mat_q kernels and, to a lesser extent, the dequantize_mul_mat_vec and mul_mat_vec_q kernels. The intent is to simplify the code in preparation for #7676. List of changes:

  • Instead of passing a set of template arguments (qk, qr, block type, some kernel, ...) to the CUDA kernels, pass the ggml_type of the buffer to the CUDA kernel and then use constexpr __device__ functions to fetch the corresponding arguments at compile time. This simplifies the use of the template and ensures that the arguments are passed consistently (without, e.g., accidentally forgetting to change one of them when copy-pasting). Passing only a ggml_type as the template argument also avoids a lot of the annoyances that come from referencing device functions in host code; I think it's greatly preferable to separate the two as much as possible. (A minimal sketch of this pattern is shown after this list.)
  • Add the missing constants for iq1_m and iq3_s (previously the CUDA code reused constants from other data types that happen to have the same values).
  • MMQ: Remove the functions for allocating tiles and instead provide functions that calculate the sizes/offsets of the tiles for a given ggml_type. The shared memory is then allocated dynamically and the pointers into it can be set generically.
  • MMQ: Move more implementation details/loops from mul_mat_q into the functions for loading tiles and calculating dot products. This way a given __CUDA_ARCH__ can more easily use tensor cores without having to touch the rest of the code.
  • MMQ: Instead of hard-coded tile sizes, precompile multiple tile sizes and dynamically choose one of them at runtime. The selection algorithm starts with the smallest tile and increases the tile size as long as this results in fewer waves (see the second sketch after this list).
  • MMQ: Add an additional argument to the CUDA kernel to (in principle) allow for permuted input matrices.
  • MMQ: Compile with extern templates for faster multi-threaded compilation.
  • MMQ: Increase the maximum batch size for which MMQ is used in the presence of tensor cores from 32 to 64.
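
To illustrate the first point, here is a minimal sketch of the pattern (simplified, not the PR's actual helpers; the Q4_0/Q8_0 constants follow the standard ggml block layout, other types are omitted):

```cpp
#include "ggml.h" // ggml_type, GGML_TYPE_*

// Per-type constants exposed as constexpr device functions keyed on the ggml_type.
// qk = values per block, qr = ratio of values to quants.
static constexpr __device__ int get_qk(ggml_type type) {
    return type == GGML_TYPE_Q4_0 ? 32 :
        type == GGML_TYPE_Q8_0 ? 32 :
        0;
}

static constexpr __device__ int get_qr(ggml_type type) {
    return type == GGML_TYPE_Q4_0 ? 2 :
        type == GGML_TYPE_Q8_0 ? 1 :
        0;
}

// A kernel template then takes only the ggml_type and derives everything else from it,
// so the individual constants cannot get out of sync between call sites.
template <ggml_type type>
static __global__ void example_mul_mat_vec(const void * vx, const float * y, float * dst, const int ncols) {
    constexpr int qk = get_qk(type); // resolved at compile time
    constexpr int qr = get_qr(type);
    // ... the actual implementation would use qk/qr together with vx, y, dst, ncols ...
}
```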
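
And a rough sketch of the runtime tile size selection described above; the helper name and the wave estimate are simplifications (among other things, occupancy and the full grid layout are ignored here):

```cpp
#include <cstdint>

// Start with the smallest precompiled tile size and switch to a larger one
// only as long as that reduces the number of waves of thread blocks on this GPU.
static int choose_mmq_tile_size(const int * tile_sizes, const int ntiles, const int64_t ncols_y, const int nsm) {
    int     best       = tile_sizes[0];
    int64_t best_waves = ((ncols_y + tile_sizes[0] - 1) / tile_sizes[0] + nsm - 1) / nsm;

    for (int i = 1; i < ntiles; ++i) {
        const int64_t nblocks = (ncols_y + tile_sizes[i] - 1) / tile_sizes[i]; // blocks along the batch dimension
        const int64_t waves   = (nblocks + nsm - 1) / nsm;                     // assuming one block per SM

        if (waves < best_waves) { // a larger tile is only worth it if it saves at least one wave
            best       = tile_sizes[i];
            best_waves = waves;
        } else {
            break;
        }
    }
    return best;
}
```

The number of streaming multiprocessors enters this choice directly, which is why the optimal tile sizes are hardware dependent (see the "overfitting" remark below).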

The performance of MMQ changes due to the dynamic tile sizes. The tile sizes on master were tuned for the same GPUs on LLaMA 2 7b; the following numbers are for LLaMA 3 8b. For context, the optimal tile sizes for a matrix multiplication depend strongly on the matrix shapes and on the number of streaming multiprocessors of the GPU. So the numbers on master are likely "overfit" to these specific model and GPU combinations, and I think the relative performance change on GPUs that I do not have access to will be better than what I report in this PR.

On my GPUs the performance for small batch sizes becomes significantly better. The performance for large batch sizes on NVIDIA GPUs stays essentially constant on average, while the performance of my RX 6800 gets worse at large batch sizes for some quants. I don't understand why this is happening, and I cannot do a git bisect either because my WIP commits only work correctly for q8_0, where there is no performance regression. My fundamental stance is that I am fine with supporting AMD via HIP as long as it's not too much effort, and in this case figuring out the exact problem is too much effort, so I will not do it.

Specific numbers:

RTX 3090, no LLAMA_CUDA_FORCE_MMQ
GPU Model Microbatch size Split mode Test t/s master t/s PR Speedup
RTX 3090 llama 8B Q2_K_M 16 layer pp2048 392.47 481.69 1.23
RTX 3090 llama 8B Q2_K_M 32 layer pp2048 423.95 579.88 1.37
RTX 3090 llama 8B Q2_K_M 64 layer pp2048 1299.29 858.74 0.66
RTX 3090 llama 8B Q3_K_S 16 layer pp2048 249.45 393.17 1.58
RTX 3090 llama 8B Q3_K_S 32 layer pp2048 267.04 504.87 1.89
RTX 3090 llama 8B Q3_K_S 64 layer pp2048 1194.91 806.45 0.67
RTX 3090 llama 8B Q4_0 16 layer pp2048 591.05 914.10 1.55
RTX 3090 llama 8B Q4_0 32 layer pp2048 630.19 1197.13 1.90
RTX 3090 llama 8B Q4_0 64 layer pp2048 1367.19 1637.67 1.20
RTX 3090 llama 8B Q4_1 16 layer pp2048 622.09 886.10 1.42
RTX 3090 llama 8B Q4_1 32 layer pp2048 673.10 1150.15 1.71
RTX 3090 llama 8B Q4_1 64 layer pp2048 1340.03 1492.12 1.11
RTX 3090 llama 8B Q4_K_S 16 layer pp2048 393.77 697.46 1.77
RTX 3090 llama 8B Q4_K_S 32 layer pp2048 413.01 923.26 2.24
RTX 3090 llama 8B Q4_K_S 64 layer pp2048 1308.25 1392.56 1.06
RTX 3090 llama 8B Q5_0 16 layer pp2048 358.51 618.33 1.72
RTX 3090 llama 8B Q5_0 32 layer pp2048 395.60 844.33 2.13
RTX 3090 llama 8B Q5_0 64 layer pp2048 1262.56 1246.41 0.99
RTX 3090 llama 8B Q5_1 16 layer pp2048 416.17 676.59 1.63
RTX 3090 llama 8B Q5_1 32 layer pp2048 454.51 861.92 1.90
RTX 3090 llama 8B Q5_1 64 layer pp2048 1251.70 1232.35 0.98
RTX 3090 llama 8B Q5_K_S 16 layer pp2048 309.30 563.72 1.82
RTX 3090 llama 8B Q5_K_S 32 layer pp2048 325.13 800.90 2.46
RTX 3090 llama 8B Q5_K_S 64 layer pp2048 1276.73 1230.66 0.96
RTX 3090 llama 8B Q6_K 16 layer pp2048 289.82 535.79 1.85
RTX 3090 llama 8B Q6_K 32 layer pp2048 305.62 781.51 2.56
RTX 3090 llama 8B Q6_K 64 layer pp2048 1238.65 1184.03 0.96
RTX 3090 llama 8B Q8_0 16 layer pp2048 351.56 563.40 1.60
RTX 3090 llama 8B Q8_0 32 layer pp2048 369.43 899.86 2.44
RTX 3090 llama 8B Q8_0 64 layer pp2048 1259.81 1306.82 1.04
RTX 4090, no LLAMA_CUDA_FORCE_MMQ
GPU Model Microbatch size Split mode Test t/s master t/s PR Speedup
RTX 4090 llama 8B Q2_K_M 16 layer pp2048 972.80 1088.45 1.12
RTX 4090 llama 8B Q2_K_M 32 layer pp2048 1092.77 1534.41 1.40
RTX 4090 llama 8B Q2_K_M 64 layer pp2048 1884.49 1986.58 1.05
RTX 4090 llama 8B Q3_K_S 16 layer pp2048 620.53 879.70 1.42
RTX 4090 llama 8B Q3_K_S 32 layer pp2048 692.28 1320.28 1.91
RTX 4090 llama 8B Q3_K_S 64 layer pp2048 1833.09 1832.11 1.00
RTX 4090 llama 8B Q4_0 16 layer pp2048 1342.75 1570.00 1.17
RTX 4090 llama 8B Q4_0 32 layer pp2048 1562.30 2559.30 1.64
RTX 4090 llama 8B Q4_0 64 layer pp2048 1804.73 3612.61 2.00
RTX 4090 llama 8B Q4_1 16 layer pp2048 1420.15 1539.71 1.08
RTX 4090 llama 8B Q4_1 32 layer pp2048 1695.67 2518.34 1.49
RTX 4090 llama 8B Q4_1 64 layer pp2048 1786.66 3588.29 2.01
RTX 4090 llama 8B Q4_K_S 16 layer pp2048 979.09 1371.60 1.40
RTX 4090 llama 8B Q4_K_S 32 layer pp2048 1099.34 2303.08 2.09
RTX 4090 llama 8B Q4_K_S 64 layer pp2048 1798.72 3066.75 1.70
RTX 4090 llama 8B Q5_0 16 layer pp2048 745.75 1240.81 1.66
RTX 4090 llama 8B Q5_0 32 layer pp2048 981.11 1948.28 1.99
RTX 4090 llama 8B Q5_0 64 layer pp2048 1767.57 2859.05 1.62
RTX 4090 llama 8B Q5_1 16 layer pp2048 905.93 1283.41 1.42
RTX 4090 llama 8B Q5_1 32 layer pp2048 1141.76 2099.30 1.84
RTX 4090 llama 8B Q5_1 64 layer pp2048 1772.44 2925.59 1.65
RTX 4090 llama 8B Q5_K_S 16 layer pp2048 775.69 1173.09 1.51
RTX 4090 llama 8B Q5_K_S 32 layer pp2048 863.19 1867.92 2.16
RTX 4090 llama 8B Q5_K_S 64 layer pp2048 1798.36 2645.96 1.47
RTX 4090 llama 8B Q6_K 16 layer pp2048 735.30 1081.85 1.47
RTX 4090 llama 8B Q6_K 32 layer pp2048 843.60 1799.98 2.13
RTX 4090 llama 8B Q6_K 64 layer pp2048 1756.18 2534.44 1.44
RTX 4090 llama 8B Q8_0 16 layer pp2048 811.09 1025.54 1.26
RTX 4090 llama 8B Q8_0 32 layer pp2048 1012.77 1757.51 1.74
RTX 4090 llama 8B Q8_0 64 layer pp2048 1716.77 2919.26 1.70
2x RTX 4090 llama 8B Q2_K_M 16 layer pp2048 1776.51 1976.18 1.11
2x RTX 4090 llama 8B Q2_K_M 16 row pp2048 585.16 545.31 0.93
2x RTX 4090 llama 8B Q2_K_M 32 layer pp2048 1985.83 2763.69 1.39
2x RTX 4090 llama 8B Q2_K_M 32 row pp2048 841.47 867.13 1.03
2x RTX 4090 llama 8B Q2_K_M 64 layer pp2048 3321.82 3531.51 1.06
2x RTX 4090 llama 8B Q2_K_M 64 row pp2048 821.57 1214.47 1.48
2x RTX 4090 llama 8B Q3_K_S 16 layer pp2048 1134.49 1600.03 1.41
2x RTX 4090 llama 8B Q3_K_S 16 row pp2048 481.46 497.39 1.03
2x RTX 4090 llama 8B Q3_K_S 32 layer pp2048 1257.66 2384.14 1.90
2x RTX 4090 llama 8B Q3_K_S 32 row pp2048 675.35 792.48 1.17
2x RTX 4090 llama 8B Q3_K_S 64 layer pp2048 3240.87 3266.84 1.01
2x RTX 4090 llama 8B Q3_K_S 64 row pp2048 814.73 1141.26 1.40
2x RTX 4090 llama 8B Q4_0 16 layer pp2048 2433.17 2847.72 1.17
2x RTX 4090 llama 8B Q4_0 16 row pp2048 649.30 618.31 0.95
2x RTX 4090 llama 8B Q4_0 32 layer pp2048 2819.08 4561.44 1.62
2x RTX 4090 llama 8B Q4_0 32 row pp2048 963.94 1000.48 1.04
2x RTX 4090 llama 8B Q4_0 64 layer pp2048 3190.30 6354.50 1.99
2x RTX 4090 llama 8B Q4_0 64 row pp2048 788.69 1447.09 1.83
2x RTX 4090 llama 8B Q4_1 16 layer pp2048 2572.38 2781.84 1.08
2x RTX 4090 llama 8B Q4_1 16 row pp2048 665.93 626.05 0.94
2x RTX 4090 llama 8B Q4_1 32 layer pp2048 3072.93 4491.90 1.46
2x RTX 4090 llama 8B Q4_1 32 row pp2048 994.15 1007.77 1.01
2x RTX 4090 llama 8B Q4_1 64 layer pp2048 3166.63 6314.96 1.99
2x RTX 4090 llama 8B Q4_1 64 row pp2048 786.60 1447.73 1.84
2x RTX 4090 llama 8B Q4_K_S 16 layer pp2048 1760.69 2471.71 1.40
2x RTX 4090 llama 8B Q4_K_S 16 row pp2048 605.61 610.61 1.01
2x RTX 4090 llama 8B Q4_K_S 32 layer pp2048 1964.66 4103.54 2.09
2x RTX 4090 llama 8B Q4_K_S 32 row pp2048 842.56 964.23 1.14
2x RTX 4090 llama 8B Q4_K_S 64 layer pp2048 3196.31 5379.54 1.68
2x RTX 4090 llama 8B Q4_K_S 64 row pp2048 809.88 1395.53 1.72
2x RTX 4090 llama 8B Q5_0 16 layer pp2048 1358.29 2250.33 1.66
2x RTX 4090 llama 8B Q5_0 16 row pp2048 490.87 570.23 1.16
2x RTX 4090 llama 8B Q5_0 32 layer pp2048 1780.97 3494.86 1.96
2x RTX 4090 llama 8B Q5_0 32 row pp2048 762.41 923.69 1.21
2x RTX 4090 llama 8B Q5_0 64 layer pp2048 3117.11 5052.16 1.62
2x RTX 4090 llama 8B Q5_0 64 row pp2048 815.07 1334.39 1.64
2x RTX 4090 llama 8B Q5_1 16 layer pp2048 1658.78 2332.96 1.41
2x RTX 4090 llama 8B Q5_1 16 row pp2048 544.71 584.50 1.07
2x RTX 4090 llama 8B Q5_1 32 layer pp2048 2070.69 3782.95 1.83
2x RTX 4090 llama 8B Q5_1 32 row pp2048 834.43 931.89 1.12
2x RTX 4090 llama 8B Q5_1 64 layer pp2048 3150.75 5176.20 1.64
2x RTX 4090 llama 8B Q5_1 64 row pp2048 811.91 1360.48 1.68
2x RTX 4090 llama 8B Q5_K_S 16 layer pp2048 1407.27 2139.09 1.52
2x RTX 4090 llama 8B Q5_K_S 16 row pp2048 543.15 558.39 1.03
2x RTX 4090 llama 8B Q5_K_S 32 layer pp2048 1559.03 3372.95 2.16
2x RTX 4090 llama 8B Q5_K_S 32 row pp2048 750.53 897.55 1.20
2x RTX 4090 llama 8B Q5_K_S 64 layer pp2048 3176.75 4688.08 1.48
2x RTX 4090 llama 8B Q5_K_S 64 row pp2048 811.82 1301.78 1.60
2x RTX 4090 llama 8B Q6_K 16 layer pp2048 1339.50 1960.15 1.46
2x RTX 4090 llama 8B Q6_K 16 row pp2048 517.41 552.49 1.07
2x RTX 4090 llama 8B Q6_K 32 layer pp2048 1529.44 3238.60 2.12
2x RTX 4090 llama 8B Q6_K 32 row pp2048 742.92 884.11 1.19
2x RTX 4090 llama 8B Q6_K 64 layer pp2048 3118.71 4485.22 1.44
2x RTX 4090 llama 8B Q6_K 64 row pp2048 808.57 1301.09 1.61
2x RTX 4090 llama 8B Q8_0 16 layer pp2048 1470.05 1863.42 1.27
2x RTX 4090 llama 8B Q8_0 16 row pp2048 552.37 550.86 1.00
2x RTX 4090 llama 8B Q8_0 32 layer pp2048 1833.58 3170.27 1.73
2x RTX 4090 llama 8B Q8_0 32 row pp2048 811.99 911.36 1.12
2x RTX 4090 llama 8B Q8_0 64 layer pp2048 3019.69 5176.23 1.71
2x RTX 4090 llama 8B Q8_0 64 row pp2048 800.14 1300.50 1.63
RTX 3090, LLAMA_CUDA_FORCE_MMQ
GPU Model Model Size [GiB] Microbatch size Test t/s master t/s PR Speedup
RTX 3090 llama 8B Q2_K_M 2.95 16 pp2048 163.29 494.78 3.03
RTX 3090 llama 8B Q2_K_M 2.95 32 pp2048 321.88 584.64 1.82
RTX 3090 llama 8B Q2_K_M 2.95 64 pp2048 631.37 854.42 1.35
RTX 3090 llama 8B Q2_K_M 2.95 128 pp2048 1028.19 1093.64 1.06
RTX 3090 llama 8B Q2_K_M 2.95 256 pp2048 1413.48 1397.28 0.99
RTX 3090 llama 8B Q2_K_M 2.95 512 pp2048 1570.23 1433.82 0.91
RTX 3090 llama 8B Q2_K_M 2.95 1024 pp2048 1601.62 1486.84 0.93
RTX 3090 llama 8B Q2_K_M 2.95 2048 pp2048 1630.83 1500.23 0.92
RTX 3090 llama 8B Q3_K_S 3.41 16 pp2048 138.61 393.95 2.84
RTX 3090 llama 8B Q3_K_S 3.41 32 pp2048 274.15 505.36 1.84
RTX 3090 llama 8B Q3_K_S 3.41 64 pp2048 536.23 808.78 1.51
RTX 3090 llama 8B Q3_K_S 3.41 128 pp2048 995.17 1128.29 1.13
RTX 3090 llama 8B Q3_K_S 3.41 256 pp2048 1602.63 1490.57 0.93
RTX 3090 llama 8B Q3_K_S 3.41 512 pp2048 1807.43 1542.94 0.85
RTX 3090 llama 8B Q3_K_S 3.41 1024 pp2048 1929.11 1605.67 0.83
RTX 3090 llama 8B Q3_K_S 3.41 2048 pp2048 1986.19 1622.01 0.82
RTX 3090 llama 8B Q4_0 4.33 16 pp2048 333.26 917.09 2.75
RTX 3090 llama 8B Q4_0 4.33 32 pp2048 641.48 1191.38 1.86
RTX 3090 llama 8B Q4_0 4.33 64 pp2048 1232.03 1644.34 1.33
RTX 3090 llama 8B Q4_0 4.33 128 pp2048 1932.77 1922.94 0.99
RTX 3090 llama 8B Q4_0 4.33 256 pp2048 2330.31 2369.79 1.02
RTX 3090 llama 8B Q4_0 4.33 512 pp2048 2499.28 2480.35 0.99
RTX 3090 llama 8B Q4_0 4.33 1024 pp2048 2640.85 2532.00 0.96
RTX 3090 llama 8B Q4_0 4.33 2048 pp2048 2635.71 2470.43 0.94
RTX 3090 llama 8B Q4_1 4.77 16 pp2048 251.07 887.12 3.53
RTX 3090 llama 8B Q4_1 4.77 32 pp2048 488.82 1167.72 2.39
RTX 3090 llama 8B Q4_1 4.77 64 pp2048 943.81 1512.42 1.60
RTX 3090 llama 8B Q4_1 4.77 128 pp2048 1494.59 1835.42 1.23
RTX 3090 llama 8B Q4_1 4.77 256 pp2048 1952.65 2191.26 1.12
RTX 3090 llama 8B Q4_1 4.77 512 pp2048 2116.32 2291.76 1.08
RTX 3090 llama 8B Q4_1 4.77 1024 pp2048 2202.68 2363.79 1.07
RTX 3090 llama 8B Q4_1 4.77 2048 pp2048 2218.74 2315.95 1.04
RTX 3090 llama 8B Q4_K_S 4.36 16 pp2048 252.31 694.73 2.75
RTX 3090 llama 8B Q4_K_S 4.36 32 pp2048 491.57 912.68 1.86
RTX 3090 llama 8B Q4_K_S 4.36 64 pp2048 951.84 1381.44 1.45
RTX 3090 llama 8B Q4_K_S 4.36 128 pp2048 1565.50 1625.07 1.04
RTX 3090 llama 8B Q4_K_S 4.36 256 pp2048 1946.74 2014.97 1.04
RTX 3090 llama 8B Q4_K_S 4.36 512 pp2048 2146.68 2131.84 0.99
RTX 3090 llama 8B Q4_K_S 4.36 1024 pp2048 2250.37 2195.10 0.98
RTX 3090 llama 8B Q4_K_S 4.36 2048 pp2048 2284.91 2154.20 0.94
RTX 3090 llama 8B Q5_0 5.21 16 pp2048 242.67 626.85 2.58
RTX 3090 llama 8B Q5_0 5.21 32 pp2048 472.40 867.40 1.84
RTX 3090 llama 8B Q5_0 5.21 64 pp2048 915.94 1266.94 1.38
RTX 3090 llama 8B Q5_0 5.21 128 pp2048 1647.49 1648.93 1.00
RTX 3090 llama 8B Q5_0 5.21 256 pp2048 1994.59 2089.43 1.05
RTX 3090 llama 8B Q5_0 5.21 512 pp2048 2106.90 2187.50 1.04
RTX 3090 llama 8B Q5_0 5.21 1024 pp2048 2205.17 2267.47 1.03
RTX 3090 llama 8B Q5_0 5.21 2048 pp2048 2210.19 2235.98 1.01
RTX 3090 llama 8B Q5_1 5.64 16 pp2048 186.32 680.91 3.65
RTX 3090 llama 8B Q5_1 5.64 32 pp2048 366.99 867.07 2.36
RTX 3090 llama 8B Q5_1 5.64 64 pp2048 716.15 1259.08 1.76
RTX 3090 llama 8B Q5_1 5.64 128 pp2048 1314.52 1605.30 1.22
RTX 3090 llama 8B Q5_1 5.64 256 pp2048 1739.79 2026.19 1.16
RTX 3090 llama 8B Q5_1 5.64 512 pp2048 1844.11 2132.63 1.16
RTX 3090 llama 8B Q5_1 5.64 1024 pp2048 1945.92 2217.82 1.14
RTX 3090 llama 8B Q5_1 5.64 2048 pp2048 1974.16 2156.70 1.09
RTX 3090 llama 8B Q5_K_S 5.21 16 pp2048 225.74 564.43 2.50
RTX 3090 llama 8B Q5_K_S 5.21 32 pp2048 442.22 802.98 1.82
RTX 3090 llama 8B Q5_K_S 5.21 64 pp2048 859.86 1235.84 1.44
RTX 3090 llama 8B Q5_K_S 5.21 128 pp2048 1423.29 1460.88 1.03
RTX 3090 llama 8B Q5_K_S 5.21 256 pp2048 1799.34 1925.35 1.07
RTX 3090 llama 8B Q5_K_S 5.21 512 pp2048 1994.80 2026.59 1.02
RTX 3090 llama 8B Q5_K_S 5.21 1024 pp2048 2102.12 2093.29 1.00
RTX 3090 llama 8B Q5_K_S 5.21 2048 pp2048 2118.11 2050.40 0.97
RTX 3090 llama 8B Q6_K 6.14 16 pp2048 368.95 534.24 1.45
RTX 3090 llama 8B Q6_K 6.14 32 pp2048 710.40 775.39 1.09
RTX 3090 llama 8B Q6_K 6.14 64 pp2048 1338.74 1181.64 0.88
RTX 3090 llama 8B Q6_K 6.14 128 pp2048 1601.37 1447.61 0.90
RTX 3090 llama 8B Q6_K 6.14 256 pp2048 1858.69 1885.03 1.01
RTX 3090 llama 8B Q6_K 6.14 512 pp2048 1939.22 1985.76 1.02
RTX 3090 llama 8B Q6_K 6.14 1024 pp2048 1956.11 2045.91 1.05
RTX 3090 llama 8B Q6_K 6.14 2048 pp2048 1960.44 2008.68 1.02
RTX 3090 llama 8B Q8_0 7.95 16 pp2048 262.05 568.95 2.17
RTX 3090 llama 8B Q8_0 7.95 32 pp2048 513.85 920.05 1.79
RTX 3090 llama 8B Q8_0 7.95 64 pp2048 999.08 1330.77 1.33
RTX 3090 llama 8B Q8_0 7.95 128 pp2048 1856.45 1715.81 0.92
RTX 3090 llama 8B Q8_0 7.95 256 pp2048 2191.89 2170.14 0.99
RTX 3090 llama 8B Q8_0 7.95 512 pp2048 2294.07 2302.60 1.00
RTX 3090 llama 8B Q8_0 7.95 1024 pp2048 2407.52 2355.60 0.98
RTX 3090 llama 8B Q8_0 7.95 2048 pp2048 2400.76 2287.24 0.95
RTX 4090, LLAMA_CUDA_FORCE_MMQ
GPU Model Model Size [GiB] Microbatch size Test t/s master t/s PR Speedup
RTX 4090 llama 8B Q2_K_M 2.95 16 pp2048 280.30 1087.52 3.88
RTX 4090 llama 8B Q2_K_M 2.95 32 pp2048 554.20 1533.46 2.77
RTX 4090 llama 8B Q2_K_M 2.95 64 pp2048 1088.36 1979.80 1.82
RTX 4090 llama 8B Q2_K_M 2.95 128 pp2048 1850.41 2793.72 1.51
RTX 4090 llama 8B Q2_K_M 2.95 256 pp2048 2932.68 3467.81 1.18
RTX 4090 llama 8B Q2_K_M 2.95 512 pp2048 4121.92 3887.50 0.94
RTX 4090 llama 8B Q2_K_M 2.95 1024 pp2048 4289.52 4045.93 0.94
RTX 4090 llama 8B Q2_K_M 2.95 2048 pp2048 4165.25 3878.64 0.93
RTX 4090 llama 8B Q3_K_S 3.41 16 pp2048 239.53 879.02 3.67
RTX 4090 llama 8B Q3_K_S 3.41 32 pp2048 474.28 1320.06 2.78
RTX 4090 llama 8B Q3_K_S 3.41 64 pp2048 933.42 1829.11 1.96
RTX 4090 llama 8B Q3_K_S 3.41 128 pp2048 1779.35 2766.25 1.55
RTX 4090 llama 8B Q3_K_S 3.41 256 pp2048 3034.65 3539.88 1.17
RTX 4090 llama 8B Q3_K_S 3.41 512 pp2048 4482.77 4146.28 0.92
RTX 4090 llama 8B Q3_K_S 3.41 1024 pp2048 5073.63 4344.08 0.86
RTX 4090 llama 8B Q3_K_S 3.41 2048 pp2048 4995.96 4197.65 0.84
RTX 4090 llama 8B Q4_0 4.33 16 pp2048 535.29 1569.22 2.93
RTX 4090 llama 8B Q4_0 4.33 32 pp2048 1048.60 2545.21 2.43
RTX 4090 llama 8B Q4_0 4.33 64 pp2048 2030.56 3606.46 1.78
RTX 4090 llama 8B Q4_0 4.33 128 pp2048 3495.03 4847.14 1.39
RTX 4090 llama 8B Q4_0 4.33 256 pp2048 5360.22 5749.91 1.07
RTX 4090 llama 8B Q4_0 4.33 512 pp2048 6421.53 6254.92 0.97
RTX 4090 llama 8B Q4_0 4.33 1024 pp2048 6481.77 6417.54 0.99
RTX 4090 llama 8B Q4_0 4.33 2048 pp2048 6229.96 6048.51 0.97
RTX 4090 llama 8B Q4_1 4.77 16 pp2048 390.13 1539.53 3.95
RTX 4090 llama 8B Q4_1 4.77 32 pp2048 767.93 2507.41 3.27
RTX 4090 llama 8B Q4_1 4.77 64 pp2048 1495.66 3593.41 2.40
RTX 4090 llama 8B Q4_1 4.77 128 pp2048 2739.88 4678.62 1.71
RTX 4090 llama 8B Q4_1 4.77 256 pp2048 4252.17 5516.40 1.30
RTX 4090 llama 8B Q4_1 4.77 512 pp2048 5508.16 5860.44 1.06
RTX 4090 llama 8B Q4_1 4.77 1024 pp2048 5578.34 6003.53 1.08
RTX 4090 llama 8B Q4_1 4.77 2048 pp2048 5405.53 5688.96 1.05
RTX 4090 llama 8B Q4_K_S 4.36 16 pp2048 428.54 1373.54 3.21
RTX 4090 llama 8B Q4_K_S 4.36 32 pp2048 843.14 2301.87 2.73
RTX 4090 llama 8B Q4_K_S 4.36 64 pp2048 1637.37 3069.63 1.87
RTX 4090 llama 8B Q4_K_S 4.36 128 pp2048 2879.95 4390.96 1.52
RTX 4090 llama 8B Q4_K_S 4.36 256 pp2048 4506.80 5101.72 1.13
RTX 4090 llama 8B Q4_K_S 4.36 512 pp2048 5623.80 5514.66 0.98
RTX 4090 llama 8B Q4_K_S 4.36 1024 pp2048 5750.54 5670.15 0.99
RTX 4090 llama 8B Q4_K_S 4.36 2048 pp2048 5586.98 5423.83 0.97
RTX 4090 llama 8B Q5_0 5.21 16 pp2048 434.91 1238.82 2.85
RTX 4090 llama 8B Q5_0 5.21 32 pp2048 853.18 1941.41 2.28
RTX 4090 llama 8B Q5_0 5.21 64 pp2048 1657.11 2859.59 1.73
RTX 4090 llama 8B Q5_0 5.21 128 pp2048 3063.92 4117.36 1.34
RTX 4090 llama 8B Q5_0 5.21 256 pp2048 4653.62 5045.57 1.08
RTX 4090 llama 8B Q5_0 5.21 512 pp2048 5542.35 5703.05 1.03
RTX 4090 llama 8B Q5_0 5.21 1024 pp2048 5608.58 5876.11 1.05
RTX 4090 llama 8B Q5_0 5.21 2048 pp2048 5408.58 5568.99 1.03
RTX 4090 llama 8B Q5_1 5.64 16 pp2048 339.78 1281.60 3.77
RTX 4090 llama 8B Q5_1 5.64 32 pp2048 668.74 2099.77 3.14
RTX 4090 llama 8B Q5_1 5.64 64 pp2048 1312.16 2929.72 2.23
RTX 4090 llama 8B Q5_1 5.64 128 pp2048 2492.66 4123.10 1.65
RTX 4090 llama 8B Q5_1 5.64 256 pp2048 3850.62 4949.12 1.29
RTX 4090 llama 8B Q5_1 5.64 512 pp2048 4929.19 5510.18 1.12
RTX 4090 llama 8B Q5_1 5.64 1024 pp2048 5028.96 5713.52 1.14
RTX 4090 llama 8B Q5_1 5.64 2048 pp2048 4887.91 5460.78 1.12
RTX 4090 llama 8B Q5_K_S 5.21 16 pp2048 375.48 1177.82 3.14
RTX 4090 llama 8B Q5_K_S 5.21 32 pp2048 737.52 1872.81 2.54
RTX 4090 llama 8B Q5_K_S 5.21 64 pp2048 1440.87 2649.23 1.84
RTX 4090 llama 8B Q5_K_S 5.21 128 pp2048 2561.62 3950.98 1.54
RTX 4090 llama 8B Q5_K_S 5.21 256 pp2048 4068.86 4706.04 1.16
RTX 4090 llama 8B Q5_K_S 5.21 512 pp2048 5186.66 5170.25 1.00
RTX 4090 llama 8B Q5_K_S 5.21 1024 pp2048 5364.80 5377.56 1.00
RTX 4090 llama 8B Q5_K_S 5.21 2048 pp2048 5256.99 5128.71 0.98
RTX 4090 llama 8B Q6_K 6.14 16 pp2048 642.89 1080.54 1.68
RTX 4090 llama 8B Q6_K 6.14 32 pp2048 1243.09 1797.63 1.45
RTX 4090 llama 8B Q6_K 6.14 64 pp2048 2389.93 2534.41 1.06
RTX 4090 llama 8B Q6_K 6.14 128 pp2048 3760.59 3801.63 1.01
RTX 4090 llama 8B Q6_K 6.14 256 pp2048 4677.94 4535.79 0.97
RTX 4090 llama 8B Q6_K 6.14 512 pp2048 4999.82 5063.72 1.01
RTX 4090 llama 8B Q6_K 6.14 1024 pp2048 4978.01 5224.71 1.05
RTX 4090 llama 8B Q6_K 6.14 2048 pp2048 4816.73 4956.96 1.03
RTX 4090 llama 8B Q8_0 7.95 16 pp2048 483.67 1023.58 2.12
RTX 4090 llama 8B Q8_0 7.95 32 pp2048 950.71 1756.94 1.85
RTX 4090 llama 8B Q8_0 7.95 64 pp2048 1857.06 2921.49 1.57
RTX 4090 llama 8B Q8_0 7.95 128 pp2048 3494.25 4172.97 1.19
RTX 4090 llama 8B Q8_0 7.95 256 pp2048 5172.22 5211.72 1.01
RTX 4090 llama 8B Q8_0 7.95 512 pp2048 6046.40 5790.95 0.96
RTX 4090 llama 8B Q8_0 7.95 1024 pp2048 6129.79 6006.39 0.98
RTX 4090 llama 8B Q8_0 7.95 2048 pp2048 5879.63 5726.80 0.97
P40
GPU Model Model Size [GiB] Microbatch size Test t/s master t/s PR Speedup
P40 llama 8B Q2_K_M 2.95 16 pp2048 113.84 235.20 2.07
P40 llama 8B Q2_K_M 2.95 32 pp2048 222.13 340.32 1.53
P40 llama 8B Q2_K_M 2.95 64 pp2048 401.56 399.45 0.99
P40 llama 8B Q2_K_M 2.95 128 pp2048 483.67 484.29 1.00
P40 llama 8B Q2_K_M 2.95 256 pp2048 541.16 537.77 0.99
P40 llama 8B Q2_K_M 2.95 512 pp2048 569.73 567.98 1.00
P40 llama 8B Q2_K_M 2.95 1024 pp2048 567.92 563.82 0.99
P40 llama 8B Q2_K_M 2.95 2048 pp2048 548.21 547.31 1.00
P40 llama 8B Q3_K_S 3.41 16 pp2048 110.40 190.60 1.73
P40 llama 8B Q3_K_S 3.41 32 pp2048 214.51 304.42 1.42
P40 llama 8B Q3_K_S 3.41 64 pp2048 386.80 380.14 0.98
P40 llama 8B Q3_K_S 3.41 128 pp2048 469.63 467.09 0.99
P40 llama 8B Q3_K_S 3.41 256 pp2048 524.10 518.51 0.99
P40 llama 8B Q3_K_S 3.41 512 pp2048 555.87 548.34 0.99
P40 llama 8B Q3_K_S 3.41 1024 pp2048 550.70 544.71 0.99
P40 llama 8B Q3_K_S 3.41 2048 pp2048 531.90 529.43 1.00
P40 llama 8B Q4_0 4.33 16 pp2048 189.25 441.39 2.33
P40 llama 8B Q4_0 4.33 32 pp2048 363.00 611.04 1.68
P40 llama 8B Q4_0 4.33 64 pp2048 621.79 650.57 1.05
P40 llama 8B Q4_0 4.33 128 pp2048 736.06 766.34 1.04
P40 llama 8B Q4_0 4.33 256 pp2048 825.19 853.66 1.03
P40 llama 8B Q4_0 4.33 512 pp2048 866.07 893.46 1.03
P40 llama 8B Q4_0 4.33 1024 pp2048 852.91 875.17 1.03
P40 llama 8B Q4_0 4.33 2048 pp2048 815.52 835.49 1.02
P40 llama 8B Q4_1 4.77 16 pp2048 194.52 440.69 2.27
P40 llama 8B Q4_1 4.77 32 pp2048 371.19 536.30 1.44
P40 llama 8B Q4_1 4.77 64 pp2048 634.44 640.60 1.01
P40 llama 8B Q4_1 4.77 128 pp2048 741.06 757.04 1.02
P40 llama 8B Q4_1 4.77 256 pp2048 830.28 840.92 1.01
P40 llama 8B Q4_1 4.77 512 pp2048 868.55 876.99 1.01
P40 llama 8B Q4_1 4.77 1024 pp2048 854.25 863.79 1.01
P40 llama 8B Q4_1 4.77 2048 pp2048 815.03 823.34 1.01
P40 llama 8B Q4_K_S 4.36 16 pp2048 179.33 394.10 2.20
P40 llama 8B Q4_K_S 4.36 32 pp2048 344.26 494.92 1.44
P40 llama 8B Q4_K_S 4.36 64 pp2048 594.06 596.95 1.00
P40 llama 8B Q4_K_S 4.36 128 pp2048 697.22 703.14 1.01
P40 llama 8B Q4_K_S 4.36 256 pp2048 769.97 772.38 1.00
P40 llama 8B Q4_K_S 4.36 512 pp2048 803.60 808.76 1.01
P40 llama 8B Q4_K_S 4.36 1024 pp2048 790.52 794.43 1.00
P40 llama 8B Q4_K_S 4.36 2048 pp2048 750.20 755.38 1.01
P40 llama 8B Q5_0 5.21 16 pp2048 176.80 335.92 1.90
P40 llama 8B Q5_0 5.21 32 pp2048 336.67 470.11 1.40
P40 llama 8B Q5_0 5.21 64 pp2048 582.42 618.02 1.06
P40 llama 8B Q5_0 5.21 128 pp2048 692.01 727.70 1.05
P40 llama 8B Q5_0 5.21 256 pp2048 778.30 798.69 1.03
P40 llama 8B Q5_0 5.21 512 pp2048 818.15 831.19 1.02
P40 llama 8B Q5_0 5.21 1024 pp2048 805.07 822.36 1.02
P40 llama 8B Q5_0 5.21 2048 pp2048 771.91 784.71 1.02
P40 llama 8B Q5_1 5.64 16 pp2048 189.62 376.34 1.98
P40 llama 8B Q5_1 5.64 32 pp2048 362.68 492.54 1.36
P40 llama 8B Q5_1 5.64 64 pp2048 621.65 611.85 0.98
P40 llama 8B Q5_1 5.64 128 pp2048 728.58 718.30 0.99
P40 llama 8B Q5_1 5.64 256 pp2048 804.45 791.69 0.98
P40 llama 8B Q5_1 5.64 512 pp2048 837.05 821.23 0.98
P40 llama 8B Q5_1 5.64 1024 pp2048 827.30 811.67 0.98
P40 llama 8B Q5_1 5.64 2048 pp2048 788.40 774.42 0.98
P40 llama 8B Q5_K_S 5.21 16 pp2048 159.99 303.61 1.90
P40 llama 8B Q5_K_S 5.21 32 pp2048 308.63 444.23 1.44
P40 llama 8B Q5_K_S 5.21 64 pp2048 539.97 574.33 1.06
P40 llama 8B Q5_K_S 5.21 128 pp2048 636.03 665.92 1.05
P40 llama 8B Q5_K_S 5.21 256 pp2048 710.62 735.23 1.03
P40 llama 8B Q5_K_S 5.21 512 pp2048 748.52 772.74 1.03
P40 llama 8B Q5_K_S 5.21 1024 pp2048 739.21 757.22 1.02
P40 llama 8B Q5_K_S 5.21 2048 pp2048 708.23 720.81 1.02
P40 llama 8B Q6_K 6.14 16 pp2048 165.27 288.74 1.75
P40 llama 8B Q6_K 6.14 32 pp2048 319.00 444.93 1.39
P40 llama 8B Q6_K 6.14 64 pp2048 556.81 558.13 1.00
P40 llama 8B Q6_K 6.14 128 pp2048 640.50 645.05 1.01
P40 llama 8B Q6_K 6.14 256 pp2048 707.44 706.76 1.00
P40 llama 8B Q6_K 6.14 512 pp2048 730.83 730.12 1.00
P40 llama 8B Q6_K 6.14 1024 pp2048 714.69 715.34 1.00
P40 llama 8B Q6_K 6.14 2048 pp2048 681.95 682.31 1.00
P40 llama 8B Q8_0 7.95 16 pp2048 188.85 325.28 1.72
P40 llama 8B Q8_0 7.95 32 pp2048 359.95 504.38 1.40
P40 llama 8B Q8_0 7.95 64 pp2048 617.11 618.68 1.00
P40 llama 8B Q8_0 7.95 128 pp2048 735.56 740.70 1.01
P40 llama 8B Q8_0 7.95 256 pp2048 820.26 825.80 1.01
P40 llama 8B Q8_0 7.95 512 pp2048 864.06 867.37 1.00
P40 llama 8B Q8_0 7.95 1024 pp2048 849.23 854.94 1.01
P40 llama 8B Q8_0 7.95 2048 pp2048 803.25 810.04 1.01
RX 6800
GPU Model Model Size [GiB] Microbatch size Test t/s master t/s PR Speedup
RX 6800 llama 8B Q2_K_M 2.95 16 pp2048 58.49 149.06 2.55
RX 6800 llama 8B Q2_K_M 2.95 32 pp2048 112.82 181.02 1.60
RX 6800 llama 8B Q2_K_M 2.95 64 pp2048 206.07 223.62 1.09
RX 6800 llama 8B Q2_K_M 2.95 128 pp2048 299.32 265.56 0.89
RX 6800 llama 8B Q2_K_M 2.95 256 pp2048 354.63 318.22 0.90
RX 6800 llama 8B Q2_K_M 2.95 512 pp2048 370.71 333.98 0.90
RX 6800 llama 8B Q2_K_M 2.95 1024 pp2048 399.86 354.89 0.89
RX 6800 llama 8B Q2_K_M 2.95 2048 pp2048 383.69 343.00 0.89
RX 6800 llama 8B Q3_K_S 3.41 16 pp2048 44.39 105.57 2.38
RX 6800 llama 8B Q3_K_S 3.41 32 pp2048 86.36 139.41 1.61
RX 6800 llama 8B Q3_K_S 3.41 64 pp2048 161.08 197.02 1.22
RX 6800 llama 8B Q3_K_S 3.41 128 pp2048 293.91 228.36 0.78
RX 6800 llama 8B Q3_K_S 3.41 256 pp2048 349.22 276.80 0.79
RX 6800 llama 8B Q3_K_S 3.41 512 pp2048 365.31 289.85 0.79
RX 6800 llama 8B Q3_K_S 3.41 1024 pp2048 394.24 310.18 0.79
RX 6800 llama 8B Q3_K_S 3.41 2048 pp2048 377.73 300.98 0.80
RX 6800 llama 8B Q4_0 4.33 16 pp2048 135.80 294.79 2.17
RX 6800 llama 8B Q4_0 4.33 32 pp2048 249.94 367.02 1.47
RX 6800 llama 8B Q4_0 4.33 64 pp2048 411.58 397.54 0.97
RX 6800 llama 8B Q4_0 4.33 128 pp2048 514.27 480.50 0.93
RX 6800 llama 8B Q4_0 4.33 256 pp2048 614.13 574.04 0.93
RX 6800 llama 8B Q4_0 4.33 512 pp2048 622.25 581.34 0.93
RX 6800 llama 8B Q4_0 4.33 1024 pp2048 699.92 641.77 0.92
RX 6800 llama 8B Q4_0 4.33 2048 pp2048 638.88 588.21 0.92
RX 6800 llama 8B Q4_1 4.77 16 pp2048 127.76 282.74 2.21
RX 6800 llama 8B Q4_1 4.77 32 pp2048 236.55 345.48 1.46
RX 6800 llama 8B Q4_1 4.77 64 pp2048 391.61 371.03 0.95
RX 6800 llama 8B Q4_1 4.77 128 pp2048 488.00 448.37 0.92
RX 6800 llama 8B Q4_1 4.77 256 pp2048 585.15 536.73 0.92
RX 6800 llama 8B Q4_1 4.77 512 pp2048 595.17 546.69 0.92
RX 6800 llama 8B Q4_1 4.77 1024 pp2048 668.59 600.58 0.90
RX 6800 llama 8B Q4_1 4.77 2048 pp2048 613.79 555.76 0.91
RX 6800 llama 8B Q4_K_S 4.36 16 pp2048 106.26 254.76 2.40
RX 6800 llama 8B Q4_K_S 4.36 32 pp2048 198.41 308.39 1.55
RX 6800 llama 8B Q4_K_S 4.36 64 pp2048 339.52 331.46 0.98
RX 6800 llama 8B Q4_K_S 4.36 128 pp2048 422.69 384.16 0.91
RX 6800 llama 8B Q4_K_S 4.36 256 pp2048 508.84 465.98 0.92
RX 6800 llama 8B Q4_K_S 4.36 512 pp2048 522.20 479.15 0.92
RX 6800 llama 8B Q4_K_S 4.36 1024 pp2048 579.60 522.77 0.90
RX 6800 llama 8B Q4_K_S 4.36 2048 pp2048 539.72 490.91 0.91
RX 6800 llama 8B Q5_0 5.21 16 pp2048 112.67 233.87 2.08
RX 6800 llama 8B Q5_0 5.21 32 pp2048 210.79 325.91 1.55
RX 6800 llama 8B Q5_0 5.21 64 pp2048 358.89 378.23 1.05
RX 6800 llama 8B Q5_0 5.21 128 pp2048 439.28 459.85 1.05
RX 6800 llama 8B Q5_0 5.21 256 pp2048 530.10 545.51 1.03
RX 6800 llama 8B Q5_0 5.21 512 pp2048 542.61 551.58 1.02
RX 6800 llama 8B Q5_0 5.21 1024 pp2048 602.53 606.02 1.01
RX 6800 llama 8B Q5_0 5.21 2048 pp2048 559.27 561.57 1.00
RX 6800 llama 8B Q5_1 5.64 16 pp2048 111.88 244.02 2.18
RX 6800 llama 8B Q5_1 5.64 32 pp2048 208.42 320.88 1.54
RX 6800 llama 8B Q5_1 5.64 64 pp2048 352.59 364.24 1.03
RX 6800 llama 8B Q5_1 5.64 128 pp2048 441.87 448.65 1.02
RX 6800 llama 8B Q5_1 5.64 256 pp2048 533.61 534.14 1.00
RX 6800 llama 8B Q5_1 5.64 512 pp2048 548.23 544.50 0.99
RX 6800 llama 8B Q5_1 5.64 1024 pp2048 612.32 600.05 0.98
RX 6800 llama 8B Q5_1 5.64 2048 pp2048 568.64 555.11 0.98
RX 6800 llama 8B Q5_K_S 5.21 16 pp2048 105.13 240.53 2.29
RX 6800 llama 8B Q5_K_S 5.21 32 pp2048 197.37 309.14 1.57
RX 6800 llama 8B Q5_K_S 5.21 64 pp2048 338.16 347.17 1.03
RX 6800 llama 8B Q5_K_S 5.21 128 pp2048 415.69 409.04 0.98
RX 6800 llama 8B Q5_K_S 5.21 256 pp2048 501.05 495.18 0.99
RX 6800 llama 8B Q5_K_S 5.21 512 pp2048 514.55 505.02 0.98
RX 6800 llama 8B Q5_K_S 5.21 1024 pp2048 569.75 552.43 0.97
RX 6800 llama 8B Q5_K_S 5.21 2048 pp2048 530.59 514.40 0.97
RX 6800 llama 8B Q6_K 6.14 16 pp2048 100.05 217.93 2.18
RX 6800 llama 8B Q6_K 6.14 32 pp2048 188.27 273.49 1.45
RX 6800 llama 8B Q6_K 6.14 64 pp2048 325.44 315.25 0.97
RX 6800 llama 8B Q6_K 6.14 128 pp2048 405.34 369.47 0.91
RX 6800 llama 8B Q6_K 6.14 256 pp2048 481.35 445.90 0.93
RX 6800 llama 8B Q6_K 6.14 512 pp2048 493.57 457.38 0.93
RX 6800 llama 8B Q6_K 6.14 1024 pp2048 541.72 497.17 0.92
RX 6800 llama 8B Q6_K 6.14 2048 pp2048 506.35 467.97 0.92
RX 6800 llama 8B Q8_0 7.95 16 pp2048 138.98 257.55 1.85
RX 6800 llama 8B Q8_0 7.95 32 pp2048 254.40 347.38 1.37
RX 6800 llama 8B Q8_0 7.95 64 pp2048 415.66 423.43 1.02
RX 6800 llama 8B Q8_0 7.95 128 pp2048 523.82 521.96 1.00
RX 6800 llama 8B Q8_0 7.95 256 pp2048 623.64 624.59 1.00
RX 6800 llama 8B Q8_0 7.95 512 pp2048 631.31 631.18 1.00
RX 6800 llama 8B Q8_0 7.95 1024 pp2048 710.66 705.28 0.99
RX 6800 llama 8B Q8_0 7.95 2048 pp2048 645.95 641.26 0.99

JohannesGaessler (Collaborator, Author) commented Jun 3, 2024

The compilation time increases due to the additional template instances, but I think the increase is acceptable:

| CPU | Configuration | Compilation time master [s] | Compilation time PR [s] |
|---|---|---|---|
| Ryzen 5950X (95W power limit) | CUDA, RTX 3090 | 28.903 | 33.360 |
| Ryzen 5950X (95W power limit) | CUDA, RTX 3090, LLAMA_CUDA_FORCE_MMQ | 31.322 | 46.539 |
| Xeon E5-2683 v4 | CUDA, P40 | 52.557 | 59.108 |
| Xeon E5-2683 v4 | HIP, RX 6800 | 57.050 | 64.020 |
| EPYC 7742 | CUDA, RTX 4090 | 32.835 | 33.558 |
| EPYC 7742 | CUDA, RTX 4090, LLAMA_CUDA_FORCE_MMQ | 36.682 | 51.276 |
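
For context, the extern template mechanism that keeps this increase in check works roughly as sketched below; the function name, signature and file names are illustrative, not the PR's actual interface:

```cpp
#include <cstdint>
#include "ggml.h" // ggml_type, GGML_TYPE_*

// mmq.cuh (illustrative): the template definition is visible to every includer ...
template <ggml_type type>
void mul_mat_q_case(const char * x, const char * y, float * dst, const int64_t ne) {
    // ... sets up and launches the MMQ kernel instantiated for `type` ...
}

// ... but includers are told NOT to instantiate it themselves:
extern template void mul_mat_q_case<GGML_TYPE_Q4_0>(const char *, const char *, float *, const int64_t);
extern template void mul_mat_q_case<GGML_TYPE_Q8_0>(const char *, const char *, float *, const int64_t);

// mmq-q4_0.cu (illustrative), which #includes the header above: exactly one translation
// unit per type provides the explicit instantiation, so the types compile in parallel.
template void mul_mat_q_case<GGML_TYPE_Q4_0>(const char *, const char *, float *, const int64_t);
```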

ggml-cuda/common.cuh: resolved review thread (outdated)
github-actions bot added the build, Nvidia GPU, python and ggml labels on Jun 3, 2024
JohannesGaessler (Collaborator, Author) commented:

This PR seems to increase the throughput on the server but not by much:

| Slots | Iterations master | Iterations PR |
|---|---|---|
| 8 | 2021 | 2022 |
| 16 | 2104 | 2134 |
| 32 | 2157 | 2285 |

Hardware is 1x RTX 4090.

Command
export BENCH_K6_BIN_PATH=~/go/bin/k6
export LLAMA_SERVER_BIN_PATH=../../../server
export SERVER_BENCH_URL=http://localhost:$(echo 1337 + $CUDA_VISIBLE_DEVICES | bc)/v1

python bench.py \
    --port $(echo 1337 + $CUDA_VISIBLE_DEVICES | bc)\
    --runner-label local \
    --name local \
    --branch `git rev-parse --abbrev-ref HEAD` \
    --commit `git rev-parse HEAD` \
    --scenario script.js \
    --duration 10m \
    --hf-repo ggml-org/models    \
    --hf-file phi-2/ggml-model-q4_0.gguf \
    --model-path-prefix models \
    --parallel $1 \
    -ngl 33 \
    --batch-size 2048 \
    --ubatch-size 256 \
    --ctx-size 65536 \
    --n-prompts 10000 \
    --max-prompt-tokens 1024 \
    --max-tokens 2048 \

github-actions bot commented Jun 4, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 540 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8672.23ms p(95)=21381.91ms fails=, finish reason: stop=479 truncated=61
  • Prompt processing (pp): avg=103.76tk/s p(95)=426.55tk/s
  • Token generation (tg): avg=45.79tk/s p(95)=48.01tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=deduplicate-mmq-12 commit=fd65ff31e9fe3ea8e12087b35b0d4beaab05107f

The bot comment also contains four time-series charts for this run ("llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 540 iterations"): llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio and llamacpp:requests_processing. The raw chart data is omitted here.

Comment on lines +425 to +433
static constexpr __device__ dequantize_kernel_t get_dequantize_kernel(ggml_type type) {
    return type == GGML_TYPE_Q4_0 ? dequantize_q4_0 :
        type == GGML_TYPE_Q4_1 ? dequantize_q4_1 :
        type == GGML_TYPE_Q5_0 ? dequantize_q5_0 :
        type == GGML_TYPE_Q5_1 ? dequantize_q5_1 :
        type == GGML_TYPE_Q8_0 ? dequantize_q8_0 :
        type == GGML_TYPE_F16 ? convert_f16 :
        nullptr;
}
Collaborator

I think this could also be moved to ggml_cuda_type_traits.

Collaborator Author

Are circular dependencies between common.cuh and dequantize.cuh okay?

Collaborator

No, but there is probably too much in common.cuh.

Collaborator Author

So should we for now just keep this as-is?

Collaborator

Up to you, it's just a suggestion.

ggml-cuda/mmq.cuh: resolved review thread

// -------------------------------------------------------------------------------------------------------------------------------------

static constexpr __device__ int get_need_sum(ggml_type type) {
Collaborator

Some of these could potentially be moved too.

Collaborator Author

These things are MMQ-specific and I would prefer to keep them together. I would have made another template struct if I were aware of a simple way to do it that also includes the functions for loading tiles and doing the vector dot products. But I don't know how to do that in such a way that I can still correctly pass the template arguments without having to resort to preprocessor macros.

Collaborator

It is possible to specialize a template partially, e.g.:

template<int x, int y, ggml_type type>
struct mmq_type_traits;

template<int x, int y>
struct mmq_type_traits<x, y, GGML_TYPE_F16> {
    static constexpr int qk = 1;
    static constexpr int qr = 1;
    static constexpr int z = x+2;
};
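
For illustration only (not code from this PR), such a trait could then be consumed along these lines:

```cpp
// Hypothetical usage: the kernel reads its per-type constants from the trait,
// so the ggml_type remains the only type-specific template argument.
template <int x, int y, ggml_type type>
static __global__ void some_mmq_kernel() {
    constexpr int qk = mmq_type_traits<x, y, type>::qk;
    constexpr int qr = mmq_type_traits<x, y, type>::qr;
    // ... qk/qr drive the tile loading and dot product code ...
}
```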

mofosyne added the refactoring and Review Complexity : High labels on Jun 5, 2024
JohannesGaessler merged commit 7d1a378 into ggerganov:master on Jun 5, 2024
71 checks passed
daniandtheweb mentioned this pull request on Jun 24, 2024