[CANN]: add the basic supports of Flash Attention kernel #13627


Open · wants to merge 18 commits into master

Conversation

@shibizhao

Dear authors,

This PR adds a Flash Attention (FA) kernel to the CANN backend. Currently it supports only F16 KV tensors and does not support logit softcap. We have tested the kernel on an Ascend 910B using test-backend-ops.
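
For reference, the kernel can be exercised in isolation with test-backend-ops (the same invocation used later in this thread):

./bin/test-backend-ops test -b CANN0 -o FLASH_ATTN_EXT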

Thanks.

@github-actions bot added the documentation (Improvements or additions to documentation) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 19, 2025
@shibizhao changed the title from "cann: add the basic supports of Flash Attention kernel" to "[CANN]: add the basic supports of Flash Attention kernel" on May 20, 2025
@hipudding self-requested a review on May 20, 2025 06:06
@hipudding added the Ascend NPU (issues specific to Ascend NPUs) label on May 20, 2025
@shibizhao (Author)

By the way, we have only tested on the 910B. At our school, the CANN environment on the 310P server is version 7.x, so we cannot compile llama.cpp with the CANN backend there.

@hipudding (Collaborator)

> By the way, we have only tested on the 910B. At our school, the CANN environment on the 310P server is version 7.x, so we cannot compile llama.cpp with the CANN backend there.

We can test it on the 310P.

@shibizhao (Author)

Evaluation Report on Ascend 910B + Kunpeng 920

Authors from Peking University: Bizhao Shi, Yuxin Yang, Ruiyang Ma, Guojie Luo

Llama-2-7B-f16

Scripts

./build/bin/llama-batched-bench -c 65536 -m ~/models/LLM-Research/Llama-2-7B_f16.gguf -npp 128,256,512 -ntg 128,256,512 -npl 1,2,4,8,16,32,64 --split-mode none --main-gpu 0 -ngl 999 [-fa]
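
Column legend (from llama-batched-bench): PP = prompt tokens per batch, TG = generated tokens per batch, B = number of parallel sequences, N_KV = required KV cache size; T_PP/S_PP, T_TG/S_TG, and T/S are the time (s) and speed (t/s) of prompt processing, text generation, and the full run, respectively.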

With FA

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
128 128 1 256 2.611 49.02 3.437 37.24 6.048 42.33
128 128 2 512 3.234 79.15 4.566 56.07 7.800 65.64
128 128 4 1024 4.717 108.55 6.290 81.40 11.006 93.04
128 128 8 2048 7.833 130.73 9.551 107.21 17.384 117.81
128 128 16 4096 13.672 149.80 16.037 127.71 29.708 137.87
128 128 32 8192 25.933 157.95 28.974 141.37 54.907 149.20
128 128 64 16384 50.124 163.43 55.615 147.30 105.739 154.95
128 256 1 384 2.479 51.64 6.931 36.93 9.410 40.81
128 256 2 768 3.234 79.17 9.256 55.31 12.490 61.49
128 256 4 1536 4.692 109.13 12.719 80.51 17.411 88.22
128 256 8 3072 7.699 133.01 19.338 105.91 27.037 113.62
128 256 16 6144 13.755 148.89 32.652 125.44 46.407 132.40
128 256 32 12288 25.822 158.62 59.660 137.31 85.483 143.75
128 256 64 24576 50.207 163.16 116.131 141.08 166.338 147.75
128 512 1 640 2.485 51.51 13.985 36.61 16.470 38.86
128 512 2 1280 3.196 80.11 18.666 54.86 21.861 58.55
128 512 4 2560 4.709 108.72 25.902 79.07 30.611 83.63
128 512 8 5120 7.694 133.10 39.781 102.96 47.474 107.85
128 512 16 10240 13.722 149.25 68.060 120.36 81.781 125.21
128 512 32 20480 25.824 158.62 126.612 129.40 152.436 134.35
128 512 64 40960 50.331 162.76 249.659 131.25 299.991 136.54
256 128 1 384 3.253 78.70 3.527 36.29 6.780 56.63
256 128 2 768 4.687 109.24 5.063 50.57 9.750 78.77
256 128 4 1536 7.689 133.17 6.436 79.55 14.125 108.74
256 128 8 3072 13.735 149.11 9.771 104.80 23.506 130.69
256 128 16 6144 25.776 158.91 16.527 123.92 42.303 145.24
256 128 32 12288 50.121 163.44 30.283 135.26 80.404 152.83
256 128 64 24576 99.680 164.37 59.057 138.71 158.737 154.82
256 256 1 512 3.277 78.12 7.003 36.55 10.280 49.81
256 256 2 1024 4.720 108.47 9.356 54.72 14.076 72.75
256 256 4 2048 7.905 129.54 12.960 79.01 20.865 98.16
256 256 8 4096 13.704 149.45 19.782 103.53 33.486 122.32
256 256 16 8192 25.773 158.92 33.657 121.70 59.430 137.84
256 256 32 16384 50.156 163.33 62.378 131.33 112.534 145.59
256 256 64 32768 99.887 164.02 122.710 133.52 222.597 147.21
256 512 1 768 3.278 78.10 14.091 36.33 17.369 44.22
256 512 2 1536 4.698 108.97 18.941 54.06 23.639 64.98
256 512 4 3072 7.722 132.60 26.353 77.71 34.075 90.15
256 512 8 6144 13.717 149.31 40.608 100.87 54.325 113.10
256 512 16 12288 25.824 158.61 70.141 116.79 95.965 128.05
256 512 32 24576 50.250 163.03 132.627 123.53 182.877 134.39
256 512 64 49152 100.315 163.32 264.479 123.90 364.795 134.74
512 128 1 640 4.751 107.76 3.569 35.86 8.321 76.92
512 128 2 1280 7.707 132.87 4.811 53.21 12.518 102.25
512 128 4 2560 13.720 149.27 6.666 76.81 20.386 125.58
512 128 8 5120 25.769 158.95 10.230 100.10 35.999 142.23
512 128 16 10240 50.174 163.27 17.592 116.42 67.766 151.11
512 128 32 20480 99.779 164.20 32.819 124.80 132.598 154.45
512 128 64 40960 201.561 162.57 64.568 126.87 266.129 153.91
512 256 1 768 4.701 108.92 7.126 35.93 11.826 64.94
512 256 2 1536 7.675 133.43 9.659 53.01 17.333 88.61
512 256 4 3072 13.732 149.14 13.344 76.74 27.076 113.46
512 256 8 6144 25.798 158.77 20.720 98.84 46.518 132.08
512 256 16 12288 50.158 163.32 35.775 114.49 85.934 142.99
512 256 32 24576 99.680 164.37 67.347 121.64 167.027 147.14
512 256 64 49152 202.263 162.01 133.739 122.51 336.001 146.29
512 512 1 1024 4.738 108.06 14.405 35.54 19.143 53.49
512 512 2 2048 7.701 132.96 19.593 52.26 27.294 75.03
512 512 4 4096 13.742 149.03 27.127 75.50 40.869 100.22
512 512 8 8192 25.772 158.93 42.481 96.42 68.253 120.02
512 512 16 16384 50.185 163.24 74.460 110.02 124.644 131.45
512 512 32 32768 100.074 163.72 142.046 115.34 242.121 135.34
512 512 64 65536 203.587 160.95 287.718 113.89 491.306 133.39

Without FA

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
128 128 1 256 2.484 51.52 6.060 21.12 8.544 29.96
128 128 2 512 3.382 75.70 7.849 32.61 11.231 45.59
128 128 4 1024 4.705 108.81 9.930 51.56 14.636 69.97
128 128 8 2048 7.809 131.13 14.094 72.66 21.903 93.50
128 128 16 4096 13.959 146.72 23.576 86.87 37.535 109.12
128 128 32 8192 26.398 155.16 43.669 93.80 70.067 116.92
128 128 64 16384 52.106 157.22 86.425 94.79 138.531 118.27
128 256 1 384 2.472 51.79 12.770 20.05 15.242 25.19
128 256 2 768 3.206 79.85 16.053 31.89 19.259 39.88
128 256 4 1536 4.734 108.15 20.449 50.08 25.183 60.99
128 256 8 3072 7.737 132.35 31.228 65.58 38.965 78.84
128 256 16 6144 13.923 147.10 52.560 77.93 66.483 92.42
128 256 32 12288 26.282 155.85 99.787 82.09 126.070 97.47
128 256 64 24576 52.504 156.03 199.672 82.05 252.175 97.46
128 512 1 640 2.492 51.37 26.737 19.15 29.229 21.90
128 512 2 1280 3.257 78.61 33.246 30.80 36.503 35.07
128 512 4 2560 4.732 108.19 45.475 45.04 50.208 50.99
128 512 8 5120 7.774 131.72 73.254 55.92 81.028 63.19
128 512 16 10240 13.811 148.29 127.577 64.21 141.388 72.42
128 512 32 20480 26.469 154.75 247.673 66.15 274.141 74.71
128 512 64 40960 52.364 156.44 505.619 64.81 557.983 73.41
256 128 1 384 3.255 78.65 6.703 19.09 9.958 38.56
256 128 2 768 4.704 108.84 8.188 31.26 12.892 59.57
256 128 4 1536 7.804 131.21 10.529 48.63 18.333 83.78
256 128 8 3072 13.859 147.77 17.210 59.50 31.069 98.88
256 128 16 6144 26.449 154.86 29.389 69.69 55.838 110.03
256 128 32 12288 52.430 156.25 56.929 71.95 109.360 112.36
256 128 64 24576 109.508 149.61 116.553 70.29 226.062 108.71
256 256 1 512 3.263 78.45 13.597 18.83 16.861 30.37
256 256 2 1024 4.726 108.33 16.703 30.65 21.429 47.79
256 256 4 2048 7.751 132.10 22.090 46.36 29.841 68.63
256 256 8 4096 13.925 147.08 36.786 55.67 50.711 80.77
256 256 16 8192 26.297 155.76 63.942 64.06 90.238 90.78
256 256 32 16384 52.825 155.08 125.673 65.19 178.498 91.79
256 256 64 32768 109.027 150.28 260.214 62.96 369.240 88.74
256 512 1 768 3.276 78.16 27.850 18.38 31.126 24.67
256 512 2 1536 4.746 107.89 34.535 29.65 39.280 39.10
256 512 4 3072 7.785 131.53 50.180 40.81 57.965 53.00
256 512 8 6144 13.847 147.90 84.451 48.50 98.298 62.50
256 512 16 12288 26.463 154.78 151.157 54.20 177.621 69.18
256 512 32 24576 52.490 156.07 300.016 54.61 352.506 69.72
256 512 64 49152 109.592 149.50 616.284 53.17 725.876 67.71
512 128 1 640 4.724 108.39 7.043 18.17 11.767 54.39
512 128 2 1280 7.782 131.59 8.667 29.54 16.449 77.81
512 128 4 2560 13.852 147.84 13.416 38.16 27.269 93.88
512 128 8 5120 26.476 154.71 22.455 45.60 48.930 104.64
512 128 16 10240 52.402 156.33 40.547 50.51 92.950 110.17
512 128 32 20480 109.698 149.36 81.209 50.44 190.907 107.28
512 128 64 40960 240.247 136.39 168.863 48.51 409.110 100.12
512 256 1 768 4.820 106.23 14.268 17.94 19.088 40.24
512 256 2 1536 7.776 131.68 17.944 28.53 25.720 59.72
512 256 4 3072 13.957 146.74 28.172 36.35 42.129 72.92
512 256 8 6144 26.305 155.71 48.356 42.35 74.661 82.29
512 256 16 12288 52.738 155.33 87.019 47.07 139.758 87.92
512 256 32 24576 109.195 150.04 174.406 46.97 283.602 86.66
512 256 64 49152 241.761 135.54 362.944 45.14 604.706 81.28
512 512 1 1024 4.768 107.38 29.159 17.56 33.928 30.18
512 512 2 2048 7.917 129.34 37.779 27.10 45.696 44.82
512 512 4 4096 13.819 148.20 61.382 33.36 75.201 54.47
512 512 8 8192 26.494 154.60 105.948 38.66 132.442 61.85
512 512 16 16384 52.440 156.22 196.027 41.79 248.467 65.94
512 512 32 32768 109.805 149.21 397.307 41.24 507.111 64.62
512 512 64 65536 240.651 136.16 826.526 39.65 1067.177 61.41

Qwen3-14B-Q8_0

Scripts

./build/bin/llama-batched-bench -c 65536 -m ~/models/Qwen/Qwen3-14B_Q8_0.gguf -npp 128,256,512 -ntg 128,256,512 -npl 1,2,4,8,16,32,64 --split-mode none --main-gpu 0 -ngl 999 [-fa]

With FA

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
128 128 1 256 0.602 212.59 6.487 19.73 7.090 36.11
128 128 2 512 0.607 421.70 7.037 36.38 7.644 66.98
128 128 4 1024 0.688 744.68 7.358 69.58 8.046 127.27
128 128 8 2048 0.772 1325.58 7.802 131.25 8.574 238.85
128 128 16 4096 1.018 2011.85 8.279 247.36 9.297 440.56
128 128 32 8192 1.586 2582.21 8.555 478.76 10.142 807.76
128 128 64 16384 2.903 2822.21 11.454 715.22 14.357 1141.22
128 256 1 384 0.603 212.22 13.003 19.69 13.606 28.22
128 256 2 768 0.607 421.73 14.158 36.16 14.765 52.02
128 256 4 1536 0.652 785.62 14.690 69.71 15.342 100.12
128 256 8 3072 0.771 1328.02 15.646 130.90 16.417 187.12
128 256 16 6144 1.018 2012.46 16.967 241.41 17.985 341.62
128 256 32 12288 1.582 2589.00 18.573 441.08 20.155 609.69
128 256 64 24576 3.139 2609.93 27.013 606.52 30.152 815.08
128 512 1 640 0.598 214.04 26.102 19.62 26.700 23.97
128 512 2 1280 0.607 421.58 28.382 36.08 28.989 44.15
128 512 4 2560 0.651 786.33 29.619 69.15 30.270 84.57
128 512 8 5120 0.770 1330.21 31.999 128.01 32.768 156.25
128 512 16 10240 1.015 2017.34 35.463 231.00 36.478 280.72
128 512 32 20480 1.588 2579.61 42.133 388.86 43.721 468.42
128 512 64 40960 3.476 2356.84 68.228 480.27 71.704 571.24
256 128 1 384 0.614 417.18 6.545 19.56 7.159 53.64
256 128 2 768 0.650 787.68 7.091 36.10 7.741 99.21
256 128 4 1536 0.776 1319.87 7.371 69.47 8.146 188.55
256 128 8 3072 1.033 1982.38 7.815 131.02 8.848 347.18
256 128 16 6144 1.580 2593.14 8.513 240.56 10.093 608.74
256 128 32 12288 2.879 2845.02 9.449 433.46 12.329 996.68
256 128 64 24576 6.508 2517.54 13.927 588.21 20.435 1202.65
256 256 1 512 0.618 413.97 13.048 19.62 13.666 37.46
256 256 2 1024 0.651 786.79 14.185 36.09 14.836 69.02
256 256 4 2048 0.795 1288.67 14.851 68.95 15.646 130.90
256 256 8 4096 1.019 2009.16 15.872 129.03 16.891 242.49
256 256 16 8192 1.576 2599.33 17.524 233.74 19.100 428.91
256 256 32 16384 2.993 2737.06 20.359 402.37 23.352 701.60
256 256 64 32768 7.269 2253.92 31.833 514.69 39.102 838.02
256 512 1 768 0.611 418.70 26.117 19.60 26.728 28.73
256 512 2 1536 0.653 784.39 28.477 35.96 29.130 52.73
256 512 4 3072 0.788 1299.70 29.862 68.58 30.650 100.23
256 512 8 6144 1.031 1985.62 32.568 125.77 33.600 182.86
256 512 16 12288 1.608 2546.99 37.092 220.86 38.700 317.52
256 512 32 24576 3.415 2399.14 46.545 352.00 49.960 491.91
256 512 64 49152 9.144 1791.80 80.369 407.72 89.513 549.10
512 128 1 640 0.685 747.78 6.565 19.50 7.249 88.28
512 128 2 1280 0.770 1329.60 7.173 35.69 7.943 161.14
512 128 4 2560 1.043 1963.36 7.457 68.66 8.500 301.16
512 128 8 5120 1.581 2590.62 8.139 125.81 9.720 526.73
512 128 16 10240 2.889 2835.58 9.126 224.41 12.015 852.27
512 128 32 20480 6.469 2532.71 11.276 363.24 17.745 1154.11
512 128 64 40960 18.111 1809.33 18.655 439.14 36.765 1114.09
512 256 1 768 0.673 760.42 13.102 19.54 13.775 55.75
512 256 2 1536 0.771 1328.81 14.324 35.74 15.095 101.76
512 256 4 3072 1.017 2013.71 14.929 68.59 15.946 192.65
512 256 8 6144 1.567 2614.39 16.356 125.21 17.923 342.80
512 256 16 12288 2.928 2797.47 18.784 218.06 21.713 565.94
512 256 32 24576 7.076 2315.33 23.976 341.68 31.052 791.44
512 256 64 49152 20.831 1573.03 41.525 394.56 62.356 788.25
512 512 1 1024 0.684 748.55 26.237 19.51 26.921 38.04
512 512 2 2048 0.788 1299.71 28.848 35.50 29.636 69.11
512 512 4 4096 1.017 2013.14 30.145 67.94 31.162 131.44
512 512 8 8192 1.605 2551.98 33.498 122.27 35.104 233.37
512 512 16 16384 3.243 2526.05 39.920 205.21 43.163 379.58
512 512 32 32768 8.391 1952.63 54.109 302.80 62.499 524.29
512 512 64 65536 26.338 1244.13 101.482 322.89 127.820 512.72

Without FA

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
128 128 1 256 0.606 211.24 7.568 16.91 8.174 31.32
128 128 2 512 0.618 414.14 8.299 30.85 8.917 57.42
128 128 4 1024 0.667 767.67 8.753 58.50 9.419 108.71
128 128 8 2048 0.825 1241.92 9.671 105.88 10.496 195.13
128 128 16 4096 1.154 1774.80 11.384 179.90 12.538 326.68
128 128 32 8192 2.046 2001.83 14.565 281.23 16.611 493.18
128 128 64 16384 4.786 1711.72 24.961 328.20 29.746 550.79
128 256 1 384 0.615 208.12 15.360 16.67 15.975 24.04
128 256 2 768 0.616 415.43 16.725 30.61 17.341 44.29
128 256 4 1536 0.674 759.38 17.696 57.87 18.370 83.62
128 256 8 3072 0.806 1270.80 20.537 99.72 21.342 143.94
128 256 16 6144 1.131 1810.54 25.457 160.90 26.589 231.08
128 256 32 12288 2.070 1978.83 35.569 230.31 37.639 326.47
128 256 64 24576 4.926 1663.06 66.478 246.46 71.404 344.18
128 512 1 640 0.614 208.57 31.134 16.45 31.747 20.16
128 512 2 1280 0.632 405.02 34.036 30.09 34.668 36.92
128 512 4 2560 0.664 771.09 37.297 54.91 37.961 67.44
128 512 8 5120 0.806 1270.11 46.201 88.66 47.007 108.92
128 512 16 10240 1.132 1809.06 60.407 135.61 61.539 166.40
128 512 32 20480 2.100 1950.33 91.316 179.42 93.416 219.23
128 512 64 40960 4.985 1643.29 191.041 171.52 196.026 208.95
256 128 1 384 0.626 409.25 7.775 16.46 8.400 45.71
256 128 2 768 0.663 772.62 8.442 30.32 9.105 84.35
256 128 4 1536 0.838 1221.30 8.993 56.94 9.831 156.24
256 128 8 3072 1.127 1816.74 10.942 93.59 12.069 254.53
256 128 16 6144 2.074 1974.98 14.177 144.46 16.251 378.07
256 128 32 12288 5.026 1630.08 21.726 188.53 26.752 459.33
256 128 64 24576 15.583 1051.40 43.979 186.27 59.562 412.61
256 256 1 512 0.769 332.74 15.633 16.38 16.403 31.21
256 256 2 1024 0.663 772.40 17.040 30.05 17.703 57.84
256 256 4 2048 0.804 1273.13 18.433 55.55 19.237 106.46
256 256 8 4096 1.154 1775.04 23.135 88.52 24.289 168.64
256 256 16 8192 2.077 1971.71 30.118 136.00 32.195 254.45
256 256 32 16384 5.051 1621.86 47.640 171.95 52.691 310.94
256 256 64 32768 15.864 1032.79 103.328 158.56 119.192 274.92
256 512 1 768 0.626 409.04 31.577 16.21 32.203 23.85
256 512 2 1536 0.682 750.83 34.626 29.57 35.308 43.50
256 512 4 3072 0.804 1273.13 39.390 51.99 40.194 76.43
256 512 8 6144 1.127 1816.76 51.457 79.60 52.584 116.84
256 512 16 12288 2.085 1964.60 71.083 115.25 73.168 167.94
256 512 32 24576 5.073 1614.76 116.455 140.69 121.528 202.22
256 512 64 49152 15.913 1029.60 258.834 126.60 274.747 178.90
512 128 1 640 0.694 738.05 7.921 16.16 8.615 74.29
512 128 2 1280 0.824 1242.88 8.739 29.29 9.563 133.85
512 128 4 2560 1.154 1774.65 10.199 50.20 11.354 225.48
512 128 8 5120 2.089 1960.41 13.509 75.80 15.598 328.25
512 128 16 10240 5.020 1631.97 19.268 106.29 24.287 421.62
512 128 32 20480 15.561 1052.92 31.343 130.68 46.903 436.64
512 128 64 40960 55.053 595.21 73.735 111.10 128.788 318.04
512 256 1 768 0.691 741.36 15.951 16.05 16.642 46.15
512 256 2 1536 0.977 1048.15 17.594 29.10 18.571 82.71
512 256 4 3072 1.130 1812.90 20.972 48.83 22.102 138.99
512 256 8 6144 2.082 1967.61 28.300 72.37 30.381 202.23
512 256 16 12288 5.043 1624.30 40.820 100.34 45.863 267.93
512 256 32 24576 15.731 1041.51 69.208 118.37 84.939 289.34
512 256 64 49152 54.769 598.29 163.681 100.10 218.450 225.00
512 512 1 1024 0.710 721.24 32.244 15.88 32.954 31.07
512 512 2 2048 0.810 1263.62 36.088 28.37 36.899 55.50
512 512 4 4096 1.301 1573.70 44.401 46.13 45.702 89.62
512 512 8 8192 2.081 1968.19 59.872 68.41 61.953 132.23
512 512 16 16384 5.032 1628.14 88.457 92.61 93.489 175.25
512 512 32 32768 15.602 1050.14 160.120 102.32 175.722 186.48
512 512 64 65536 55.415 591.32 383.509 85.44 438.923 149.31

Qwen3-32B-Q8_0

Scripts

./build/bin/llama-batched-bench -c 65536 -m ~/models/Qwen/Qwen3-32B_Q8_0.gguf -npp 128,256,512 -ntg 128,256,512 -npl 1,2,4,8,16,32,64 --split-mode none --main-gpu 0 -ngl 999 [-fa]

With FA

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
128 128 1 256 0.983 130.25 14.148 9.05 15.131 16.92
128 128 2 512 1.034 247.49 15.089 16.97 16.123 31.76
128 128 4 1024 1.115 459.12 15.569 32.89 16.684 61.38
128 128 8 2048 1.368 748.44 16.263 62.97 17.631 116.16
128 128 16 4096 1.958 1045.95 17.245 118.76 19.203 213.30
128 128 32 8192 3.132 1307.92 17.566 233.18 20.698 395.79
128 128 64 16384 6.018 1361.36 22.924 357.36 28.941 566.12
128 256 1 384 0.999 128.10 28.371 9.02 29.370 13.07
128 256 2 768 1.008 254.06 30.309 16.89 31.317 24.52
128 256 4 1536 1.144 447.51 31.315 32.70 32.459 47.32
128 256 8 3072 1.367 748.93 32.742 62.55 34.109 90.06
128 256 16 6144 1.974 1037.45 35.300 116.03 37.274 164.83
128 256 32 12288 3.135 1306.63 37.697 217.31 40.832 300.94
128 256 64 24576 6.087 1345.81 53.306 307.36 59.393 413.79
128 512 1 640 1.005 127.39 56.847 9.01 57.852 11.06
128 512 2 1280 1.008 254.05 60.989 16.79 61.997 20.65
128 512 4 2560 1.124 455.45 63.104 32.45 64.229 39.86
128 512 8 5120 1.524 671.92 66.717 61.39 68.241 75.03
128 512 16 10240 1.925 1063.78 74.475 110.00 76.400 134.03
128 512 32 20480 3.189 1284.28 85.010 192.73 88.200 232.20
128 512 64 40960 6.261 1308.40 131.042 250.06 137.304 298.32
256 128 1 384 1.022 250.58 14.242 8.99 15.264 25.16
256 128 2 768 1.109 461.69 15.240 16.80 16.349 46.98
256 128 4 1536 1.398 732.72 15.722 32.57 17.119 89.72
256 128 8 3072 1.924 1064.23 16.467 62.18 18.391 167.03
256 128 16 6144 3.185 1286.14 17.896 114.44 21.081 291.45
256 128 32 12288 5.964 1373.59 19.357 211.61 25.321 485.30
256 128 64 24576 13.538 1210.20 27.452 298.41 40.990 599.56
256 256 1 512 1.013 252.70 28.446 9.00 29.460 17.38
256 256 2 1024 1.117 458.19 30.528 16.77 31.645 32.36
256 256 4 2048 1.370 747.46 31.651 32.35 33.021 62.02
256 256 8 4096 1.945 1053.09 33.202 61.68 35.147 116.54
256 256 16 8192 3.130 1308.69 36.643 111.78 39.773 205.97
256 256 32 16384 6.016 1361.67 41.327 198.22 47.343 346.07
256 256 64 32768 14.307 1145.17 62.465 262.29 76.772 426.82
256 512 1 768 1.022 250.52 56.950 8.99 57.972 13.25
256 512 2 1536 1.107 462.42 61.316 16.70 62.423 24.61
256 512 4 3072 1.388 737.65 63.599 32.20 64.987 47.27
256 512 8 6144 1.926 1063.50 67.708 60.50 69.633 88.23
256 512 16 12288 3.149 1300.58 77.623 105.54 80.772 152.13
256 512 32 24576 6.249 1310.93 93.462 175.30 99.711 246.47
256 512 64 49152 16.148 1014.60 154.350 212.30 170.498 288.28
512 128 1 640 1.125 455.19 14.302 8.95 15.427 41.49
512 128 2 1280 1.368 748.35 15.475 16.54 16.843 76.00
512 128 4 2560 1.944 1053.30 15.952 32.10 17.896 143.05
512 128 8 5120 3.113 1315.70 17.024 60.15 20.137 254.26
512 128 16 10240 5.943 1378.32 19.469 105.19 25.413 402.95
512 128 32 20480 13.388 1223.76 22.824 179.46 36.212 565.55
512 128 64 40960 35.723 917.29 35.860 228.44 71.583 572.20
512 256 1 768 1.125 455.05 28.569 8.96 29.694 25.86
512 256 2 1536 1.371 746.85 30.843 16.60 32.215 47.68
512 256 4 3072 1.917 1068.44 31.912 32.09 33.829 90.81
512 256 8 6144 3.121 1312.36 34.308 59.69 37.429 164.15
512 256 16 12288 6.019 1361.01 39.928 102.59 45.947 267.44
512 256 32 24576 13.586 1205.92 48.301 169.60 61.888 397.11
512 256 64 49152 39.760 824.15 79.873 205.13 119.633 410.86
512 512 1 1024 1.141 448.76 57.190 8.95 58.331 17.56
512 512 2 2048 1.374 745.05 62.175 16.47 63.549 32.23
512 512 4 4096 1.924 1064.22 64.353 31.82 66.277 61.80
512 512 8 8192 3.145 1302.46 70.050 58.47 73.195 111.92
512 512 16 16384 6.039 1356.59 83.984 97.54 90.023 182.00
512 512 32 32768 14.991 1092.95 108.299 151.28 123.290 265.78
512 512 64 65536 47.882 684.34 190.131 172.34 238.014 275.35

Without FA

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
128 128 1 256 1.004 127.51 15.905 8.05 16.909 15.14
128 128 2 512 1.027 249.31 17.028 15.03 18.055 28.36
128 128 4 1024 1.138 450.02 17.789 28.78 18.926 54.10
128 128 8 2048 1.447 707.47 19.382 52.83 20.830 98.32
128 128 16 4096 2.222 921.80 22.671 90.34 24.892 164.55
128 128 32 8192 4.234 967.42 27.892 146.85 32.126 254.99
128 128 64 16384 10.369 790.04 48.891 167.56 59.260 276.48
128 256 1 384 0.990 129.32 32.083 7.98 33.072 11.61
128 256 2 768 1.022 250.41 34.382 14.89 35.404 21.69
128 256 4 1536 1.273 402.27 36.180 28.30 37.453 41.01
128 256 8 3072 1.445 708.70 41.076 49.86 42.521 72.25
128 256 16 6144 2.214 924.90 49.780 82.28 51.994 118.17
128 256 32 12288 4.273 958.68 66.292 123.57 70.565 174.14
128 256 64 24576 10.486 781.23 126.601 129.41 137.087 179.27
128 512 1 640 1.053 121.52 64.987 7.88 66.040 9.69
128 512 2 1280 1.047 244.59 69.936 14.64 70.983 18.03
128 512 4 2560 1.142 448.27 76.178 26.88 77.321 33.11
128 512 8 5120 1.450 706.33 91.764 44.64 93.214 54.93
128 512 16 10240 2.215 924.51 115.488 70.93 117.703 87.00
128 512 32 20480 4.262 961.15 168.092 97.47 172.354 118.83
128 512 64 40960 10.539 777.31 364.535 89.89 375.074 109.21
256 128 1 384 1.048 244.18 16.168 7.92 17.217 22.30
256 128 2 768 1.133 451.92 17.295 14.80 18.428 41.67
256 128 4 1536 1.448 707.39 18.378 27.86 19.826 77.48
256 128 8 3072 2.197 932.23 21.740 47.10 23.937 128.34
256 128 16 6144 4.293 954.17 27.218 75.24 31.511 194.98
256 128 32 12288 10.574 774.76 38.745 105.72 49.319 249.15
256 128 64 24576 32.279 507.58 79.954 102.46 112.233 218.97
256 256 1 512 1.040 246.23 32.519 7.87 33.559 15.26
256 256 2 1024 1.149 445.56 34.906 14.67 36.055 28.40
256 256 4 2048 1.452 705.03 37.642 27.20 39.094 52.39
256 256 8 4096 2.223 921.39 45.792 44.72 48.015 85.31
256 256 16 8192 4.265 960.46 57.395 71.37 61.659 132.86
256 256 32 16384 10.602 772.68 84.747 96.66 95.349 171.83
256 256 64 32768 32.336 506.68 186.740 87.74 219.076 149.57
256 512 1 768 1.062 241.02 65.598 7.81 66.660 11.52
256 512 2 1536 1.161 441.14 71.001 14.42 72.161 21.29
256 512 4 3072 1.448 707.42 79.918 25.63 81.366 37.76
256 512 8 6144 2.195 932.93 100.988 40.56 103.183 59.54
256 512 16 12288 4.417 927.32 132.675 61.75 137.092 89.63
256 512 32 24576 10.588 773.70 210.950 77.67 221.538 110.93
256 512 64 49152 32.333 506.73 480.575 68.19 512.907 95.83
512 128 1 640 1.155 443.14 16.448 7.78 17.604 36.36
512 128 2 1280 1.465 699.17 17.914 14.29 19.378 66.05
512 128 4 2560 2.195 933.02 20.583 24.87 22.779 112.39
512 128 8 5120 4.269 959.51 26.470 38.68 30.739 166.56
512 128 16 10240 10.604 772.55 35.764 57.26 46.368 220.84
512 128 32 20480 32.361 506.29 57.668 71.03 90.029 227.48

Qwen2-72B-Q8_0

Scripts

./build/bin/llama-batched-bench -c 65536 -m ~/models/Qwen/Qwen2-72B_q8_0.gguf -npp 128,256,512 -ntg 128,256,512 -npl 1,2,4,8,16,32,64 -ngl 999 [-fa]

With FA

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
128 128 1 256 1.372 93.30 38.387 3.33 39.759 6.44
128 128 2 512 1.448 176.84 35.743 7.16 37.191 13.77
128 128 4 1024 1.580 324.02 36.357 14.08 37.937 26.99
128 128 8 2048 2.066 495.71 37.195 27.53 39.261 52.16
128 128 16 4096 3.008 680.91 38.329 53.43 41.337 99.09
128 128 32 8192 5.225 783.98 34.634 118.26 39.859 205.53
128 128 64 16384 10.377 789.41 41.835 195.82 52.213 313.79
128 256 1 384 1.393 91.90 76.870 3.33 78.263 4.91
128 256 2 768 1.426 179.56 71.721 7.14 73.147 10.50
128 256 4 1536 1.601 319.75 72.916 14.04 74.517 20.61
128 256 8 3072 2.035 503.24 74.645 27.44 76.679 40.06
128 256 16 6144 3.039 674.02 77.652 52.75 80.691 76.14
128 256 32 12288 5.336 767.56 71.958 113.84 77.295 158.98
128 256 64 24576 10.752 761.93 90.031 181.98 100.782 243.85
128 512 1 640 1.412 90.67 154.127 3.32 155.539 4.11
128 512 2 1280 1.427 179.34 144.149 7.10 145.577 8.79
128 512 4 2560 1.585 322.98 146.922 13.94 148.507 17.24
128 512 8 5120 2.040 502.07 150.944 27.14 152.984 33.47
128 512 16 10240 3.043 672.96 158.487 51.69 161.530 63.39
128 512 32 20480 5.416 756.22 153.610 106.66 159.026 128.78
128 512 64 40960 11.169 733.45 203.749 160.83 214.918 190.58
256 128 1 384 1.477 173.35 38.522 3.32 39.999 9.60
256 128 2 768 1.591 321.77 35.918 7.13 37.509 20.47
256 128 4 1536 2.057 497.74 36.569 14.00 38.626 39.77
256 128 8 3072 3.011 680.28 37.398 27.38 40.409 76.02
256 128 16 6144 5.200 787.68 39.079 52.41 44.279 138.76
256 128 32 12288 10.379 789.28 36.840 111.19 47.219 260.24
256 128 64 24576 23.727 690.52 46.955 174.46 70.683 347.70
256 256 1 512 1.451 176.41 76.953 3.33 78.404 6.53
256 256 2 1024 1.566 326.86 71.976 7.11 73.543 13.92
256 256 4 2048 2.048 500.02 73.273 13.98 75.321 27.19
256 256 8 4096 3.018 678.70 75.227 27.22 78.244 52.35
256 256 16 8192 5.254 779.61 78.877 51.93 84.131 97.37
256 256 32 16384 10.689 766.39 76.207 107.50 86.896 188.55
256 256 64 32768 24.541 667.61 100.662 162.76 125.203 261.72
256 512 1 768 1.486 172.27 153.922 3.33 155.408 4.94
256 512 2 1536 1.567 326.69 144.149 7.10 145.716 10.54
256 512 4 3072 2.039 502.09 147.154 13.92 149.193 20.59
256 512 8 6144 3.048 671.82 151.881 26.97 154.930 39.66
256 512 16 12288 5.335 767.83 161.522 50.72 166.856 73.64
256 512 32 24576 10.975 746.42 162.131 101.05 173.106 141.97
256 512 64 49152 25.694 637.65 225.436 145.35 251.130 195.72
512 128 1 640 1.586 322.87 38.599 3.32 40.185 15.93
512 128 2 1280 2.044 501.10 36.142 7.08 38.185 33.52
512 128 4 2560 3.015 679.31 36.719 13.94 39.734 64.43
512 128 8 5120 5.169 792.49 38.111 26.87 43.280 118.30
512 128 16 10240 10.270 797.70 40.407 50.68 50.677 202.07
512 128 32 20480 23.506 697.03 40.956 100.01 64.462 317.71
512 128 64 40960 61.032 536.90 57.373 142.78 118.406 345.93
512 256 1 768 1.615 316.96 77.101 3.32 78.717 9.76
512 256 2 1536 2.085 491.05 72.335 7.08 74.420 20.64
512 256 4 3072 3.105 659.54 73.618 13.91 76.723 40.04
512 256 8 6144 5.218 785.03 76.483 26.78 81.701 75.20
512 256 16 12288 10.455 783.58 82.056 49.92 92.511 132.83
512 256 32 24576 24.223 676.38 84.581 96.85 108.804 225.87
512 256 64 49152 63.445 516.48 121.450 134.90 184.895 265.84
512 512 1 1024 1.609 318.18 154.456 3.31 156.065 6.56
512 512 2 2048 2.058 497.64 144.961 7.06 147.018 13.93
512 512 4 4096 3.051 671.26 147.749 13.86 150.800 27.16
512 512 8 8192 5.312 771.05 154.219 26.56 159.532 51.35
512 512 16 16384 10.784 759.66 168.836 48.52 179.620 91.21
512 512 32 32768 25.506 642.36 180.051 91.00 205.557 159.41
512 512 64 65536 67.685 484.13 268.905 121.86 336.590 194.71

Without FA

PP  TG  B  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  T(s)  S(t/s)
128 128 1 256 1.387 92.29 40.706 3.14 42.093 6.08
128 128 2 512 1.465 174.69 38.280 6.69 39.745 12.88
128 128 4 1024 1.602 319.56 39.234 13.05 40.836 25.08
128 128 8 2048 2.169 472.21 41.386 24.74 43.555 47.02
128 128 16 4096 3.456 592.53 45.278 45.23 48.735 84.05
128 128 32 8192 6.656 615.43 47.898 85.51 54.554 150.16
128 128 64 16384 15.853 516.75 76.937 106.48 92.789 176.57
128 256 1 384 1.384 92.51 81.874 3.13 83.257 4.61
128 256 2 768 1.447 176.97 76.820 6.66 78.266 9.81
128 256 4 1536 1.602 319.65 79.248 12.92 80.850 19.00
128 256 8 3072 2.136 479.42 85.638 23.91 87.774 35.00
128 256 16 6144 3.387 604.71 96.047 42.65 99.434 61.79
128 256 32 12288 6.712 610.23 108.826 75.28 115.538 106.35
128 256 64 24576 16.136 507.69 188.571 86.89 204.707 120.05
128 512 1 640 1.400 91.42 164.687 3.11 166.087 3.85
128 512 2 1280 1.460 175.32 155.386 6.59 156.846 8.16
128 512 4 2560 1.638 312.57 163.515 12.52 165.153 15.50
128 512 8 5120 2.160 474.05 182.685 22.42 184.845 27.70
128 512 16 10240 3.411 600.48 211.340 38.76 214.750 47.68
128 512 32 20480 6.668 614.25 261.947 62.55 268.616 76.24
128 512 64 40960 16.306 502.39 514.376 63.70 530.682 77.18
256 128 1 384 1.471 174.08 41.127 3.11 42.598 9.01
256 128 2 768 1.612 317.57 38.593 6.63 40.205 19.10
256 128 4 1536 2.145 477.44 40.030 12.79 42.175 36.42
256 128 8 3072 3.382 605.63 44.344 23.09 47.725 64.37
256 128 16 6144 6.750 606.78 50.710 40.39 57.461 106.93
256 128 32 12288 16.186 506.12 60.817 67.35 77.003 159.58
256 128 64 24576 47.326 346.20 112.135 73.05 159.461 154.12
256 256 1 512 1.485 172.34 82.465 3.10 83.951 6.10
256 256 2 1024 1.706 300.14 77.652 6.59 79.358 12.90
256 256 4 2048 2.237 457.79 81.149 12.62 83.386 24.56
256 256 8 4096 3.383 605.44 91.353 22.42 94.736 43.24
256 256 16 8192 6.685 612.76 104.897 39.05 111.581 73.42
256 256 32 16384 16.194 505.87 130.307 62.87 146.501 111.84
256 256 64 32768 47.441 345.36 257.209 63.70 304.650 107.56
256 512 1 768 1.462 175.11 165.687 3.09 167.149 4.59
256 512 2 1536 1.633 313.50 156.948 6.52 158.581 9.69
256 512 4 3072 2.134 479.93 168.065 12.19 170.198 18.05
256 512 8 6144 3.382 605.61 194.019 21.11 197.400 31.12
256 512 16 12288 6.704 611.01 231.779 35.34 238.482 51.53
256 512 32 24576 16.299 502.62 311.565 52.59 327.863 74.96
256 512 64 49152 47.588 344.29 647.057 50.64 694.645 70.76
512 128 1 640 1.634 313.29 41.503 3.08 43.137 14.84
512 128 2 1280 2.159 474.34 39.422 6.49 41.581 30.78
512 128 4 2560 3.448 594.04 42.801 11.96 46.249 55.35
512 128 8 5120 6.730 608.64 49.955 20.50 56.684 90.32
512 128 16 10240 16.408 499.28 61.076 33.53 77.484 132.16
512 128 32 20480 48.166 340.16 84.175 48.66 132.341 154.75
512 128 64 40960 162.809 201.27 182.069 44.99 344.878 118.77
512 256 1 768 1.623 315.51 83.171 3.08 84.794 9.06
512 256 2 1536 2.148 476.77 79.253 6.46 81.400 18.87
512 256 4 3072 3.444 594.72 86.890 11.79 90.333 34.01
512 256 8 6144 6.737 607.95 102.615 19.96 109.352 56.19
512 256 16 12288 16.379 500.15 126.873 32.28 143.252 85.78
512 256 32 24576 48.092 340.68 182.106 44.98 230.199 106.76
512 256 64 49152 164.418 199.30 399.139 41.05 563.557 87.22
512 512 1 1024 1.626 314.88 167.303 3.06 168.929 6.06
512 512 2 2048 2.141 478.30 160.724 6.37 162.865 12.57
512 512 4 4096 3.401 602.20 179.202 11.43 182.603 22.43
512 512 8 8192 6.743 607.45 212.768 19.25 219.511 37.32
512 512 16 16384 16.412 499.16 268.024 30.56 284.436 57.60
512 512 32 32768 48.841 335.46 414.713 39.51 463.554 70.69
512 512 64 65536 163.954 199.86 933.081 35.12 1097.035 59.74

@noemotiovon (Contributor) left a comment


You’ve implemented FlashAttention (FA) on CANN and provided a comprehensive test report — it looks excellent and is highly meaningful! Thank you so much to you and your colleagues for your valuable contributions to the llama.cpp project and your support for Huawei Ascend!


## TODO
- Support more models and data types.
- Support more models and d
Contributor


There seem to be some documentation errors here.

ggml_cann_pool_alloc bcast_pse_allocator(ctx.pool());
void* bcast_pse_buffer = nullptr;
if (src3)
    bcast_pse_buffer = bcast_pse_allocator.alloc(
Contributor


Could the memory allocation here be moved into the src3 != nullptr block below?
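
A minimal sketch of the suggested refactor; the allocation size is elided in the hunk above, so required_bytes below is a placeholder rather than the real expression:

ggml_cann_pool_alloc bcast_pse_allocator(ctx.pool());
void* bcast_pse_buffer = nullptr;
if (src3 != nullptr) {
    // allocate the broadcast PSE buffer only when a mask tensor is present
    bcast_pse_buffer = bcast_pse_allocator.alloc(required_bytes); // required_bytes: placeholder
}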

    if (src3)
        ggml_cann_release_resources(ctx, bcast_pse_tensor);
} else {
    throw std::runtime_error("Function not implemented");
Contributor


I think using GGML_ABORT("Function not implemented"); would be a better choice.
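
For context, a minimal before/after sketch of that suggestion; GGML_ABORT reports the message through ggml's error path and aborts, instead of throwing a C++ exception across the backend boundary:

// before
throw std::runtime_error("Function not implemented");
// after
GGML_ABORT("Function not implemented");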

#include <string>
#include <cstring>

#include "aclnnop/aclnn_flash_attention_score.h"
Contributor


Remove unnecessary includes.

#include "aclnnop/aclnn_flash_attention_score.h"
#include "aclnnop/aclnn_logical_not.h"

@@ -72,12 +72,23 @@
#include <exception>
#include <vector>

#include <iostream>
Contributor


Remove unnecessary includes.

#include <iostream>
#include <fstream>
#include <string>
#include <cstring>

@@ -45,6 +45,8 @@
#include <aclnnop/aclnn_cos.h>
#include <aclnnop/aclnn_log.h>
#include <aclnnop/aclnn_sign.h>
#include <aclnnop/aclnn_fused_infer_attention_score_v2.h>
Contributor


I suggest moving #include <aclnnop/aclnn_fused_infer_attention_score_v2.h> to the aclnn_ops.cpp file, and if we don't need aclnn_isneginf, it can be removed.

@shibizhao (Author)

We have updated the files according to the review comments. Thanks for your time. @noemotiovon @hipudding

@noemotiovon (Contributor) left a comment


Sorry, I just noticed a few minor issues.
I pulled your latest code and tested the FA operator using a script, but encountered the following problems.
Could you please help me check the cause? Thank you so much!
Environment:

910B3
CANN 8.1 RC1

Script:

./bin/test-backend-ops test -b CANN0 -o FLASH_ATTN_EXT

Error:

Backend 1/2: CANN0
ggml_backend_cann_context: device 0 async operator submission is OFF
  Device description: Ascend910B3
  Device memory: 62432 MB (62147 MB free)

  FLASH_ATTN_EXT(hsk=64,hsv=64,nh=4,nr=1,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f16,permute=[0,1,2,3]): new_pool_for_device: device 0 use vmm pool
CANN error: EZ9999: Inner Error!
EZ9999: [PID: 3008073] 2025-05-21-09:49:04.442.578 precision mode[2] should be 0 or 1[FUNC:InputAttrsPreProcess][FILE:incre_flash_attention_tiling.cc][LINE:303]
        TraceBack (most recent call last):
       FusedInferAttentionScore do tiling failed, ret is -1.
       Check NnopbaseExecutorDoTiling(executor) failed
       Check NnopbaseExecutorTilingAndUpdateBinInfo(executor) failed
       Check NnopbaseExecutorMatchCache(executor) failed
       Check NnopbaseRunForWorkspace(*executor, workspaceSize) failed

  current device: 0, in function ggml_cann_flash_attn_ext at /home/cmq/lcg/github/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2858
  aclnnFusedInferAttentionScoreV2GetWorkspaceSize(acl_q_tensor, acl_k_tensor_list, acl_v_tensor_list, bcast_pse_tensor, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, numHeads, scaleValue, preTokens, nextTokens, layout, numKeyValueHeads, sparseMode, innerPrecise, blockSize, antiquantMode, softmaxLseFlag, keyAntiquantMode, valueAntiquantMode, acl_dst_f16_tensor, nullptr, &workspaceSize, &executor)
/home/cmq/lcg/github/llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:65: CANN error

aclTensor* acl_src0_f16_tensor = nullptr;
aclTensor* acl_src1_f16_tensor = nullptr;
aclTensor* acl_src2_f16_tensor = nullptr;
aclTensor* acl_src3_f16_tensor = nullptr;
@noemotiovon (Contributor) commented May 21, 2025


This variable acl_src3_f16_tensor is not used and can likely be removed.

GGML_ABORT("Function not implemented");
}
}
Contributor


This line appears to contain some unexpected or strange characters.

@shibizhao (Author) commented May 21, 2025

Thanks for your reply. With CANN 8.0.RC2 there is no inner error. To reproduce this error, we rented an Ascend 910B running CANN 8.0.0. We have now fixed this bug according to your feedback.
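
The tiling failure above complains that "precision mode[2] should be 0 or 1", which points at the innerPrecise argument of aclnnFusedInferAttentionScoreV2GetWorkspaceSize. As a rough illustration only (the exact value chosen in the PR is not shown in this thread), the fix presumably amounts to keeping that argument in the range the older tiling accepts:

// assumption: this CANN version only accepts innerPrecise == 0 (high precision)
// or 1 (high performance); 2 triggers the tiling error above
int64_t innerPrecise = 1;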

@noemotiovon (Contributor) left a comment


Sorry for the late reply. The previous issue has been resolved by your fix; thank you very much for your contribution! There are still a few logic points here that I'd like you to help confirm.

if (op->src[0]->ne[0] == 64 && op->src[1]->type == GGML_TYPE_F16) {
    return true;
}
if (op->src[0]->ne[0] == 128) {
Contributor


I think there's a logical issue here. Currently, when op->src[0]->ne[0] == 128, the code allows the KV tensors to have a quantized type such as q4/q8, implying that this case is supported. However, quantized formats are not actually supported at the moment. I believe the logic should be adjusted accordingly to reflect this:

if (op->src[0]->ne[0] != 128) {
	return false;
}

Could you please help confirm the logic?

@shibizhao (Author)


We have updated the if-else logic to pass all of the tests.

}
if (op->src[0]->ne[0] == 256 && op->src[1]->type == GGML_TYPE_F16 && op->src[2]->type == GGML_TYPE_F16) {
    return true;
}
Contributor


It seems that the current FA implementation doesn't support cases where logitSoftcap is non-zero, so we should add a check here to ensure logitSoftcap equals 0, as shown in the code below.

float logitSoftcap = 0.0f;
memcpy(&logitSoftcap, (float*)op->op_params + 2, sizeof(float));
if (logitSoftcap != 0.0f) {
    return false;
}
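
Putting the review points together, a hedged sketch of what the FLASH_ATTN_EXT branch of the support check could look like; this is illustrative only, not the exact code merged in the PR:

case GGML_OP_FLASH_ATTN_EXT: {
    // FA on CANN currently requires F16 K and V tensors
    if (op->src[1]->type != GGML_TYPE_F16 || op->src[2]->type != GGML_TYPE_F16) {
        return false;
    }
    // only the head sizes exercised in the tests are supported
    const int64_t head_size = op->src[0]->ne[0];
    if (head_size != 64 && head_size != 128 && head_size != 256) {
        return false;
    }
    // logit softcap is not supported yet
    float logitSoftcap = 0.0f;
    memcpy(&logitSoftcap, (float*)op->op_params + 2, sizeof(float));
    return logitSoftcap == 0.0f;
}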

@shibizhao (Author)


Thanks for your comment. We have added it.

@shibizhao (Author)

Thanks for your comments. We have updated ggml-cann.cpp.

@noemotiovon (Contributor)

LGTM! Thank you for your outstanding contribution! 😊
cc @hipudding

@hipudding (Collaborator)

@shibizhao Please resolve conflicts. Thanks.

@shibizhao (Author)

> @shibizhao Please resolve conflicts. Thanks.

Thanks. We believe we have resolved the conflicts.
