Improve cpu prompt eval speed #6414

Merged
merged 1 commit into from
Apr 16, 2024

Conversation

jart
Contributor

@jart jart commented Apr 1, 2024

This change upstreams llamafile's CPU matrix multiplication kernels, which improve image and prompt evaluation speed. For starters, Q4_0 and Q8_0 weights should go ~40% faster on CPU. The biggest benefits are with data types like f16 / f32, which process prompts 2x faster, making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support, since tinyBLAS is able to exploit the larger register file. For example, on my CPU, llama.cpp llava-cli processes an image prompt at 305 tokens/second using the Q4_K and Q4_0 types, which has always been faster than using f16 LLaVA weights, which at HEAD go 188 tokens/second. With this change, f16 LLaVA performance leapfrogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x. For example, using llama.cpp at HEAD with Mistral 7b f16 to process a 215 token prompt will go 13 tok/sec. This change has fixes making it go 52 tok/sec. It's mostly thanks to my vectorized outer product kernels, but also because I added support for correctly counting the number of cores on Alder Lake, so the default thread count discounts Intel's new efficiency cores. Right now, core counting only works on Linux.

This work was sponsored by Mozilla, which has given permission to change the license of this code from Apache 2.0 to MIT. To read more about what's improved, and how it works, see: https://justine.lol/matmul/
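
The core-counting point in the description above can be illustrated with a small sketch. This is not taken from the PR; it is one possible approach, under these assumptions: the caller has already gathered, for every logical CPU, the core-type byte that CPUID leaf 0x1A reports in EAX bits 31:24 (0x20 for an efficiency core, 0x40 for a performance core on hybrid Intel parts), and each performance core exposes two hyperthreads that are numbered consecutively. The names `core_type_from_eax` and `count_math_cpus` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Core-type byte from CPUID leaf 0x1A, EAX bits 31:24, on hybrid Intel CPUs.
constexpr uint8_t kIntelAtom = 0x20;  // efficiency core
constexpr uint8_t kIntelCore = 0x40;  // performance core

// Extracts the core-type byte from a raw EAX value of CPUID leaf 0x1A.
inline uint8_t core_type_from_eax(uint32_t eax) {
    return static_cast<uint8_t>(eax >> 24);
}

// Counts the logical CPUs worth giving a math thread.
// Assumption (not from the PR): hyperthread siblings are adjacent in the
// logical CPU numbering, so the CPU after a counted one is skipped.
int count_math_cpus(const std::vector<uint8_t>& core_types) {
    int result = 0;
    for (std::size_t cpu = 0; cpu < core_types.size(); ++cpu) {
        if (core_types[cpu] == kIntelAtom) {
            continue;  // leave efficiency cores out of the default thread count
        }
        ++cpu;         // skip the assumed hyperthread sibling
        ++result;
    }
    return result;
}
```

For a hypothetical layout of 16 performance-core hyperthreads followed by 16 efficiency cores, `count_math_cpus` yields 8. Gathering the per-CPU values on real hardware would additionally require pinning the calling thread to each CPU before issuing CPUID, which is omitted here.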

@phymbert
Collaborator

phymbert commented Apr 1, 2024

Please fix the CI builds

Contributor

github-actions bot commented Apr 1, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 517 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=9017.64ms p(90)=25806.42ms fails=0, finish reason: stop=517 truncated=0
  • Prompt processing (pp): avg=238.16tk/s p(90)=711.8tk/s total=203.92tk/s
  • Token generation (tg): avg=95.48tk/s p(90)=248.93tk/s total=128.87tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sgemm commit=8dbe58213391399b2e3b60b5b116b5dd6b864f96
Time series

[Charts omitted: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing over time for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 517 iterations]

@JohannesGaessler
Collaborator

Some very quick tests on my Ryzen 5950X (power limited to 95 W):

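| Model | Threads | Test | t/s master | t/s PR | Speedup |
| --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 16 | pp 512 | 24.60 | 32.34 | 1.31 |
| llama 7B Q4_0 | 16 | tg 128 | 9.75 | 9.86 | 1.01 |
| llama 7B F16 | 16 | pp 512 | 27.70 | 42.74 | 1.54 |
| llama 7B F16 | 16 | tg 128 | 3.20 | 3.19 | 1.00 |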

A very respectable speedup!

Since you did not mention it in the OP, this PR does not touch the handling of NUMA nodes, correct?

@kalomaze
Contributor

kalomaze commented Apr 1, 2024

Is this not yet set up to support the CPU code used in partial GPU offloading? Will those require custom kernels?

@JohannesGaessler
Collaborator

> Is this not yet set up to support the CPU code used in partial GPU offloading? Will those require custom kernels?

This PR will not speed up CPU+GPU hybrid inference in any meaningful capacity. For large batches you are compute bound and all of the evaluations are done on the GPU. For small batches you are I/O bound and better matrix multiplication algorithms make virtually no difference.

@kalomaze
Contributor

kalomaze commented Apr 1, 2024

> For large batches you are compute bound and all of the evaluations are done on the GPU.

Does this mean it moves layers onto the GPU for large batches instead of processing all GPU layers for the current batch and then doing the remaining layers on CPU? I'm sort of lost; this goes against my current understanding (moving from CPU to GPU during inference should be slower)?

@JohannesGaessler
Collaborator

CPU layers have their data in RAM. GPU layers have their data in VRAM. GPU layers are always evaluated on the GPU.

The most recent update is this PR: #6083. For large batch sizes (prompt processing) all data of a CPU layer is moved to the GPU and the calculations are done there in order to make use of the higher GPU compute. For small batch sizes (token generation) CPU layers are evaluated on the CPU. This PR improves the compute efficiency of CPU matrix multiplication, so it only helps in those scenarios where it would also be worthwhile to temporarily move data to VRAM. The improvements in this PR are therefore mutually exclusive with CPU+GPU hybrid inference.

@zougloub

zougloub commented Apr 1, 2024

@jart this is pretty awesome; I would add that since a good portion of the contributed code is very generic and could benefit many other downstream projects, it would be even more awesome if that code could be in its own repo; then a subset could be linked or vendored in here.

@jart
Contributor Author

jart commented Apr 1, 2024

@phymbert Tests are green. Please take a look.

@phymbert
Collaborator

phymbert commented Apr 1, 2024

> @phymbert Tests are green. Please take a look.

Thank you very much for the contribution. For the core library and ggml changes, @slaren and @ggerganov will get back to you.

@jart
Contributor Author

jart commented Apr 1, 2024

@zougloub Thank you for the encouragement. You can copy sgemm.cpp into your codebase as its own library if you provide an implementation for GGML_FP16_TO_FP32(). It would be challenging to create a bona fide library for this, because GEMM has more depth the more stakeholders you have. This code is written to focus only on what's good for llama.cpp and nothing else. The parallel implementation in the llamafile codebase does things a little differently, based on what's best there.
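
For readers who want to try the standalone route described above, here is a minimal sketch of what a portable `GGML_FP16_TO_FP32()` replacement could look like. It assumes half-precision values are stored as `uint16_t` IEEE 754 binary16 bit patterns; the function name `fp16_to_fp32` is hypothetical, and ggml's own conversion may differ (for example by using hardware instructions or lookup tables).

```cpp
#include <cstdint>
#include <cstring>

// Converts an IEEE 754 binary16 bit pattern to a float using integer math only.
static inline float fp16_to_fp32(uint16_t h) {
    uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;   // 5-bit exponent, bias 15
    uint32_t man  = h & 0x3FFu;          // 10-bit mantissa
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;                 // signed zero
        } else {
            // Subnormal half: shift the mantissa up until its implicit bit appears.
            int shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                ++shift;
            }
            man &= 0x3FFu;
            bits = sign | (static_cast<uint32_t>(127 - 15 + 1 - shift) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);        // infinity or NaN
    } else {
        bits = sign | ((exp + (127 - 15)) << 23) | (man << 13);  // normal number
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);    // reinterpret the bits without aliasing issues
    return f;
}

// Only define the macro if the surrounding project has not provided one.
#ifndef GGML_FP16_TO_FP32
#define GGML_FP16_TO_FP32(x) fp16_to_fp32(x)
#endif
```

The integer-only route is slower than a hardware conversion but has no platform requirements, which may be the more useful trade-off when vendoring the file into another codebase.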

@ggerganov
Owner

@jart Apologies for the slow response - will review the PRs in the following days. Thanks

@jart
Contributor Author

jart commented Apr 1, 2024

Thanks @ggerganov I'm in no rush.

@netrunnereve
Collaborator

This PR did absolutely nothing for me on Q4_0 and Q8_0, then I realised that it only supported AVX2 and AVX512 for those quants. It does support regular AVX though for F16 and F32.

On my 4c/8t Xeon v2 I get a nice 2x speedup in F16. Just like vanilla llama.cpp you get the best CPU performance if you use all hyperthreads during prompt processing and switch to one thread per core for inference.

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B F16 | 2.05 GiB | 1.10 B | CPU | 8 | pp 512 | 33.36 ± 0.13 |
| llama 1B F16 | 2.05 GiB | 1.10 B | CPU | 4 | pp 512 | 32.19 ± 0.02 |
| llama 1B F16 (PR) | 2.05 GiB | 1.10 B | CPU | 8 | pp 512 | 60.47 ± 0.06 |
| llama 1B F16 (PR) | 2.05 GiB | 1.10 B | CPU | 4 | pp 512 | 52.88 ± 0.12 |

@JohannesGaessler
Collaborator

@netrunnereve in case you're not aware, you can run ./llama-bench -o sql | sqlite3 llama-bench.sqlite both on master and a PR and then scripts/compare-llama-bench.py to generate a table with a performance comparison.

@JeremyGe07

Does this PR benefit ARM CPU?

ggml.c Outdated
return;
}
UseGgmlGemm1:
(void)0;

; ?

Contributor Author

That's to avoid the compiler complaining if the label comes before a variable declaration.

@tstanisl tstanisl Apr 3, 2024

Yes, but using just ; is a simpler way to achieve the same goal without introducing dummy expressions. Just do:

UseGgmlGemm1: ;

This has worked perfectly fine since the C99 standard. See https://godbolt.org/z/6sbKhnhW9.

BTW, this issue with labels is fixed in the C23 standard.

Contributor Author

Thank you @tstanisl, you taught me something I didn't know. Fixed!

ggml.c Outdated
return;
}
UseGgmlGemm2:
(void)0;

ditto

@sorasoras

sorasoras commented Apr 2, 2024

> Does this PR benefit ARM CPU?

I think so. It has to be ARMv8.2+, I guess.
https://justine.lol/matmul/

@jart jart mentioned this pull request Apr 2, 2024
@Jipok

Jipok commented Apr 2, 2024

Should I have acceleration for Q8 if I only have AVX and AVX2? I tested and found no differences.
Do I need to build with some kind of blas?

@phymbert
Collaborator

phymbert commented Apr 2, 2024

https://justine.lol/matmul/ is a must read ^^) Thank you @jart, you got a new Patron

@moshemalawach

Using it on many CPU setups and it speeds up everything on context processing!

@ZelinMa557

Can these kernels make the token generation faster?

@lin72h

lin72h commented Apr 8, 2024

> Can these kernels make the token generation faster?

I think probably not, because token generation is memory-bandwidth bound.

Owner

@ggerganov ggerganov left a comment

CPU speedups are always welcome, but I’m worried about the maintenance effort for the core ggml library increasing, so I’m still unsure how to proceed with this PR.

A similar discussion already took place in #5780, and there are likely to be other matrix-multiplication improvements proposed:

On one hand, this change is well decoupled, which is good, but at the same time it introduces a new block-wise matrix-multiplication pattern that is different from the existing dot-based implementations. It’s obviously significantly more performant since it utilizes the CPU cache much more efficiently, which has not been the case so far. It also seems that the implementation can be extended to more instruction sets and quantization types in the future, so the amount of code has the potential to grow significantly.
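
To make the contrast between the dot-based and block-wise patterns concrete, here is a scalar sketch. It is not code from the PR: the function names are hypothetical, `A` is assumed to be m x k and `B` n x k (both row-major, so each output is a dot product of two rows), and the real kernels use SIMD registers and larger tiles. The point is only the ratio of loads to multiply-adds in the inner loop.

```cpp
#include <cassert>

// Dot-based pattern: every output element makes its own pass over k,
// so the inner loop issues 2 loads per multiply-add.
void matmul_dot(const float* A, const float* B, float* C, int m, int n, int k) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int l = 0; l < k; ++l) {
                sum += A[i * k + l] * B[j * k + l];
            }
            C[i * n + j] = sum;
        }
    }
}

// Block-wise pattern: a 2x2 tile of outputs is accumulated together,
// so 4 loads feed 4 multiply-adds and each loaded value is reused twice.
void matmul_block2x2(const float* A, const float* B, float* C, int m, int n, int k) {
    assert(m % 2 == 0 && n % 2 == 0);  // edge tiles are left out of this sketch
    for (int i = 0; i < m; i += 2) {
        for (int j = 0; j < n; j += 2) {
            float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;
            for (int l = 0; l < k; ++l) {
                float a0 = A[i * k + l];
                float a1 = A[(i + 1) * k + l];
                float b0 = B[j * k + l];
                float b1 = B[(j + 1) * k + l];
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            C[i * n + j]           = c00;
            C[i * n + j + 1]       = c01;
            C[(i + 1) * n + j]     = c10;
            C[(i + 1) * n + j + 1] = c11;
        }
    }
}
```

Larger tiles improve the ratio further, up to the point where the accumulators no longer fit in registers, which is presumably why a bigger register file such as AVX512's helps.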

The code is also in C++, while we generally prefer to keep the core implementation in C and allow C++ only in the backend implementations when desired. I’ve been pretty stubborn about this C requirement and it’s probably something to finally reconsider, but this PR is not the place to decide.

I don’t want to delay this for much longer, as I’ve already given this quite some thought and haven’t come to a good conclusion. I think the comments in #5780 apply to a good extent here (PTAL), so my suggestion is that we aim for this to become part of the future BLAS/matmul backend. The benefit of doing that is that the code becomes sort of an "extension" to ggml and can be developed more independently, without drawing a lot of attention from the core maintainers.

In the meantime, we can merge this change and, depending on how the development process goes (i.e. there is enough support from the community, bugs and issues are being resolved, functionality is reasonably extended, and it remains well decoupled from the rest of the code), we can potentially consider making this part of the core ggml library. But until then it will remain sort of a "second-class citizen".

@jart If that makes sense, we would need to put the ggml.c change behind a define (e.g. GGML_USE_TINYBLAS or GGML_USE_LLAMAFILE or something like this), so that the sgemm code becomes optional (we generally avoid such special cases, but we can make an exception this time). In llama.cpp builds we can have this enabled by default as it seems it is always better than the alternatives. This way, llamafile and other downstream projects can directly benefit from the changes, and we'll have more time to figure out what is the right way to integrate this into ggml.
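
The opt-in guard proposed above might look roughly like the following sketch. Only the macro name `GGML_USE_LLAMAFILE` comes from this thread; `fast_sgemm`, `mul_mat_reference`, and `mul_mat` are hypothetical stand-ins, and the `k % 4` restriction is purely illustrative of a fast path that declines shapes it does not handle.

```cpp
#ifndef GGML_USE_LLAMAFILE
#define GGML_USE_LLAMAFILE 1  // on by default in this sketch; compile with -DGGML_USE_LLAMAFILE=0 to opt out
#endif

// Always-available path: C[i*n + j] = dot(row i of A, row j of B).
static void mul_mat_reference(const float* A, const float* B, float* C, int m, int n, int k) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int l = 0; l < k; ++l) {
                sum += A[i * k + l] * B[j * k + l];
            }
            C[i * n + j] = sum;
        }
    }
}

#if GGML_USE_LLAMAFILE
// Hypothetical optional fast path. Returns false when it declines the
// request, so the caller can fall through to the reference path.
static bool fast_sgemm(const float* A, const float* B, float* C, int m, int n, int k) {
    if (k % 4 != 0) {
        return false;  // illustrative restriction only
    }
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
            for (int l = 0; l < k; l += 4) {
                s0 += A[i * k + l]     * B[j * k + l];
                s1 += A[i * k + l + 1] * B[j * k + l + 1];
                s2 += A[i * k + l + 2] * B[j * k + l + 2];
                s3 += A[i * k + l + 3] * B[j * k + l + 3];
            }
            C[i * n + j] = (s0 + s1) + (s2 + s3);
        }
    }
    return true;
}
#endif

// Call site: the optional code is compiled in only when the define is set.
void mul_mat(const float* A, const float* B, float* C, int m, int n, int k) {
#if GGML_USE_LLAMAFILE
    if (fast_sgemm(A, B, C, m, n, k)) {
        return;
    }
#endif
    mul_mat_reference(A, B, C, m, n, k);
}
```

Keeping the fast path behind a single boolean-returning entry point means the core code needs only one guarded call, which matches the "well decoupled" property noted in the review.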

If you are OK with that, we can proceed to merge

common/common.cpp Outdated
Comment on lines 161 to 168
if (cpu_count < 1)
return get_num_physical_cores();
Owner

Suggested change
- if (cpu_count < 1)
-     return get_num_physical_cores();
+ if (cpu_count < 1) {
+     return get_num_physical_cores();
+ }

sgemm.cpp Outdated
case GGML_TYPE_Q8_0: {
if (k % 32)
return false;
if (Btype != GGML_TYPE_Q8_0)
Owner

Suggested change
if (Btype != GGML_TYPE_Q8_0)
if (Btype != GGML_TYPE_Q8_0)

@jart
Contributor Author

jart commented Apr 11, 2024

Sounds good @ggerganov. Review comments addressed in 492b76d PTAL

@jart
Contributor Author

jart commented Apr 11, 2024

Also just want to draw attention to the loosening of the src1_cont restriction. Could you confirm that's correct?

@jart
Contributor Author

jart commented Apr 11, 2024

Another thing worth mentioning, possibly for future iterations is that:

    template <int RM, int RN> void gemm(int m0, int m, int n0, int n) {
        int ytiles = (m - m0) / RM;
        int xtiles = (n - n0) / RN;
        int tiles = xtiles * ytiles;
        int duty = (tiles + nth - 1) / nth;
        int start = duty * ith;
        int end = start + duty;
        if (end > tiles)
            end = tiles;
        for (int job = start; job < end; ++job) {
            int ii = m0 + job / xtiles * RM;
            int jj = n0 + job % xtiles * RN;
            D Cv[RN][RM] = {0};
            for (int l = 0; l < k; l += KN)
                for (int j = 0; j < RN; ++j)
                    for (int i = 0; i < RM; ++i)
                        Cv[j][i] = madd(load(A + lda * (ii + i) + l), //
                                        load(B + ldb * (jj + j) + l), //
                                        Cv[j][i]);
            TC Cd[RN][RM];
            for (int j = 0; j < RN; ++j)
                for (int i = 0; i < RM; ++i)
                    Cd[j][i] = hsum(Cv[j][i]);
            for (int j = 0; j < RN; ++j)
                for (int i = 0; i < RM; ++i)
                    C[ldc * (jj + j) + (ii + i)] = Cd[j][i];
        }
    }

Is able to generate the handwritten kernels in the tinyBLAS class. This makes it possible to generate an mnpack() method that optimally handles all edge cases for weirdly shaped n and m values. See https://gist.github.com/jart/640231a627dfbd02fb03e23e8b01e592#file-matmul-cpp-L295-L609 for an example. The issue is that Clang takes 45 seconds to compile it. Would you want me to simplify the code so it's more abstract but potentially slower to compile?
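
The `duty` / `start` / `end` arithmetic in the snippet above splits the tile grid across threads. A minimal sketch of just that partitioning follows, with hypothetical helper names (`tile_range` and `tile_origin` do not appear in sgemm.cpp) and with the thread count `nth` and thread index `ith` passed as parameters. Unlike the original, `start` is clamped as well, so the returned range is always valid even for threads that receive no work.

```cpp
#include <algorithm>
#include <utility>

// Half-open range [start, end) of tile indices owned by thread `ith` of `nth`.
std::pair<int, int> tile_range(int tiles, int ith, int nth) {
    int duty  = (tiles + nth - 1) / nth;          // ceil(tiles / nth) tiles per thread
    int start = std::min(duty * ith, tiles);      // clamped; idle threads get an empty range
    int end   = std::min(start + duty, tiles);
    return {start, end};
}

// Maps a flat job index to the (row, column) of the tile's top-left element,
// for tiles of RM rows by RN columns laid out xtiles per row of tiles.
std::pair<int, int> tile_origin(int job, int xtiles, int RM, int RN) {
    return {job / xtiles * RM, job % xtiles * RN};
}
```

With 10 tiles and 4 threads, each thread is assigned up to 3 tiles and the last thread gets the single remaining one. Because the split is computed from `ith` alone, no synchronization is needed between threads, at the cost of some imbalance when the tile count is not a multiple of the thread count.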

Owner

@ggerganov ggerganov left a comment

> The issue is that Clang takes 45 seconds to compile it.

Not a good idea - the build time should not increase noticeably after these changes.

I did some more tests on M2 Ultra. Generally, text generation (batch size = 1) and prompt processing speed (batch size > 256) are the most important metrics to look at, but keeping an eye on the performance for small batches is also important (e.g. parallel decoding, speculative decoding, etc.).

The following command will give you the speed for various batch sizes:

./llama-bench -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf -ngl 0 -p 1,2,3,4,5,6,7,8,12,16,32,64,512 -n 0 -r 50 -t 16

These are the numbers with the llamafile SGEMM disabled:

LLAMA_NO_LLAMAFILE=1 LLAMA_NO_ACCELERATE=1 make -j llama-bench && ./llama-bench -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf -ngl 0 -p 1,2,3,4,5,6,7,8,12,16,32,64,512 -n 0 -r 50 -t 16
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 1 | 15.67 ± 0.25 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 2 | 26.14 ± 0.78 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 3 | 32.99 ± 0.29 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 4 | 37.72 ± 0.48 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 5 | 39.51 ± 0.61 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 6 | 43.78 ± 0.50 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 7 | 45.72 ± 1.26 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 8 | 47.13 ± 1.35 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 12 | 51.81 ± 0.53 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 16 | 53.54 ± 1.59 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 32 | 55.89 ± 0.46 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 64 | 57.53 ± 0.31 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 512 | 58.16 ± 0.22 |

build: 492b76d (2645)

This is the same bench with llamafile SGEMM enabled:

LLAMA_NO_ACCELERATE=1 make -j llama-bench && ./llama-bench -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf -ngl 0 -p 1,2,3,4,5,6,7,8,12,16,32,64,512 -n 0 -r 50 -t 16
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 1 | 15.48 ± 0.73 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 2 | 25.94 ± 0.59 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 3 | 32.57 ± 1.29 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 4 | 37.63 ± 0.57 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 5 | 40.86 ± 1.22 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 6 | 43.59 ± 0.75 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 7 | 45.92 ± 0.40 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 8 | 33.38 ± 0.56 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 12 | 53.02 ± 0.58 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 16 | 69.40 ± 1.32 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 32 | 78.17 ± 0.57 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 64 | 101.11 ± 0.26 |
| llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 512 | 101.94 ± 0.70 |

build: 492b76d (2645)

For BS < 8 there is no difference since the SGEMM routines are not used, but at BS = 8 the SGEMM performs worse than mainline. Maybe there's room for improvement there.

It's also a good idea before merging to run some perplexity tests with F16 and Q4_0 7B LLaMA models to verify that the numbers are within expectation:

# use ./scripts/get-wikitext-2.sh to get wiki test data

# run ppl (can take a while)
./perplexity -f wikitext-2-raw/wiki.test.raw -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf 

Comment on lines 158 to 159
-e 's/src\/sgemm\.cpp/sgemm.cpp/g' \
-e 's/src\/sgemm\.h/sgemm.h/g' \
Owner

No need to sync upstream for now

Contributor Author

Done

ggml.c Outdated
Comment on lines 10819 to 10825
if (nb10 == ggml_type_size(src1->type)) {
for (int64_t j = 0; j < ne13; j++)
for (int64_t i = 0; i < ne12; i++)
Owner

The condition should be sufficient.

Instead of i and j use i12 and i13

Contributor Author

Done

@Djip007
Contributor

Djip007 commented Apr 11, 2024

@jart your work is wonderful, and I think there is room for more optimisation. But some of it may need more control over this operator.
@ggerganov is worried about the size / maintenance of the ggml core.

But what if "TINYBLAS" is added as a backend (think of something like simd_backend...)?
I've spent the last few days trying to figure out the design of llama.cpp and its backends. It looks like if "TINYBLAS" is a backend, you can have even more control over what you can implement (choice of block size, storage architecture, etc.)

[Append]: I read this PR: #5780 (comment)
It seems that there are already discussions about how to handle rearranged tensors and the use of new backends.

@jart jart force-pushed the sgemm branch 2 times, most recently from 79705b2 to 2b83bf5 Compare April 16, 2024 02:45
@jart
Contributor Author

jart commented Apr 16, 2024

@ggerganov Since my change doesn't help much on M2, I changed it to be off by default on that platform.

#ifndef GGML_USE_LLAMAFILE
#ifdef __ARM_FEATURE_MATMUL_INT8
#define GGML_USE_LLAMAFILE 0
#else
#define GGML_USE_LLAMAFILE 1
#endif
#endif

PTAL

Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 462 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10248.29ms p(95)=27753.38ms fails=, finish reason: stop=409 truncated=53
  • Prompt processing (pp): avg=113.07tk/s p(95)=512.58tk/s
  • Token generation (tg): avg=23.87tk/s p(95)=36.95tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sgemm commit=183c4bb3656f2842a8871df25e6fb8e1abe18f3f

Charts (each titled "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 462 iterations", x-axis spanning 1713237376 --> 1713238010):
  • llamacpp:prompt_tokens_seconds
  • llamacpp:predicted_tokens_seconds
  • llamacpp:kv_cache_usage_ratio
  • llamacpp:requests_processing

Owner

@ggerganov ggerganov left a comment

> Since my change doesn't help much on M2, I changed it to be off by default on that platform.

Apart from the dip at BS=8, on my machine it does help: at BS=512 the GEMM in this PR is almost 2x faster. This is with LLAMA_NO_ACCELERATE=1 though, which disables Apple's CBLAS implementation from the Accelerate framework; for large BS that implementation remains more efficient. Anyway, we can refine this in the future.
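To make the batch-size comparison concrete, here is a minimal sketch of how GEMM throughput at a given BS could be measured. It assumes BS is the number of prompt tokens multiplied against the weights in one call, and it times a plain triple loop rather than the kernels from this PR or Accelerate. The names `gemm_flops`, `naive_gemm`, and `measure_gflops` are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Floating-point operations in an (m x k) * (k x n) product:
// one multiply and one add per inner-product term.
constexpr double gemm_flops(int m, int n, int k) {
    return 2.0 * m * n * k;
}

// Reference row-major GEMM: C (m x n) = A (m x k) * B (k x n).
static void naive_gemm(int m, int n, int k,
                       const float* A, const float* B, float* C) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int l = 0; l < k; ++l) {
                sum += A[static_cast<std::size_t>(i) * k + l] *
                       B[static_cast<std::size_t>(l) * n + j];
            }
            C[static_cast<std::size_t>(i) * n + j] = sum;
        }
    }
}

// Times one weights-by-activations product and returns GFLOP/s.
// `batch` plays the role of BS; all inputs are ones so the result is easy to check.
double measure_gflops(int m, int k, int batch) {
    std::vector<float> W(static_cast<std::size_t>(m) * k, 1.0f);
    std::vector<float> X(static_cast<std::size_t>(k) * batch, 1.0f);
    std::vector<float> Y(static_cast<std::size_t>(m) * batch, 0.0f);

    const auto t0 = std::chrono::steady_clock::now();
    naive_gemm(m, batch, k, W.data(), X.data(), Y.data());
    const auto t1 = std::chrono::steady_clock::now();

    // Each output is a sum of k ones; checking it also keeps the result live.
    assert(Y[0] == static_cast<float>(k));

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    return gemm_flops(m, batch, k) / seconds / 1e9;
}
```

Calling `measure_gflops` with `batch` set to 8 and then 512, and swapping in the kernel under test for `naive_gemm`, gives the kind of per-BS comparison described above. A single timed call is noisy, so a real benchmark would repeat the call and average.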

Regarding GGML_USE_LLAMAFILE: as it is, when I upstream the changes to the ggml repo, the build will fail because there is no sgemm.cpp there. My idea was to define GGML_USE_LLAMAFILE=1 by default in the llama.cpp Makefile and CMake (unless LLAMA_NO_LLAMAFILE is set). I can of course add GGML_USE_LLAMAFILE=0 in the ggml repo, but it's better to have this as the default for now.
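The build concern can be illustrated from the source side. The sketch below assumes the build system passes `-DGGML_USE_LLAMAFILE=1` only when sgemm.cpp is part of the build, so a tree without that file falls back to a reference path and still links. `fast_sgemm`, `reference_sgemm`, and `mul_mat` are hypothetical names, not the functions in the PR.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: the Makefile/CMake adds -DGGML_USE_LLAMAFILE=1 when sgemm.cpp
// is compiled in. With no flag, the fast path is simply left out.
#ifndef GGML_USE_LLAMAFILE
#define GGML_USE_LLAMAFILE 0
#endif

#if GGML_USE_LLAMAFILE
// Hypothetical fast path defined in another translation unit.
// Returns false when it does not support the requested shape.
bool fast_sgemm(int m, int n, int k,
                const float* A, const float* B, float* C);
#endif

// Reference row-major product: C (m x n) = A (m x k) * B (k x n).
static void reference_sgemm(int m, int n, int k,
                            const float* A, const float* B, float* C) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int l = 0; l < k; ++l) {
                sum += A[static_cast<std::size_t>(i) * k + l] *
                       B[static_cast<std::size_t>(l) * n + j];
            }
            C[static_cast<std::size_t>(i) * n + j] = sum;
        }
    }
}

// Call site: try the optional kernel first, otherwise use the reference path.
void mul_mat(int m, int n, int k,
             const float* A, const float* B, float* C) {
    assert(m > 0 && n > 0 && k > 0);
#if GGML_USE_LLAMAFILE
    if (fast_sgemm(m, n, k, A, B, C)) {
        return;
    }
#endif
    reference_sgemm(m, n, k, A, B, C);
}
```

Keeping the declaration and the call behind the same `#if` means no reference to the missing file survives preprocessing when the flag is 0.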

Contributor Author

jart commented Apr 16, 2024

I vaguely recall that when I was working in an experimental branch, the 8x3 kernel (https://twitter.com/JustineTunney/status/1776440470152867930) would make GGML go faster than Accelerate. I've been reluctant to cause too much churn here in the interest of getting this PR in. Is there anything specific you need me to change on my end before this can be merged?
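For readers unfamiliar with the naming, the sketch below shows what a register-tiled kernel generally looks like, reading "8x3" as an output tile of 8 rows by 3 columns whose partial sums stay in local accumulators for the whole inner loop. This is a scalar illustration under that assumption. The kernel in the linked tweet is vectorized and may be organized differently, and `tile_kernel` and `tiled_gemm` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Computes one RM x RN tile of C = A * B (all row-major; A is m x k,
// B is k x n). The accumulators live in a small local array so a compiler
// can keep them in registers across the whole k loop.
template <int RM, int RN>
void tile_kernel(int n, int k, const float* A, const float* B, float* C,
                 int i0, int j0) {
    float acc[RM][RN] = {};
    for (int l = 0; l < k; ++l) {
        for (int i = 0; i < RM; ++i) {
            for (int j = 0; j < RN; ++j) {
                // Outer-product update: column l of the A tile times row l of the B tile.
                acc[i][j] += A[static_cast<std::size_t>(i0 + i) * k + l] *
                             B[static_cast<std::size_t>(l) * n + (j0 + j)];
            }
        }
    }
    for (int i = 0; i < RM; ++i) {
        for (int j = 0; j < RN; ++j) {
            C[static_cast<std::size_t>(i0 + i) * n + (j0 + j)] = acc[i][j];
        }
    }
}

// Covers C with 8x3 tiles. This sketch handles only full tiles; a real
// implementation would fall back to smaller kernels at the edges.
void tiled_gemm(int m, int n, int k,
                const float* A, const float* B, float* C) {
    assert(m % 8 == 0 && n % 3 == 0);
    for (int i0 = 0; i0 < m; i0 += 8) {
        for (int j0 = 0; j0 < n; j0 += 3) {
            tile_kernel<8, 3>(n, k, A, B, C, i0, j0);
        }
    }
}
```

The point of the tile shape is reuse: each loaded value of A feeds 3 multiply-adds and each loaded value of B feeds 8, so larger tiles cut memory traffic until the accumulators no longer fit in registers.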

ggerganov merged commit 8cc91dc into ggerganov:master on Apr 16, 2024
62 checks passed
Owner

> the 8x3 kernel twitter.com/JustineTunney/status/1776440470152867930 would make GGML go faster than Accelerate

I don't think the RPi5 uses the Accelerate framework. AFAIK it's available on Apple devices and the SGEMM that comes with it runs on some sort of specialized AMX coprocessor available in Apple Silicon, which brings extra performance to the table.

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024