
falcon : speed-up prompt processing #2850

Closed
@ggerganov

Description


The performance of Falcon 7B should be comparable to LLaMA 7B, since their computation graphs are very similar.

Here are the current numbers on M2 Ultra for LLaMA, LLaMA-v2 and Falcon 7B:

../scripts/run-all-perf.sh ${model} "f16 q8_0 q4_0"

| model                | size      | params | backend | ngl | test   | t/s           |
| -------------------- | --------- | ------ | ------- | --- | ------ | ------------- |
| LLaMA 7B mostly F16  | 12.55 GiB | 6.74 B | Metal   | 999 | pp 512 | 665.95 ± 0.18 |
| LLaMA 7B mostly Q8_0 |  6.64 GiB | 6.74 B | Metal   | 999 | pp 512 | 630.28 ± 0.16 |
| LLaMA 7B mostly Q4_0 |  3.56 GiB | 6.74 B | Metal   | 999 | pp 512 | 632.32 ± 0.22 |
| LLaMA 7B mostly F16  | 12.55 GiB | 6.74 B | Metal   | 999 | tg 64  |  29.73 ± 0.01 |
| LLaMA 7B mostly Q8_0 |  6.64 GiB | 6.74 B | Metal   | 999 | tg 64  |  61.47 ± 0.06 |
| LLaMA 7B mostly Q4_0 |  3.56 GiB | 6.74 B | Metal   | 999 | tg 64  |  86.96 ± 0.08 |

build: dd0dc36 (1100)

| model                 | size      | params | backend | ngl | test   | t/s           |
| --------------------- | --------- | ------ | ------- | --- | ------ | ------------- |
| llama2 7B mostly F16  | 12.55 GiB | 6.74 B | Metal   | 999 | pp 512 | 666.12 ± 0.10 |
| llama2 7B mostly Q8_0 |  6.64 GiB | 6.74 B | Metal   | 999 | pp 512 | 630.21 ± 0.20 |
| llama2 7B mostly Q4_0 |  3.56 GiB | 6.74 B | Metal   | 999 | pp 512 | 632.32 ± 0.17 |
| llama2 7B mostly F16  | 12.55 GiB | 6.74 B | Metal   | 999 | tg 64  |  29.74 ± 0.02 |
| llama2 7B mostly Q8_0 |  6.64 GiB | 6.74 B | Metal   | 999 | tg 64  |  61.55 ± 0.04 |
| llama2 7B mostly Q4_0 |  3.56 GiB | 6.74 B | Metal   | 999 | tg 64  |  86.88 ± 0.08 |

build: dd0dc36 (1100)

| model                 | size      | params | backend | ngl | test   | t/s           |
| --------------------- | --------- | ------ | ------- | --- | ------ | ------------- |
| Falcon 7B mostly F16  | 13.44 GiB | 7.22 B | Metal   | 999 | pp 512 | 403.68 ± 1.27 |
| Falcon 7B mostly Q8_0 |  7.14 GiB | 7.22 B | Metal   | 999 | pp 512 | 390.41 ± 1.77 |
| Falcon 7B mostly Q4_0 |  3.92 GiB | 7.22 B | Metal   | 999 | pp 512 | 390.94 ± 1.75 |
| Falcon 7B mostly F16  | 13.44 GiB | 7.22 B | Metal   | 999 | tg 64  |  29.47 ± 0.01 |
| Falcon 7B mostly Q8_0 |  7.14 GiB | 7.22 B | Metal   | 999 | tg 64  |  60.01 ± 0.05 |
| Falcon 7B mostly Q4_0 |  3.92 GiB | 7.22 B | Metal   | 999 | tg 64  |  86.07 ± 0.02 |

build: 611363a (1110)

Although the text generation speed of Falcon is comparable to LLaMA, there is a significant performance drop in prompt processing. These numbers are from M2 Ultra with Metal, but last time I checked, CUDA showed a similar drop.

Hypothesis

I haven't profiled the run yet, but I suspect the cause is in the concatenated QKV matrix multiplication (MM):

https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/llama.cpp#L2634-L2638

For some reason, this appears to be slower than what we do in LLaMA, where the QKV tensor is separated into 3 individual Q, K and V tensors:

https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/llama.cpp#L2292-L2306
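
To be clear, the two graph variants compute the same result: a single MM against the concatenated weight is mathematically identical to three separate MMs followed by a split of the output. A toy numpy sketch of this equivalence (illustrative shapes only, using a simple row-wise [Q; K; V] concatenation rather than Falcon's real multi-query head layout):

```python
import numpy as np

rng = np.random.default_rng(0)
n_embd, n_tokens = 64, 8

x  = rng.standard_normal((n_tokens, n_embd)).astype(np.float32)
wq = rng.standard_normal((n_embd, n_embd)).astype(np.float32)
wk = rng.standard_normal((n_embd, n_embd)).astype(np.float32)
wv = rng.standard_normal((n_embd, n_embd)).astype(np.float32)

# three separate projections (LLaMA-style graph)
q, k, v = x @ wq, x @ wk, x @ wv

# one fused projection (Falcon-style graph), followed by a split
wqkv = np.concatenate([wq, wk, wv], axis=1)   # (n_embd, 3 * n_embd)
q2, k2, v2 = np.split(x @ wqkv, 3, axis=1)

assert np.allclose(q, q2, atol=1e-4)
assert np.allclose(k, k2, atol=1e-4)
assert np.allclose(v, v2, atol=1e-4)
```

So the choice between the two is purely a performance question, not a correctness one.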

We should either speed up the current QKV implementation, or change the convert script to output:

  • LLM_TENSOR_ATTN_Q
  • LLM_TENSOR_ATTN_K
  • LLM_TENSOR_ATTN_V

instead of:

  • LLM_TENSOR_ATTN_QKV

https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/convert-falcon-hf-to-gguf.py#L220-L239
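
The convert-script change would amount to splitting the fused tensor along its output dimension before writing the three tensors. A rough sketch with a hypothetical helper, assuming the simple case of a plain row-wise [Q; K; V] concatenation — Falcon's actual multi-query checkpoints interleave the K/V slices per head group, so a real converter would need the head layout, not a flat 3-way split:

```python
import numpy as np

def split_qkv(wqkv: np.ndarray, n_embd: int):
    """Split a fused QKV weight of shape (3 * n_embd, n_embd) into Q, K, V.

    Hypothetical helper for illustration: it assumes plain row-wise
    concatenation. Falcon's multi-query layout stores interleaved
    per-head-group slices, so the real split is more involved.
    """
    assert wqkv.shape[0] == 3 * n_embd
    wq = wqkv[:n_embd]
    wk = wqkv[n_embd:2 * n_embd]
    wv = wqkv[2 * n_embd:]
    return wq, wk, wv

# toy (3 * 2, 2) tensor standing in for a fused QKV weight
wqkv = np.arange(12, dtype=np.float32).reshape(6, 2)
wq, wk, wv = split_qkv(wqkv, 2)
```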

My intuition is that if this is indeed the reason for the slow-down in Falcon, then a correctly optimized combined QKV approach should outperform the separated one, since we do a single MM instead of 3 separate MMs. If that turns out to be the case, we are able to optimize it, and the improvement is significant, we should also consider switching the LLaMA graph to the combined approach.

Edit: some more results with CUDA on RTX 4080

../scripts/run-all-perf.sh ${model} "f16 q8_0 q4_0" "-ngl 999 -t 1 -n 64"

| model                 | size      | params | backend | ngl | threads | test   | t/s             |
| --------------------- | --------- | ------ | ------- | --- | ------- | ------ | --------------- |
| llama2 7B mostly F16  | 12.55 GiB | 6.74 B | CUDA    | 999 | 1       | pp 512 | 2122.08 ± 0.58  |
| llama2 7B mostly Q8_0 |  6.67 GiB | 6.74 B | CUDA    | 999 | 1       | pp 512 | 3343.20 ± 9.31  |
| llama2 7B mostly Q4_0 |  3.56 GiB | 6.74 B | CUDA    | 999 | 1       | pp 512 | 3439.35 ± 10.13 |
| llama2 7B mostly F16  | 12.55 GiB | 6.74 B | CUDA    | 999 | 1       | tg 64  |   45.95 ± 0.01  |
| llama2 7B mostly Q8_0 |  6.67 GiB | 6.74 B | CUDA    | 999 | 1       | tg 64  |   77.85 ± 0.01  |
| llama2 7B mostly Q4_0 |  3.56 GiB | 6.74 B | CUDA    | 999 | 1       | tg 64  |  130.43 ± 0.01  |

build: 611363a (1110)

| model                 | size      | params | backend | ngl | threads | test   | t/s            |
| --------------------- | --------- | ------ | ------- | --- | ------- | ------ | -------------- |
| Falcon 7B mostly F16  | 13.44 GiB | 7.22 B | CUDA    | 999 | 1       | pp 512 | 1885.04 ± 2.29 |
| Falcon 7B mostly Q8_0 |  7.14 GiB | 7.22 B | CUDA    | 999 | 1       | pp 512 | 2849.60 ± 5.96 |
| Falcon 7B mostly Q4_0 |  3.92 GiB | 7.22 B | CUDA    | 999 | 1       | pp 512 | 2754.78 ± 6.54 |
| Falcon 7B mostly F16  | 13.44 GiB | 7.22 B | CUDA    | 999 | 1       | tg 64  |   36.07 ± 0.01 |
| Falcon 7B mostly Q8_0 |  7.14 GiB | 7.22 B | CUDA    | 999 | 1       | tg 64  |   54.68 ± 0.03 |
| Falcon 7B mostly Q4_0 |  3.92 GiB | 7.22 B | CUDA    | 999 | 1       | tg 64  |   75.76 ± 0.01 |

build: 611363a (1110)
