Description
The performance of Falcon 7B should be comparable to LLaMA 7B, since the two computation graphs are very similar.
Here are the current numbers on M2 Ultra for LLaMA, LLaMA-v2 and Falcon 7B:
../scripts/run-all-perf.sh ${model} "f16 q8_0 q4_0"
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
LLaMA 7B mostly F16 | 12.55 GiB | 6.74 B | Metal | 999 | pp 512 | 665.95 ± 0.18 |
LLaMA 7B mostly Q8_0 | 6.64 GiB | 6.74 B | Metal | 999 | pp 512 | 630.28 ± 0.16 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 999 | pp 512 | 632.32 ± 0.22 |
LLaMA 7B mostly F16 | 12.55 GiB | 6.74 B | Metal | 999 | tg 64 | 29.73 ± 0.01 |
LLaMA 7B mostly Q8_0 | 6.64 GiB | 6.74 B | Metal | 999 | tg 64 | 61.47 ± 0.06 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 999 | tg 64 | 86.96 ± 0.08 |
build: dd0dc36 (1100)
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama2 7B mostly F16 | 12.55 GiB | 6.74 B | Metal | 999 | pp 512 | 666.12 ± 0.10 |
llama2 7B mostly Q8_0 | 6.64 GiB | 6.74 B | Metal | 999 | pp 512 | 630.21 ± 0.20 |
llama2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 999 | pp 512 | 632.32 ± 0.17 |
llama2 7B mostly F16 | 12.55 GiB | 6.74 B | Metal | 999 | tg 64 | 29.74 ± 0.02 |
llama2 7B mostly Q8_0 | 6.64 GiB | 6.74 B | Metal | 999 | tg 64 | 61.55 ± 0.04 |
llama2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 999 | tg 64 | 86.88 ± 0.08 |
build: dd0dc36 (1100)
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
Falcon 7B mostly F16 | 13.44 GiB | 7.22 B | Metal | 999 | pp 512 | 403.68 ± 1.27 |
Falcon 7B mostly Q8_0 | 7.14 GiB | 7.22 B | Metal | 999 | pp 512 | 390.41 ± 1.77 |
Falcon 7B mostly Q4_0 | 3.92 GiB | 7.22 B | Metal | 999 | pp 512 | 390.94 ± 1.75 |
Falcon 7B mostly F16 | 13.44 GiB | 7.22 B | Metal | 999 | tg 64 | 29.47 ± 0.01 |
Falcon 7B mostly Q8_0 | 7.14 GiB | 7.22 B | Metal | 999 | tg 64 | 60.01 ± 0.05 |
Falcon 7B mostly Q4_0 | 3.92 GiB | 7.22 B | Metal | 999 | tg 64 | 86.07 ± 0.02 |
build: 611363a (1110)
Although the text generation (tg) speed for Falcon is comparable to LLaMA, I observe a significant performance drop in prompt processing (pp): roughly 390-404 t/s versus 630-666 t/s, i.e. about 40% lower. This is on M2 Ultra with Metal, but I think the last time I checked, CUDA performance showed a similar drop.
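To put a number on the drop, the Falcon/LLaMA throughput ratios can be computed directly from the tables above. This is a quick sketch in Python: the t/s means are copied from the M2 Ultra results (LLaMA-v2 is left out because it matches LLaMA within noise), and the names `metal_tps` and `falcon_ratio` are only illustrative.

```python
# Mean t/s copied from the M2 Ultra tables above, as (LLaMA 7B, Falcon 7B).
metal_tps = {
    ("pp 512", "F16"): (665.95, 403.68),
    ("pp 512", "Q8_0"): (630.28, 390.41),
    ("pp 512", "Q4_0"): (632.32, 390.94),
    ("tg 64", "F16"): (29.73, 29.47),
    ("tg 64", "Q8_0"): (61.47, 60.01),
    ("tg 64", "Q4_0"): (86.96, 86.07),
}


def falcon_ratio(llama_tps, falcon_tps):
    """Falcon throughput as a fraction of LLaMA throughput."""
    return falcon_tps / llama_tps


metal_ratios = {key: falcon_ratio(*tps) for key, tps in metal_tps.items()}

for (test, quant), ratio in metal_ratios.items():
    print(f"{test:7s} {quant:5s} Falcon/LLaMA = {ratio:.3f}")
```

Prompt processing comes out at roughly 0.61-0.62 of the LLaMA speed for all three types, while text generation stays at about 0.98-0.99.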
Hypothesis
I haven't profiled the run yet, but I suspect the cause is the concatenated `QKV` matrix multiplication (MM).
For some reason, this is probably slower than what we do in LLaMA, where the `QKV` tensor is separated into 3 individual `Q`, `K` and `V` tensors.
We should either speed up the current `QKV` implementation, or change the convert script to output:
- `LLM_TENSOR_ATTN_Q`
- `LLM_TENSOR_ATTN_K`
- `LLM_TENSOR_ATTN_V`

instead of:
- `LLM_TENSOR_ATTN_QKV`
My intuition is that if this is indeed the reason for the slow-down in Falcon, the combined `QKV` approach, if optimized correctly, should yield better performance than the separate one, since we do a single MM instead of 3 separate MMs. So we should also consider switching the LLaMA graph to the combined approach if this turns out to be the case, we are able to optimize it, and the improvement is significant.
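For reference, the two formulations are mathematically equivalent, so any difference is down to how the kernels handle the shapes. Here is a minimal sketch in plain Python of what splitting the fused tensor in the convert script could look like. It assumes the fused weight stores its rows as all Q heads, then the K heads, then the V heads; that layout is an assumption that would need to be checked against the actual checkpoint. `split_qkv` and `matmul` are hypothetical helpers, not functions from the convert script, and the sizes are toy values.

```python
# Toy sizes only; the real model dimensions are much larger.
N_HEAD, N_HEAD_KV, HEAD_DIM, N_EMBD, N_TOKENS = 4, 1, 2, 8, 3


def matmul(w, x):
    """Multiply weight w (rows x n_embd) by activations x (n_embd x n_tokens)."""
    return [[sum(wi * xi for wi, xi in zip(row, col)) for col in zip(*x)]
            for row in w]


def split_qkv(w_qkv, n_head, n_head_kv, head_dim):
    """Split a fused weight into Q, K and V row blocks.

    ASSUMPTION: rows are laid out as [all Q heads | K heads | V heads].
    """
    q_rows = n_head * head_dim
    kv_rows = n_head_kv * head_dim
    assert len(w_qkv) == q_rows + 2 * kv_rows, "unexpected fused QKV shape"
    w_q = w_qkv[:q_rows]
    w_k = w_qkv[q_rows:q_rows + kv_rows]
    w_v = w_qkv[q_rows + kv_rows:]
    return w_q, w_k, w_v


# Deterministic integer test data, so the comparison below is exact.
n_rows = (N_HEAD + 2 * N_HEAD_KV) * HEAD_DIM
w_qkv = [[(r * N_EMBD + c) % 7 - 3 for c in range(N_EMBD)] for r in range(n_rows)]
x = [[(i + 2 * t) % 5 - 2 for t in range(N_TOKENS)] for i in range(N_EMBD)]

# One fused MM ...
fused = matmul(w_qkv, x)

# ... versus three separate MMs on the split weights, stacked row-wise.
w_q, w_k, w_v = split_qkv(w_qkv, N_HEAD, N_HEAD_KV, HEAD_DIM)
separate = matmul(w_q, x) + matmul(w_k, x) + matmul(w_v, x)

results_match = fused == separate
print("fused and separate results match:", results_match)
```

With multi-query attention the K and V blocks are much narrower than the Q block, so the three separate MMs have quite different shapes. Whether that helps or hurts on a given backend is exactly what profiling would need to show.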
Edit: some more results with CUDA on RTX 4080
../scripts/run-all-perf.sh ${model} "f16 q8_0 q4_0" "-ngl 999 -t 1 -n 64"
model | size | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|---|
llama2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2122.08 ± 0.58 |
llama2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 3343.20 ± 9.31 |
llama2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 3439.35 ± 10.13 |
llama2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | tg 64 | 45.95 ± 0.01 |
llama2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | tg 64 | 77.85 ± 0.01 |
llama2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | tg 64 | 130.43 ± 0.01 |
build: 611363a (1110)
model | size | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|---|
Falcon 7B mostly F16 | 13.44 GiB | 7.22 B | CUDA | 999 | 1 | pp 512 | 1885.04 ± 2.29 |
Falcon 7B mostly Q8_0 | 7.14 GiB | 7.22 B | CUDA | 999 | 1 | pp 512 | 2849.60 ± 5.96 |
Falcon 7B mostly Q4_0 | 3.92 GiB | 7.22 B | CUDA | 999 | 1 | pp 512 | 2754.78 ± 6.54 |
Falcon 7B mostly F16 | 13.44 GiB | 7.22 B | CUDA | 999 | 1 | tg 64 | 36.07 ± 0.01 |
Falcon 7B mostly Q8_0 | 7.14 GiB | 7.22 B | CUDA | 999 | 1 | tg 64 | 54.68 ± 0.03 |
Falcon 7B mostly Q4_0 | 3.92 GiB | 7.22 B | CUDA | 999 | 1 | tg 64 | 75.76 ± 0.01 |
build: 611363a (1110)
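The same ratio computation on the RTX 4080 numbers, again as an illustrative Python sketch with the t/s means copied from the two CUDA tables above (`cuda_tps` is just a name chosen here):

```python
# Mean t/s copied from the RTX 4080 tables above, as (llama2 7B, Falcon 7B).
cuda_tps = {
    ("pp 512", "F16"): (2122.08, 1885.04),
    ("pp 512", "Q8_0"): (3343.20, 2849.60),
    ("pp 512", "Q4_0"): (3439.35, 2754.78),
    ("tg 64", "F16"): (45.95, 36.07),
    ("tg 64", "Q8_0"): (77.85, 54.68),
    ("tg 64", "Q4_0"): (130.43, 75.76),
}

# Falcon throughput as a fraction of llama2 throughput.
cuda_ratios = {key: falcon / llama for key, (llama, falcon) in cuda_tps.items()}

# Print the worst cases first.
for (test, quant), ratio in sorted(cuda_ratios.items(), key=lambda kv: kv[1]):
    print(f"{test:7s} {quant:5s} Falcon/llama2 = {ratio:.3f}")
```

On CUDA the prompt processing ratio is about 0.80-0.89, and unlike on Metal, text generation also drops, to about 0.58-0.79 of the llama2 speed.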