falcon : speed-up prompt processing

The performance of Falcon 7B should be comparable to LLaMA 7B since the two computation graphs are computationally very similar.

Here are the current numbers on M2 Ultra for LLaMA, LLaMA-v2 and Falcon 7B:

```bash
../scripts/run-all-perf.sh ${model} "f16 q8_0 q4_0"
```

| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| LLaMA 7B mostly F16            |  12.55 GiB |     6.74 B | Metal      | 999 | pp 512     |    665.95 ± 0.18 |
| LLaMA 7B mostly Q8_0           |   6.64 GiB |     6.74 B | Metal      | 999 | pp 512     |    630.28 ± 0.16 |
| LLaMA 7B mostly Q4_0           |   3.56 GiB |     6.74 B | Metal      | 999 | pp 512     |    632.32 ± 0.22 |
| LLaMA 7B mostly F16            |  12.55 GiB |     6.74 B | Metal      | 999 | tg 64      |     29.73 ± 0.01 |
| LLaMA 7B mostly Q8_0           |   6.64 GiB |     6.74 B | Metal      | 999 | tg 64      |     61.47 ± 0.06 |
| LLaMA 7B mostly Q4_0           |   3.56 GiB |     6.74 B | Metal      | 999 | tg 64      |     86.96 ± 0.08 |

build: dd0dc36 (1100)

| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama2 7B mostly F16           |  12.55 GiB |     6.74 B | Metal      | 999 | pp 512     |    666.12 ± 0.10 |
| llama2 7B mostly Q8_0          |   6.64 GiB |     6.74 B | Metal      | 999 | pp 512     |    630.21 ± 0.20 |
| llama2 7B mostly Q4_0          |   3.56 GiB |     6.74 B | Metal      | 999 | pp 512     |    632.32 ± 0.17 |
| llama2 7B mostly F16           |  12.55 GiB |     6.74 B | Metal      | 999 | tg 64      |     29.74 ± 0.02 |
| llama2 7B mostly Q8_0          |   6.64 GiB |     6.74 B | Metal      | 999 | tg 64      |     61.55 ± 0.04 |
| llama2 7B mostly Q4_0          |   3.56 GiB |     6.74 B | Metal      | 999 | tg 64      |     86.88 ± 0.08 |

build: dd0dc36 (1100)

| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| Falcon 7B mostly F16           |  13.44 GiB |     7.22 B | Metal      | 999 | pp 512     |    403.68 ± 1.27 |
| Falcon 7B mostly Q8_0          |   7.14 GiB |     7.22 B | Metal      | 999 | pp 512     |    390.41 ± 1.77 |
| Falcon 7B mostly Q4_0          |   3.92 GiB |     7.22 B | Metal      | 999 | pp 512     |    390.94 ± 1.75 |
| Falcon 7B mostly F16           |  13.44 GiB |     7.22 B | Metal      | 999 | tg 64      |     29.47 ± 0.01 |
| Falcon 7B mostly Q8_0          |   7.14 GiB |     7.22 B | Metal      | 999 | tg 64      |     60.01 ± 0.05 |
| Falcon 7B mostly Q4_0          |   3.92 GiB |     7.22 B | Metal      | 999 | tg 64      |     86.07 ± 0.02 |

build: 611363a (1110)

Although the text generation (`tg 64`) speed for Falcon is comparable to LLaMA, I observe a significant performance drop in prompt processing (`pp 512`): roughly 390 to 404 t/s versus 630 to 666 t/s, i.e. about 40% lower. This is on M2 Ultra with Metal, but as far as I recall from the last time I checked, CUDA performance shows a similar drop.
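To put the gap in relative terms, here is a small sketch that divides the Falcon numbers by the LLaMA numbers from the Metal tables above. The dictionary and function names are made up for illustration; only the t/s values come from the tables.

```python
# Throughput in tokens/s, copied from the Metal tables above (M2 Ultra).
llama = {
    "pp 512": {"F16": 665.95, "Q8_0": 630.28, "Q4_0": 632.32},
    "tg 64": {"F16": 29.73, "Q8_0": 61.47, "Q4_0": 86.96},
}
falcon = {
    "pp 512": {"F16": 403.68, "Q8_0": 390.41, "Q4_0": 390.94},
    "tg 64": {"F16": 29.47, "Q8_0": 60.01, "Q4_0": 86.07},
}


def relative_throughput(baseline: float, candidate: float) -> float:
    """Return the candidate throughput as a fraction of the baseline."""
    return candidate / baseline


# ratios[test][quant] = Falcon t/s divided by LLaMA t/s
ratios = {
    test: {
        quant: relative_throughput(llama[test][quant], falcon[test][quant])
        for quant in llama[test]
    }
    for test in llama
}

for test, per_quant in ratios.items():
    for quant, ratio in per_quant.items():
        print(f"{test:7s} {quant:5s} Falcon runs at {ratio:.1%} of LLaMA")
```

The `tg 64` ratios stay close to 1, while the `pp 512` ratios sit near 0.6 for all three quantizations, which is what points at something specific to batched prompt processing.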

### Hypothesis

I haven't profiled the run yet, but I suspect the cause is in the concatenated `QKV` matrix multiplication (MM):

https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/llama.cpp#L2634-L2638

For a reason I have not identified yet, this is probably slower than what we do in LLaMA, where the `QKV` tensor is split into 3 individual `Q`, `K` and `V` tensors:

https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/llama.cpp#L2292-L2306

We should either speed up the current `QKV` implementation, or change the convert script to output:

- `LLM_TENSOR_ATTN_Q`
- `LLM_TENSOR_ATTN_K`
- `LLM_TENSOR_ATTN_V`

instead of:

- `LLM_TENSOR_ATTN_QKV`

https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/convert-falcon-hf-to-gguf.py#L220-L239

My intuition is that, if this is indeed the reason for the slow-down in Falcon, the combined `QKV` approach should, once properly optimized, outperform the split approach, since it performs a single MM instead of 3 separate MMs. If that turns out to be the case and the improvement is significant, we should also consider switching the LLaMA graph to the combined form.
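The two variants are mathematically equivalent, which is why the difference should come down purely to kernel efficiency. Below is a minimal pure-Python sketch of that equivalence. It is not the ggml code: the toy sizes, the helper `matmul`, and the assumption that the fused weight stores all `Q` rows first, then `K`, then `V`, are all illustrative.

```python
import random


def matmul(w, x):
    """Multiply a (rows x n_embd) weight by an (n_embd x n_tokens) activation."""
    cols = list(zip(*x))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in w]


# Toy sizes for illustration only (not Falcon's real dimensions).
n_embd, n_tokens = 8, 4
n_q_rows, n_kv_rows = 8, 2  # K and V narrower than Q, as in multi-query attention

rng = random.Random(0)
w_qkv = [[rng.uniform(-1, 1) for _ in range(n_embd)]
         for _ in range(n_q_rows + 2 * n_kv_rows)]
x = [[rng.uniform(-1, 1) for _ in range(n_tokens)] for _ in range(n_embd)]

# Fused variant: one MM, then slice the result rows into Q, K and V.
# Assumed layout: Q rows first, then K rows, then V rows.
qkv = matmul(w_qkv, x)
q_fused = qkv[:n_q_rows]
k_fused = qkv[n_q_rows:n_q_rows + n_kv_rows]
v_fused = qkv[n_q_rows + n_kv_rows:]

# Split variant: slice the weight rows first, then run three separate MMs.
w_q = w_qkv[:n_q_rows]
w_k = w_qkv[n_q_rows:n_q_rows + n_kv_rows]
w_v = w_qkv[n_q_rows + n_kv_rows:]
q_split, k_split, v_split = matmul(w_q, x), matmul(w_k, x), matmul(w_v, x)

# Both variants perform the same number of multiply-adds.
flops_fused = 2 * len(w_qkv) * n_embd * n_tokens
flops_split = sum(2 * len(w) * n_embd * n_tokens for w in (w_q, w_k, w_v))

print("same outputs:", (q_fused, k_fused, v_fused) == (q_split, k_split, v_split))
print("same FLOPs:  ", flops_fused == flops_split)
```

Since the outputs and the arithmetic cost match, any throughput difference between the two graphs would have to come from how the backend kernels handle the matrix shapes and the subsequent slicing, which is what profiling should confirm or rule out.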

Edit: some more results with CUDA on RTX 4080

```bash
../scripts/run-all-perf.sh ${model} "f16 q8_0 q4_0" "-ngl 999 -t 1 -n 64"
```

| model                          |       size |     params | backend    | ngl |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama2 7B mostly F16           |  12.55 GiB |     6.74 B | CUDA       | 999 |          1 | pp 512     |   2122.08 ± 0.58 |
| llama2 7B mostly Q8_0          |   6.67 GiB |     6.74 B | CUDA       | 999 |          1 | pp 512     |   3343.20 ± 9.31 |
| llama2 7B mostly Q4_0          |   3.56 GiB |     6.74 B | CUDA       | 999 |          1 | pp 512     |  3439.35 ± 10.13 |
| llama2 7B mostly F16           |  12.55 GiB |     6.74 B | CUDA       | 999 |          1 | tg 64      |     45.95 ± 0.01 |
| llama2 7B mostly Q8_0          |   6.67 GiB |     6.74 B | CUDA       | 999 |          1 | tg 64      |     77.85 ± 0.01 |
| llama2 7B mostly Q4_0          |   3.56 GiB |     6.74 B | CUDA       | 999 |          1 | tg 64      |    130.43 ± 0.01 |

build: 611363a (1110)

| model                          |       size |     params | backend    | ngl |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| Falcon 7B mostly F16           |  13.44 GiB |     7.22 B | CUDA       | 999 |          1 | pp 512     |   1885.04 ± 2.29 |
| Falcon 7B mostly Q8_0          |   7.14 GiB |     7.22 B | CUDA       | 999 |          1 | pp 512     |   2849.60 ± 5.96 |
| Falcon 7B mostly Q4_0          |   3.92 GiB |     7.22 B | CUDA       | 999 |          1 | pp 512     |   2754.78 ± 6.54 |
| Falcon 7B mostly F16           |  13.44 GiB |     7.22 B | CUDA       | 999 |          1 | tg 64      |     36.07 ± 0.01 |
| Falcon 7B mostly Q8_0          |   7.14 GiB |     7.22 B | CUDA       | 999 |          1 | tg 64      |     54.68 ± 0.03 |
| Falcon 7B mostly Q4_0          |   3.92 GiB |     7.22 B | CUDA       | 999 |          1 | tg 64      |     75.76 ± 0.01 |

build: 611363a (1110)

