demo : per-layer KV / partial offloading of KV cache #3457

slaren · 2023-10-03T17:38:11Z

Currently, the entire KV cache is allocated as a single tensor for all the layers. As a consequence, the KV cache is either fully on the CPU, or fully offloaded to the GPU.

With this change, the KV cache is allocated on a different tensor per layer. The result is more granular control over the parts of the KV offloaded to the GPU.

In this demo, when partially offloading a model, the KV cache corresponding to the offloaded layers is also offloaded. This increases performance at the expense of more VRAM.

Is it worth it compared to just offloading more layers? I am not sure, but probably wouldn't hurt to have more flexibility.

Note: only implemented for llama models. CUDA only.

Edit: removed a few unnecessary copies that caused performance to degrade.

Llama2 70B on a single 24 GB GPU:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

model	size	params	ngl	test	master t/s	PR t/s	speedup
70B mostly Q2_K	27.27 GiB	68.98 B	60	pp 512	83.10 ± 0.26	115.29 ± 0.33	1.39
70B mostly Q2_K	27.27 GiB	68.98 B	60	tg 128	4.21 ± 0.05	4.78 ± 0.05	1.14
70B mostly Q2_K	27.27 GiB	68.98 B	61	pp 512	84.29 ± 0.49	118.47 ± 0.19	1.41
70B mostly Q2_K	27.27 GiB	68.98 B	61	tg 128	4.35 ± 0.03	5.01 ± 0.04	1.15
70B mostly Q2_K	27.27 GiB	68.98 B	62	pp 512	85.28 ± 0.23	121.99 ± 0.38	1.43
70B mostly Q2_K	27.27 GiB	68.98 B	62	tg 128	4.47 ± 0.05	5.67 ± 0.15	1.27
70B mostly Q2_K	27.27 GiB	68.98 B	63	pp 512	86.71 ± 0.23	125.13 ± 0.21	1.44
70B mostly Q2_K	27.27 GiB	68.98 B	63	tg 128	4.61 ± 0.03	6.14 ± 0.01	1.33
70B mostly Q2_K	27.27 GiB	68.98 B	64	pp 512	87.99 ± 0.30	-	1.42 (63)
70B mostly Q2_K	27.27 GiB	68.98 B	64	tg 128	4.74 ± 0.04	-	1.30 (63)
70B mostly Q2_K	27.27 GiB	68.98 B	65	pp 512	89.19 ± 0.23	-	1.40 (63)
70B mostly Q2_K	27.27 GiB	68.98 B	65	tg 128	5.00 ± 0.05	-	1.23 (63)

v1

ggerganov · 2023-10-03T19:09:53Z

Regardless of the performance effects, this is a good change since it makes the KV cache addressing more intuitive

Dampfinchen · 2023-10-10T16:09:35Z

Definately worth it to set fewer layers but get higher prompt processing speed out of it.

oobabooga · 2023-11-16T20:44:42Z

Could this PR, when combined with the performance gains in #3776, allow 70b models in q4_K_M / q4_K_S precision to run on a 3090 at more than 1-2 tokens/second?

ggerganov · 2023-12-03T13:51:35Z

I will try to update this PR to latest master and merge

slaren · 2023-12-03T13:57:05Z

Ok, some notes:

The reason I didn't continue working on this is because of the two copies of KQ_mask and KQ_pos for CPU and GPU. I was hoping to do this automatically with ggml-backend, but after your graph building refactoring it may be possible to do it cleanly now.
The KV cache can get quite big, so it should still be possible to choose to not offload the KV cache for low VRAM situations

ggerganov · 2023-12-03T16:00:49Z

Will leave this PR intact for reference. Opened a new PR: #4309

@oobabooga and anyone else who is interested - would be nice to run some tests with #4309 to make sure it works as expected

per-layer KV

e9bcf66

remove unnecessary copies

55f2f2f

slaren changed the title ~~demo: per-layer KV~~ demo : per-layer KV / partial offloading of KV cache Oct 4, 2023

less code duplication, offload k and v separately

f4f9367

slaren added the demo Demonstrate some concept or idea, not intended to be merged label Oct 11, 2023

ggerganov mentioned this pull request Oct 28, 2023

cuda : improve text-generation and batched decoding performance #3776

Merged

6 tasks

BarfingLemurs mentioned this pull request Nov 30, 2023

Context splitting #4269

Closed

ggerganov self-assigned this Dec 3, 2023

ggerganov mentioned this pull request Dec 3, 2023

llama : per-layer KV cache #4309

Merged

4 tasks

slaren closed this Dec 7, 2023

slaren deleted the per-layer-kv branch December 7, 2023 03:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

demo : per-layer KV / partial offloading of KV cache #3457

demo : per-layer KV / partial offloading of KV cache #3457

slaren commented Oct 3, 2023 •

edited

Loading

ggerganov commented Oct 3, 2023

Dampfinchen commented Oct 10, 2023

oobabooga commented Nov 16, 2023

ggerganov commented Dec 3, 2023

slaren commented Dec 3, 2023

ggerganov commented Dec 3, 2023

demo : per-layer KV / partial offloading of KV cache #3457

demo : per-layer KV / partial offloading of KV cache #3457

Conversation

slaren commented Oct 3, 2023 • edited Loading

ggerganov commented Oct 3, 2023

Dampfinchen commented Oct 10, 2023

oobabooga commented Nov 16, 2023

ggerganov commented Dec 3, 2023

slaren commented Dec 3, 2023

ggerganov commented Dec 3, 2023

slaren commented Oct 3, 2023 •

edited

Loading