
Possible ideas for speeding up CPU inference with Mixtral (KV cache prioritization) #4518

Closed
@kalomaze

Description


Possible Implementations

KV Cache / Context prioritization

The KV cache layers are quite small for Mixtral at moderate context sizes (e.g. 4096 ctx), since it uses Grouped Query Attention. That may make it beneficial to offload them preferentially instead of splitting the KV cache evenly across all layers.
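To put a rough number on "quite small": a back-of-the-envelope sketch of per-layer KV cache size, using Mixtral 8x7B's published attention config (8 KV heads via GQA, head dimension 128). The figures are illustrative arithmetic, not values read out of llama.cpp.

```python
# Rough per-layer KV cache size. Mixtral 8x7B uses GQA with 8 KV heads
# and head_dim 128; these numbers are illustrative, not from llama.cpp.

def kv_bytes_per_layer(n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each store n_ctx * n_kv_heads * head_dim elements
    return 2 * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# GQA (8 KV heads) at 4096 ctx, fp16:
gqa = kv_bytes_per_layer(4096, 8, 128)    # 16 MiB per layer
# The same layer under full multi-head attention (32 KV heads):
mha = kv_bytes_per_layer(4096, 32, 128)   # 64 MiB per layer

print(gqa // 2**20, mha // 2**20)  # 16 64
```

So at 4096 ctx the whole 32-layer GQA cache is only ~512 MiB in fp16, which is why prioritizing it over weight layers looks cheap.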


I may be wrong about this, but given that the split-KV-layers PR massively improves prompt processing speed in proportion to how many layers are offloaded to the GPU, I think it would be much more viable for MoE setups to offload the smaller GQA KV layers first rather than distribute them evenly.
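The prioritization being proposed could be sketched as a greedy planner: spend the VRAM budget on per-layer KV slices first (cheap under GQA), then on layer weights. This is a hypothetical illustration of the idea, not how llama.cpp actually allocates memory, and all sizes below are assumed.

```python
# Hypothetical offload planner: fill a VRAM budget with per-layer KV cache
# slices first, then with layer weights. A sketch of the prioritization
# idea only -- llama.cpp does not expose such a planner.

def plan_offload(vram_bytes, n_layers, kv_bytes, weight_bytes):
    """Return (kv_layers_on_gpu, weight_layers_on_gpu)."""
    kv_on_gpu = 0
    while kv_on_gpu < n_layers and vram_bytes >= kv_bytes:
        vram_bytes -= kv_bytes
        kv_on_gpu += 1
    w_on_gpu = 0
    while w_on_gpu < n_layers and vram_bytes >= weight_bytes:
        vram_bytes -= weight_bytes
        w_on_gpu += 1
    return kv_on_gpu, w_on_gpu

# 8 GiB budget, 32 layers, 16 MiB KV per layer, ~780 MiB of weights per
# layer (an assumed figure for a quantized Mixtral):
print(plan_offload(8 * 2**30, 32, 16 * 2**20, 780 * 2**20))  # (32, 9)
```

Under these assumed sizes, the entire KV cache fits on the GPU for the cost of less than one weight layer, which is the crux of the argument.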

(This is an old test I did with prompt processing on a 13b, which has a much larger KV cache.)

Splitting 'Grouped Layers'

In addition, it may be most beneficial to keep as many full experts in VRAM as possible, so that the slowdown applies only to the one or two experts whose layers (or some of whose layers) reside in regular memory.

As I interpret llama.cpp's current behavior, each layer offloaded via -ngl is actually a group of hidden layers spanning all 8 Mixtral experts (my assumption is that -ngl effectively specifies "layer groups").
Wouldn't it be wiser to offload as many full experts as possible plus the KV cache, or would that be a net loss in parallelization efficiency?
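The tradeoff in the question can be made concrete with a toy comparison of the two layouts under one VRAM budget: grouped-layer offload (each offloaded layer carries all 8 experts) versus keeping whole experts resident across every layer. The per-expert size is an assumed placeholder, not a measured Mixtral figure.

```python
# Toy comparison for an 8-expert MoE: offload whole layers (all 8 experts
# per layer) vs. keep whole experts resident across all layers. The
# per-expert weight size is an assumption for illustration.

N_LAYERS, N_EXPERTS = 32, 8
expert_layer_bytes = 90 * 2**20  # one expert's FFN weights in one layer (assumed)

def layers_fit(vram_bytes):
    # grouped-layer offload: each offloaded layer carries all 8 experts
    return min(N_LAYERS, vram_bytes // (N_EXPERTS * expert_layer_bytes))

def experts_fit(vram_bytes):
    # expert-first offload: one expert's weights across all 32 layers
    return min(N_EXPERTS, vram_bytes // (N_LAYERS * expert_layer_bytes))

vram = 12 * 2**30  # 12 GiB budget
print(layers_fit(vram), experts_fit(vram))  # 17 4
```

Same budget either way; the open question from the text is whether 4 fully resident experts beat 17 fully resident layers once routing and parallelization overheads are counted.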

If most of the model's matmul work happens in VRAM, with only one or two stray experts left out, this could greatly benefit overall inference speed, since prompt processing is the number one issue currently plaguing memory-bound users and offloaded inference in general.

7b generation speeds are fast enough on pure CPU for this to make sense to me.
