
Possible ideas for speeding up CPU inference with Mixtral (KV cache prioritization) #4518

Closed
@kalomaze

Description


Possible Implementations

KV Cache / Context prioritization

The KV cache layers are quite small for Mixtral at moderate context sizes (e.g. 4096 ctx), since it uses Grouped Query Attention. That may make it beneficial to offload them preferentially instead of splitting the KV cache evenly across all layers.
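To put a rough number on "quite small": a back-of-the-envelope sketch of per-layer KV cache size, using Mixtral 8x7B's published attention config (8 KV heads via GQA, head dimension 128). The figures are illustrative arithmetic, not values read out of llama.cpp.

```python
# Rough per-layer KV cache size. Mixtral 8x7B uses GQA with 8 KV heads
# and head_dim 128; these numbers are illustrative, not from llama.cpp.

def kv_bytes_per_layer(n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each store n_ctx * n_kv_heads * head_dim elements
    return 2 * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# GQA (8 KV heads) at 4096 ctx, fp16:
gqa = kv_bytes_per_layer(4096, 8, 128)    # 16 MiB per layer
# The same layer under full multi-head attention (32 KV heads):
mha = kv_bytes_per_layer(4096, 32, 128)   # 64 MiB per layer

print(gqa // 2**20, mha // 2**20)  # 16 64
```

So at 4096 ctx the whole 32-layer GQA cache is only ~512 MiB in fp16, which is why prioritizing it over weight layers looks cheap.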


I may be wrong about this, but given that the split-KV-layers PR massively improves prompt processing speed in proportion to how many layers are offloaded to the GPU, I think it would be much more viable for MoE setups to offload the smaller GQA KV layers first rather than distribute them evenly.
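The prioritization being proposed could be sketched as a greedy planner: spend the VRAM budget on per-layer KV slices first (cheap under GQA), then on layer weights. This is a hypothetical illustration of the idea, not how llama.cpp actually allocates memory, and all sizes below are assumed.

```python
# Hypothetical offload planner: fill a VRAM budget with per-layer KV cache
# slices first, then with layer weights. A sketch of the prioritization
# idea only -- llama.cpp does not expose such a planner.

def plan_offload(vram_bytes, n_layers, kv_bytes, weight_bytes):
    """Return (kv_layers_on_gpu, weight_layers_on_gpu)."""
    kv_on_gpu = 0
    while kv_on_gpu < n_layers and vram_bytes >= kv_bytes:
        vram_bytes -= kv_bytes
        kv_on_gpu += 1
    w_on_gpu = 0
    while w_on_gpu < n_layers and vram_bytes >= weight_bytes:
        vram_bytes -= weight_bytes
        w_on_gpu += 1
    return kv_on_gpu, w_on_gpu

# 8 GiB budget, 32 layers, 16 MiB KV per layer, ~780 MiB of weights per
# layer (an assumed figure for a quantized Mixtral):
print(plan_offload(8 * 2**30, 32, 16 * 2**20, 780 * 2**20))  # (32, 9)
```

Under these assumed sizes, the entire KV cache fits on the GPU for the cost of less than one weight layer, which is the crux of the argument.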

(This is an old test I did with prompt processing on a 13b, which has a much larger KV cache.)

Splitting 'Grouped Layers'

In addition, it may be most beneficial to keep as many full experts in VRAM as possible, so that the slowdown applies only to the one or two experts whose layers (or some of whose layers) reside in regular memory.

As I interpret llama.cpp's current behavior, each layer offloaded via -ngl is actually a group of hidden layers spanning all 8 Mixtral experts (my assumption is that -ngl effectively specifies "layer groups").
Wouldn't it be wiser to offload as many full experts as possible plus the KV cache, or would that be a net loss in parallelization efficiency?
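The tradeoff in the question can be made concrete with a toy comparison of the two layouts under one VRAM budget: grouped-layer offload (each offloaded layer carries all 8 experts) versus keeping whole experts resident across every layer. The per-expert size is an assumed placeholder, not a measured Mixtral figure.

```python
# Toy comparison for an 8-expert MoE: offload whole layers (all 8 experts
# per layer) vs. keep whole experts resident across all layers. The
# per-expert weight size is an assumption for illustration.

N_LAYERS, N_EXPERTS = 32, 8
expert_layer_bytes = 90 * 2**20  # one expert's FFN weights in one layer (assumed)

def layers_fit(vram_bytes):
    # grouped-layer offload: each offloaded layer carries all 8 experts
    return min(N_LAYERS, vram_bytes // (N_EXPERTS * expert_layer_bytes))

def experts_fit(vram_bytes):
    # expert-first offload: one expert's weights across all 32 layers
    return min(N_EXPERTS, vram_bytes // (N_LAYERS * expert_layer_bytes))

vram = 12 * 2**30  # 12 GiB budget
print(layers_fit(vram), experts_fit(vram))  # 17 4
```

Same budget either way; the open question from the text is whether 4 fully resident experts beat 17 fully resident layers once routing and parallelization overheads are counted.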

If most of the model's matmul work happens in VRAM, with only one or two stray experts left out, this could greatly benefit overall inference speed, since prompt processing is the number one issue currently plaguing memory-bound users and offloaded inference in general.

7b generation speeds are fast enough on pure CPU for this to make sense to me.
