Possible ideas for speeding up CPU inference with Mixtral (KV cache prioritization) #4518
We can play with these ideas, though I don't expect much. Prioritising the KV layers seems more viable, but it would also increase the host <-> device transfers, so I'm not sure whether there will be a net positive. I don't expect offloading full experts to help, because it seems that each layer chooses an expert with very even probabilities. I haven't done detailed stats on this, so I could be wrong, but we can certainly experiment around this.
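For a rough sense of why evenly distributed routing undermines whole-expert offloading, here is an illustrative back-of-envelope (my own sketch, assuming roughly uniform top-2 routing over Mixtral's 8 experts and 32 layers):

```python
from math import comb

# Probability that a token's top-2 routed experts are all GPU-resident,
# per layer and across all layers, assuming uniform routing (illustrative).
def p_token_stays_on_gpu(k_gpu_experts, n_experts=8, top_k=2, n_layers=32):
    p_layer = comb(k_gpu_experts, top_k) / comb(n_experts, top_k)
    return p_layer, p_layer ** n_layers

for k in (4, 6, 7):
    p_layer, p_all = p_token_stays_on_gpu(k)
    print(f"{k}/8 experts in VRAM: per layer {p_layer:.2f}, all 32 layers {p_all:.1e}")
```

Even with 7 of 8 experts resident at every layer, almost every token still hits a CPU-resident expert somewhere in the stack, which matches the intuition above.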
Possible Implementations
KV Cache / Context prioritization
The KV layers are quite small for Mixtral at moderate context sizes (e.g. 4096 ctx), since it uses Grouped Query Attention, which may make it beneficial to prioritize offloading them rather than splitting the KV cache equally amongst all layers.
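As a rough illustration of how small that is (my own estimate, using Mixtral 8x7B's published config of 32 layers, 8 KV heads, head dim 128, and assuming an fp16 cache):

```python
# Back-of-envelope KV cache size for Mixtral 8x7B with GQA at 4096 ctx.
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_elem, n_ctx = 2, 4096          # fp16 cache

per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem * n_ctx  # K and V
total = per_layer * n_layers
print(f"KV per layer: {per_layer / 2**20:.0f} MiB")  # ~16 MiB
print(f"KV total:     {total / 2**20:.0f} MiB")      # ~512 MiB
```

~512 MiB for the whole cache is small next to the expert weights, so keeping it entirely on the GPU is cheap.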
I may be wrong about this, but considering the split KV layers PR massively improves prompt processing speed in proportion to how many layers are offloaded to the GPU, I think it would be much more viable for MoE setups to offload the smaller GQA KV layers first rather than distributing them evenly.
(This is the old test I did with prompt processing on a 13b, which has a much larger KV cache...)
Splitting 'Grouped Layers'
In addition to this, it may be most beneficial to keep as many full experts in VRAM as possible, so that the slowdown applies only to the one or two experts that have their layers (or some of their layers) in regular memory.
The way I interpret how llama.cpp handles it right now is that each layer you offload via -ngl actually bundles the hidden layers of all 8 Mixtral experts (my assumption is that -ngl effectively specifies 'layer groups').
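A quick parameter count (my own estimate from Mixtral 8x7B's published dimensions: hidden size 4096, FFN size 14336, 8 experts, 32 query / 8 KV heads) shows how heavily each such layer group is dominated by its experts:

```python
# Rough per-layer parameter count for Mixtral 8x7B (illustrative).
hidden, ffn, n_experts = 4096, 14336, 8
head_dim, n_heads, n_kv_heads = 128, 32, 8

attn = (hidden * n_heads * head_dim           # Q projection
        + 2 * hidden * n_kv_heads * head_dim  # K and V (GQA)
        + n_heads * head_dim * hidden)        # output projection
experts = n_experts * 3 * hidden * ffn        # gate/up/down per expert
print(f"attention params per layer: {attn / 1e6:.0f}M")     # ~42M
print(f"expert params per layer:    {experts / 1e9:.2f}B")  # ~1.41B
```

With roughly 97% of each layer group being expert weights, which experts stay resident matters far more for VRAM budgeting than the attention tensors do.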
Wouldn't it be wiser to offload as many full experts as possible + the KV cache, or would you get a net loss in terms of parallelization efficiency?
If most of the model's actual matmul is happening in VRAM except for one or two odd experts, I think this could be greatly beneficial for overall inference speed, since prompt processing is the number one issue that currently plagues memory-bound users and offloaded inference in general.
7b generation speeds are fast enough on pure CPU for this to make sense to me.
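Putting the two ideas together, here is a minimal sketch of the kind of greedy offload policy I mean (hypothetical, not how llama.cpp currently allocates; the byte figures are illustrative placeholders):

```python
# Hypothetical greedy offload planner: put the small GQA KV cache on the
# GPU first, then as many full layer groups (all 8 experts of a layer) as
# the remaining VRAM budget allows. Illustrative only.
def plan_offload(vram_bytes, kv_bytes_per_layer, layer_bytes, n_layers=32):
    kv_on_gpu = full_on_gpu = 0
    for _ in range(n_layers):                 # 1) KV cache first
        if vram_bytes < kv_bytes_per_layer:
            break
        vram_bytes -= kv_bytes_per_layer
        kv_on_gpu += 1
    for _ in range(n_layers):                 # 2) then whole layer groups
        if vram_bytes < layer_bytes:
            break
        vram_bytes -= layer_bytes
        full_on_gpu += 1
    return kv_on_gpu, full_on_gpu

# Example: 16 GiB VRAM, ~16 MiB KV per layer, ~0.8 GiB of weights per
# layer group at a 4-bit-ish quantization (placeholder numbers).
print(plan_offload(16 * 2**30, 16 * 2**20, int(0.8 * 2**30)))
# -> (32, 19): all KV layers plus ~19 full layer groups fit on the GPU.
```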