
Possible ideas for speeding up CPU inference with Mixtral (KV cache prioritization) #4518

Closed
kalomaze opened this issue Dec 18, 2023 · 4 comments
Labels: enhancement (New feature or request), stale

kalomaze (Contributor) commented Dec 18, 2023

Possible Implementations

KV Cache / Context prioritization

The KV cache layers are quite small for Mixtral at moderate context sizes (e.g. 4096 ctx) because it uses Grouped Query Attention, which may make it beneficial to prioritize offloading them rather than splitting the KV cache evenly across all layers.

[image: KV cache layer sizes]
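For a sense of scale, here is a quick back-of-the-envelope estimate of the whole KV cache (just a sketch, assuming the usual Mixtral 8x7B attention config of 32 layers, 8 KV heads via GQA, head dim 128, and an fp16 cache):

```python
# Rough KV cache size estimate for Mixtral 8x7B (assumed config:
# 32 layers, 8 KV heads thanks to GQA, head dim 128, fp16 cache).
n_layers     = 32
n_kv_heads   = 8      # GQA: far fewer than the 32 query heads
head_dim     = 128
bytes_per_el = 2      # fp16
n_ctx        = 4096

kv_per_layer = 2 * n_ctx * n_kv_heads * head_dim * bytes_per_el  # K and V
total_kv     = kv_per_layer * n_layers

print(f"KV per layer: {kv_per_layer / 2**20:.1f} MiB")  # ~16 MiB
print(f"total KV:     {total_kv / 2**20:.1f} MiB")      # ~512 MiB
```

That works out to roughly 16 MiB per layer and about 512 MiB total at 4096 ctx, which is small compared to the expert weights.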

I may be wrong about this, but since the split-KV-layers PR massively improves prompt processing speed in proportion to how many layers are offloaded to the GPU, I think it would be much more viable for MoE setups to prioritize offloading the smaller GQA KV layers first rather than distributing them evenly.

(This is the old test I did with prompt processing on a 13B model, which has a much larger KV cache...)
[image: prompt processing benchmark on a 13B model]

Splitting 'Grouped Layers'

In addition to this, it may be most beneficial to keep as many full experts in VRAM as possible, so that the slowdown only applies to the one or two particular experts whose layers (or some of their layers) live in regular system memory.

The way I interpret llama.cpp's current handling is that each layer you offload via -ngl actually contains the hidden layer weights for all 8 Mixtral experts (my assumption is that -ngl effectively specifies 'layer groups').
Wouldn't it be wiser to offload as many full experts as possible plus the KV cache, or would that be a net loss in terms of parallelization efficiency?
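For reference, here is a rough parameter breakdown of a single Mixtral layer group (a sketch assuming the usual Mixtral 8x7B dimensions of hidden 4096, FFN 14336, 32 query / 8 KV heads, and 8 experts; this is not llama.cpp's exact tensor layout):

```python
# Rough per-layer parameter breakdown for Mixtral 8x7B (assumed dims:
# hidden 4096, FFN 14336, 32 query heads / 8 KV heads, 8 experts).
# Illustrates how much of each offloaded layer group is expert FFN weights
# versus shared attention weights.
hidden, ffn, n_experts = 4096, 14336, 8
head_dim, n_heads, n_kv_heads = 128, 32, 8

attn = hidden * (n_heads * head_dim           # Q projection
                 + 2 * n_kv_heads * head_dim  # K and V projections (GQA)
                 + n_heads * head_dim)        # output projection
ffn_per_expert = 3 * hidden * ffn             # gate, up, down projections
router         = hidden * n_experts           # gating network

layer_total = attn + n_experts * ffn_per_expert + router
print(f"attention:      {attn / 1e6:6.1f} M params")                     # ~  41.9 M
print(f"one expert FFN: {ffn_per_expert / 1e6:6.1f} M params")           # ~ 176.2 M
print(f"all 8 experts:  {n_experts * ffn_per_expert / 1e6:6.1f} M")      # ~1409.3 M
print(f"expert share:   {n_experts * ffn_per_expert / layer_total:.1%}") # ~ 97%
```

By this estimate the 8 expert FFN blocks account for roughly 97% of each offloaded layer group, which is why offloading by whole layers is so coarse for Mixtral.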

If most of the model's actual matmul work happens in VRAM, with only one or two odd experts left out, I think this could be greatly beneficial for overall inference speed, since prompt processing is the number-one issue that currently plagues memory-bound users and offloaded inference in general.

7B generation speeds are fast enough on pure CPU for this to make sense to me.

@kalomaze kalomaze added the enhancement New feature or request label Dec 18, 2023
@kalomaze kalomaze changed the title Possible ideas for speeding up CPU inference with Mixtral Possible ideas for speeding up CPU inference with Mixtral (KV cache prioritization) Dec 18, 2023
ggerganov (Owner) commented:

We can play with these ideas, though I don't expect much.

Prioritising the KV layers seems more viable, but it would also increase the host <-> device transfers, so I'm not sure there will be a net positive.

I don't expect offloading full experts to help because it seems that each layer chooses an expert with very even probabilities. I haven't done detailed stats on this, so I could be wrong.

But for sure we can experiment around this.

kalomaze (Contributor, Author) commented Dec 25, 2023

@ggerganov:

I don't expect offloading full experts to help because it seems that each layer chooses an expert with very even probabilities. I haven't done detailed stats on this, so I could be wrong.

When looking at the expert counts directly, it would seem they are roughly even in aggregate across many tokens, but...

[image: per-token expert usage counts]

There is clearly still a preference for certain experts on a per-token basis (even when combining the counts of how many times an expert was used across all layers for that particular token), which to me implies that generation could be faster overall if more full experts stayed in VRAM.
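(For what it's worth, a tally like the one above can be computed roughly as in the sketch below; `router_logits` is a hypothetical capture of the gating outputs per layer and token, and the helper is made up for illustration.)

```python
# Per-token expert tally: for each token, count how often each expert was
# picked in the top-2 across all layers. `router_logits` is a hypothetical
# array of shape (n_layers, n_tokens, n_experts) captured from the gating
# network; obtaining it is outside the scope of this sketch.
import numpy as np

def per_token_expert_counts(router_logits: np.ndarray, top_k: int = 2) -> np.ndarray:
    n_layers, n_tokens, n_experts = router_logits.shape
    counts = np.zeros((n_tokens, n_experts), dtype=int)
    # indices of the top-k experts for every (layer, token) pair
    topk = np.argsort(router_logits, axis=-1)[..., -top_k:]
    for layer in range(n_layers):
        for tok in range(n_tokens):
            counts[tok, topk[layer, tok]] += 1
    return counts  # each row sums to n_layers * top_k

# e.g. with random logits, every token gets 32 * 2 = 64 expert picks in total
counts = per_token_expert_counts(np.random.randn(32, 16, 8))
print(counts[0], counts[0].sum())
```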

Perhaps instead of trying to keep as many full experts in memory as possible, or keeping 1 layer per expert at a time, we could have a balance between the two offloading strategies.

Right now, it's 1 layer offloaded = 1 layer is offloaded for all experts.

Perhaps it could be: 1 layer offloaded = 1 layer is offloaded for the next [user-specified number] of target experts.
E.g., 1 layer offloaded = 1 layer is offloaded to the next, let's say, 6 experts, and then once all 6 of those experts are fully offloaded, the remaining layers are mapped to the remaining 2.

As it currently stands, you have to offload ~25% of a full 7B per layer specified, which leads to some wasted VRAM (for example, I can offload 13 whole layers while 500 MB+ of dedicated VRAM goes unused; 14 layers is too much).
This would make MoE offloading more granular, and I think it'd be more effective overall.
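A sketch of what that offload ordering could look like (purely hypothetical: plan_offload and its parameters are made up for illustration, and llama.cpp does not currently expose per-expert offloading):

```python
# Hypothetical offload plan for the "N target experts first" idea above.
# The VRAM budget is expressed in "-ngl equivalents", i.e. one budget unit
# corresponds to the 8 expert slices of one layer group.
def plan_offload(n_layers: int, n_experts: int, n_target_experts: int,
                 layer_budget: int) -> list[tuple[int, int]]:
    """Return (layer, expert) pairs to place in VRAM, filling the first
    n_target_experts completely before touching the remaining experts."""
    max_slices = layer_budget * n_experts
    plan = []
    # Pass 1: walk through the layers, offloading only the target experts.
    for layer in range(n_layers):
        for expert in range(n_target_experts):
            if len(plan) >= max_slices:
                return plan
            plan.append((layer, expert))
    # Pass 2: spend whatever budget is left on the remaining experts.
    for layer in range(n_layers):
        for expert in range(n_target_experts, n_experts):
            if len(plan) >= max_slices:
                return plan
            plan.append((layer, expert))
    return plan

# e.g. a budget equivalent to 13 "-ngl layers" (13 * 8 expert slices),
# concentrated on 6 of the 8 experts first:
plan = plan_offload(n_layers=32, n_experts=8, n_target_experts=6, layer_budget=13)
print(len(plan), plan[:4])
```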

github-actions bot commented Mar 18, 2024

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024