
ggml: avoid rebuild of GGML graph for each token (#7456) #8366

Draft: wants to merge 1 commit into master (branch ag_ggml_graph_caching)

Conversation

@agray3 (Contributor) commented Jul 8, 2024

Introduces caching of the GGML graph to avoid an unnecessary full rebuild for each token. KV cache parameters, which change with each token, are updated directly in the cached GGML graph. Caching can be disabled with the GGML_DISABLE_GRAPH_CACHING environment variable.
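To make the mechanism concrete, here is a minimal sketch of the control flow, assuming two placeholder helpers (build_graph() and update_kv_cache_params() are illustrative names, not functions from this PR):

```cpp
#include <cstdlib>
#include "ggml.h"

// Keep the graph built for the previous token and only patch the KV-cache
// offsets that change between tokens, unless caching is disabled via the
// environment variable.
struct graph_cache {
    ggml_cgraph * gf      = nullptr; // graph built for the previous token
    bool          enabled = std::getenv("GGML_DISABLE_GRAPH_CACHING") == nullptr;
};

static ggml_cgraph * get_graph_for_token(graph_cache & cache) {
    if (!cache.enabled || cache.gf == nullptr) {
        cache.gf = build_graph();          // hypothetical: full rebuild, as on master
    } else {
        update_kv_cache_params(cache.gf);  // hypothetical: patch kv_head-dependent offsets in place
    }
    return cache.gf;
}
```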

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jul 8, 2024
@agray3 (Contributor, Author) commented Jul 8, 2024

Here are the BS 1 inference performance improvements I have measured for this optimization (all on Linux):

| Hardware | Llama 7B Q4 | Llama 13B Q4 |
| --- | --- | --- |
| H100-SXM + AMD EPYC 7413 | 6% | 5% |
| H100-PCIe + AMD EPYC 7313 | 4% | 4% |
| A100-PCIe + Intel 4210R | 7% | 5% |
| L40S + AMD EPYC 7763 | 3% | 2% |
| RTX-4090 + AMD Ryzen 3700X | 4% | 3% |
| A40 + Intel 4210R | 5% | 3% |

@slaren @JohannesGaessler @ggerganov this is currently working OK for my local tests but I'd appreciate any further testing from your side to help determine if it needs to be made more robust.

@JohannesGaessler (Collaborator):

Did you check whether with this optimization the performance for large batch sizes becomes better with CUDA graphs enabled?

@agray3 (Contributor, Author) commented Jul 8, 2024

Did you check whether with this optimization the performance for large batch sizes becomes better with CUDA graphs enabled?

No, at the moment both this optimization and CUDA graphs are only activated with batch size 1 (investigating BS > 1 is future work).

@agray3 (Contributor, Author) commented Jul 8, 2024

This is currently causing a segfault for llama-bench - working on it.

@agray3 (Contributor, Author) commented Jul 8, 2024

This is currently causing a segfault for llama-bench - working on it.

Now fixed.

src/llama.cpp (outdated), comment on lines 14628 to 14637:
```cpp
ggml_tensor * tmp_tensor = kv_self.k_l[il];
size_t tmp_offset = (ggml_row_size(kv_self.k_l[il]->type, n_embd_k_gqa))*kv_head;
kv_cache_ptrs.push_back(static_cast<char*>(tmp_tensor->data) + tmp_offset);

tmp_tensor = kv_self.v_l[il];
if (cparams.flash_attn) {
    tmp_offset = (kv_head)*ggml_row_size(kv_self.v_l[il]->type, n_embd_v_gqa);
} else {
    tmp_offset = (kv_head)*ggml_element_size(kv_self.v_l[il]);
}
kv_cache_ptrs.push_back(static_cast<char*>(tmp_tensor->data) + tmp_offset);
```
Collaborator:

I feel like this is going to be hard to adapt to Jamba (#7531) if this assumes alternating ggml_cpy to k_l and v_l in the graph.

(Jamba uses more than only the KV cache; it also has a recurrent state cache, and no simple alternating rule can work with how the layers of Jamba are stacked (some use the KV cache, some don't))

Maybe graph caching would simply be disabled for that kind of model architecture to keep it simpler, but it would be nice to have a more general way to update the cached graph. (Or it could be a special case I can handle along with hybrid models, so there's no need to generalize this first if it's non-trivial.)

Member:

I would also like to see a more general-purpose solution. This is a good PoC to see what the possible gains are, but it is too hacky.

Probably we should profile llama_build_graph() and see if we can brush off any overhead in order to improve the baseline. I think the llm_build_cb could be adding some non-negligible overhead, so it's something to look into. Then maybe the ggml_ calls are adding too much overhead - are we sure those are properly inlined by the compiler? Are there any hotspots that we should try to optimize first?
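As a rough illustration of the kind of measurement being suggested (a sketch only, assuming ggml's microsecond timer; the actual call to llama_build_graph() is elided since it is internal to llama.cpp):

```cpp
#include <cstdio>
#include "ggml.h"

// Bracket one graph build with ggml's timer to see how much of the per-token
// CPU time it accounts for.
const int64_t t_start_us = ggml_time_us();
// ... build the graph for the current ubatch here (llama_build_graph on master) ...
const int64_t t_build_us = ggml_time_us() - t_start_us;
fprintf(stderr, "graph build: %lld us\n", (long long) t_build_us);
```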

Contributor Author:

Thanks both - I agree it would be better to make this more generic, and I'll think about how that can be done. Probably by adding more info to each node at the graph build stage, which can then be checked when updating the KV cache parameters. @ggerganov Note that it's not just llama_build_graph() that is avoided here; ggml_backend_sched_split_graph() and ggml_backend_sched_alloc_splits() are avoided too - see the profile at #7456. Across all these CPU activities the profile is fairly flat, so I haven't found any easy overhead optimizations that would have a large effect.

Contributor Author:

@compilade I have now improved identification of K and V nodes - it is now cleaner and no longer assumes that they alternate. Can you please check and let me know whether or not this would now work with your case? Thanks!

Contributor Author:

I have now improved identification of K and V nodes - it is now cleaner and no longer assumes that they alternate

I realised that I accidentally left in some stale code related to the previous interleaving, now removed.

Collaborator:

This seems better, but instead of using a new flag in ggml_tensor, would it be simpler to check if node->src[1]->name of a GGML_OP_CPY node matches "k_cache_view-%d" or "v_cache_view-%d", or "cache_k_l%d (view)" or "cache_v_l%d (view)"? (The last two are the automatic names which ggml_view uses when given k_l or v_l. The prefixes might be enough, to avoid matching the layer number, but maybe this would be too many string comparisons?)
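A sketch of what that name-based check could look like (an assumption based on the suggestion above, not code from this PR):

```cpp
#include <cstring>
#include "ggml.h"

// Identify KV-cache copy nodes by the name of the destination view (src[1] of a
// GGML_OP_CPY node) instead of adding a new flag to ggml_tensor.
static bool is_kv_cache_cpy(const struct ggml_tensor * node) {
    if (node->op != GGML_OP_CPY || node->src[1] == nullptr) {
        return false;
    }
    const char * name = node->src[1]->name;
    return std::strncmp(name, "k_cache_view", 12) == 0 ||  // cb(..., "k_cache_view", il)
           std::strncmp(name, "v_cache_view", 12) == 0 ||
           std::strncmp(name, "cache_k_l",     9) == 0 ||  // automatic "(view)" names from ggml_view
           std::strncmp(name, "cache_v_l",     9) == 0;
}
```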

Can you please check and let me know whether or not this would now work with your case?

Since this currently requires explicitly adding a new flag where k_l and v_l are ggml_cpy-ed to, this could work with Mamba (which on master uses k_l and v_l to store the recurrent states) as long as the relevant flags are also added there (which isn't yet done at the time of writing, although it would conflict with #8526 anyway).

But the approach used in abaf0c6 still assumes that a particular k_l or v_l tensor is never used more than once, which isn't true in #8526, where the logic to auto-defrag the recurrent states separates the non-modified writes from the writes with modified states and so uses each of k_l and v_l twice, for different ranges.

The other thing to keep in mind is that the approach should ideally be easily extensible to more than only k_l and v_l, since #7531 adds r_l and s_l (and uses all four, because they have different sizes, but skips some of them for some layers, which won't work with the current k_cache_ptrs and v_cache_ptrs approach, which assumes they are all used no less and no more than once¹).

What I'd like is for this to:

  • Allow multiple uses of ggml_cpy per layer per k_l or v_l tensor
  • Allow omitting some per-layer states for some layers
  • Make it possible to relatively easily add new types of per-layer states

Footnotes

  1. Note that, unlike the multiple-uses case, zero uses will be easier to fix: when building k_cache_ptrs, for example, not including the layers where hparams.n_head_kv(il) is 0 would fix this, I think, at least for Jamba.

Contributor Author:

instead of using a new flag in ggml_tensor, would it be simpler to check if node->src[1]->name

Great idea - I didn't realize this info was encoded in the name. I've now used this approach to further simplify things by extracting the layer id from the name, which allows direct updates of the k and v parameters rather than the previous 2-stage solution. I think this now addresses your other points:

Allow multiple uses of ggml_cpy per layer per k_l or v_l tensor
Allow omitting some per-layer states for some layers

I think these should both now work OK with the new approach.

Make it possible to relatively easily add new types of per-layer states

This should now be straightforward by augmenting the part where the k and v parameters are updated.
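A minimal sketch of that name-based update (an assumed shape of the code, not the PR's exact implementation; update_cached_k and update_cached_v are hypothetical helpers):

```cpp
#include <cstdio>

// For a GGML_OP_CPY node in the cached graph, recover the layer index from the
// destination view's name (e.g. "k_cache_view-17") and patch that layer's K or V
// destination with the offset for the current kv_head.
int il = -1;
if (std::sscanf(node->src[1]->name, "k_cache_view-%d", &il) == 1) {
    update_cached_k(node, il, kv_head);  // hypothetical helper
} else if (std::sscanf(node->src[1]->name, "v_cache_view-%d", &il) == 1) {
    update_cached_v(node, il, kv_head);  // hypothetical helper (V layout differs with/without flash attention)
}
```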

Collaborator:

@agray3 What I see in 3241b3d is better than I expected, especially with extracting the layer number from the name. Nice work! 👍

It definitely now seems like it should work both when there are multiple ggml_cpy per K or V tensor per layer and when not all layers use them. The only thing left regarding recurrent models is to add some cb in build_mamba so that the target views are called k_cache_view and v_cache_view over there too.

@mofosyne added the Review Complexity: Medium label (generally requires more time to grok but manageable by beginner to medium expertise level) on Jul 13, 2024
@Nexesenex (Contributor):

@agray3: Any follow-up on this PR? I had to revert it in my own LlamaCPP to keep up with the recent commits, which I regret given the performance boost it gave me back then.

@agray3 (Contributor, Author) commented Oct 10, 2024

@agray3: Any follow-up on this PR? I had to revert it in my own LlamaCPP to keep up with the recent commits, which I regret given the performance boost it gave me back then.

See the comment by Georgi above - #8366 (comment). It's not obvious to me that there's any way to adapt this patch to make it acceptable for merging, so it remains a POC for now.

@Nexesenex (Contributor):

@agray3: Well, your POC is working. Would you consider maintaining it, so it can be merged without conflicts with current master for those who want to use it? The perf boost is a no-brainer for me and, I suppose, for some other folks.

@agray3 force-pushed the ag_ggml_graph_caching branch from d9c7b61 to 23214c9 on October 11, 2024 at 10:24
@agray3 (Contributor, Author) commented Oct 11, 2024

@agray3: Well, your POC is working. Would you consider maintaining it, so it can be merged without conflicts with current master for those who want to use it? The perf boost is a no-brainer for me and, I suppose, for some other folks.

Sure, I have now rebased it on the latest master branch. Let me know if/when it has conflicts again. Glad it is useful.

@Nexesenex (Contributor):

Fantastic, @agray3. TYVM.

@Nexesenex (Contributor):

Hey @agray3. I think the PR needs to be adjusted again for the LCPP refactors made since your last update!

@gaugarg-nv (Contributor):

@ggerganov @slaren @agray3 I'm interested in reducing the CPU overhead associated with building the GGML graph and would like to follow up on this PR. In particular, I'd like to brainstorm ideas to make this PR more generic and broadly applicable.

One of the main concerns appears to be the special handling of the KV cache when reusing a cached graph. A potential solution is to introduce a specialized copy operator for the KV cache that fuses the functionality of ggml_view_1d and ggml_cpy into a single operation. This new operator would compute the appropriate offset within the KV cache for inserting the new token, then copy the new KV token to that offset.

The offset in the KV cache can be calculated using the inp_pos tensor, which is already an input to the graph and is used by the ROPE operator.

This approach would also eliminate the need for pointer indirection introduced in PR #9017 to manage the KV cache when CUDA graph mode is enabled. With this change, neither the GGML nor the CUDA graph would need to be modified during the decode phase.

Do you think this idea is worth pursuing? Any other ideas on how I can make this PR acceptable?
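To make the proposal concrete, a hypothetical API for such a fused operator might look as follows (illustrative only; no such function exists in ggml, and the name is invented here):

```cpp
// Fused "view + copy" for the KV cache: the destination offset is derived inside
// the op from a tensor that is already a graph input (inp_pos in this proposal),
// so neither the GGML graph nor a captured CUDA graph has to change per decode step.
struct ggml_tensor * ggml_cpy_kv(
        struct ggml_context * ctx,
        struct ggml_tensor  * src,   // new K or V values for the current token(s)
        struct ggml_tensor  * dst,   // full KV-cache buffer (cache_k_l / cache_v_l)
        struct ggml_tensor  * pos);  // position tensor used to derive the insertion offset
```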

@ggerganov (Member) commented Jun 1, 2025

The way that I think we should implement this is as follows:

  • Make sure that the topology of the graph (i.e. tensors, their shapes and the view offsets) is fully determined by the shapes of the input tensors
  • Given that, before constructing a full graph, we would first build the input tensors and check if their shapes are the same as the ones we had for the previous graph. If they are the same, then don't build a new graph and reuse the previous graph (a sketch of this check follows the list)
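A sketch of what the check in the second point could look like (an assumption, not existing llama.cpp code):

```cpp
#include <vector>
#include "ggml.h"

// Reuse the previous graph only if every graph input has the same shape as in the
// previous ubatch; otherwise fall back to a full rebuild.
static bool can_reuse_graph(const std::vector<ggml_tensor *> & prev_inputs,
                            const std::vector<ggml_tensor *> & cur_inputs) {
    if (prev_inputs.size() != cur_inputs.size()) {
        return false;
    }
    for (size_t i = 0; i < prev_inputs.size(); ++i) {
        if (!ggml_are_same_shape(prev_inputs[i], cur_inputs[i])) {
            return false;
        }
    }
    return true;
}
```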

To achieve that, the only thing that is missing for most models is to make the KV cache storing and loading a function of the input tensors. Specifically, this is what currently prevents it:

Storing:

llama.cpp/src/llama-graph.cpp, lines 1276 to 1280 in e57bb87:

```cpp
// store to KV cache
{
    ggml_build_forward_expand(gf, kv_state->cpy_k(ctx0, k_cur, il));
    ggml_build_forward_expand(gf, kv_state->cpy_v(ctx0, v_cur, il));
}
```

Loading:

llama.cpp/src/llama-graph.cpp, lines 1285 to 1286 in e57bb87:

```cpp
ggml_tensor * k = kv_state->get_k(ctx0, il);
ggml_tensor * v = kv_state->get_v(ctx0, il);
```

The second part (the loading) is easy to fix. We just have to create the K and V view tensors at the beginning and mark them as inputs (see llm_graph_input_i). Since their shape depends on n_kv, which is typically padded to 256, the shape of these views will change only once every 256 ubatches in text-generation cases.

The first part however is more difficult to fix. This is because the K and V view tensors in which we store the new KV cache are currently a function of the KV head. For example, here is the K view:

```cpp
ggml_tensor * k_view = ggml_view_1d(ctx, k,
        n_tokens*hparams.n_embd_k_gqa(il),
        ggml_row_size(k->type, hparams.n_embd_k_gqa(il))*head_cur);
```

The offset argument of the view is ggml_row_size(k->type, hparams.n_embd_k_gqa(il))*head_cur. Even though the shape of the view is constant for ubatches with the same number of tokens, the offset changes for every ubatch because of the different head_cur.

One way to fix that is to extend the ggml_cpy operation to take an optional offset from a single-element I64 tensor and move the head_cur offset to that tensor. We would then mark this I64 tensor as graph input. But this might be very hacky, so I am wondering if there is something better we can do.
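A sketch of what that could look like on the build side (an assumption; ggml_cpy on master takes no offset tensor, and ggml_cpy_offset is an invented name):

```cpp
// The per-ubatch head_cur offset lives in a 1-element I64 tensor that is marked
// as a graph input and refilled before each decode, so the graph topology never changes.
ggml_tensor * k_off = ggml_new_tensor_1d(ctx, GGML_TYPE_I64, 1);
ggml_set_input(k_off); // set each ubatch to ggml_row_size(k->type, hparams.n_embd_k_gqa(il))*head_cur

// hypothetical ggml_cpy variant that adds the offset read from k_off to the
// destination view before copying:
ggml_build_forward_expand(gf, ggml_cpy_offset(ctx, k_cur, k, k_off));
```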

In any case, the KV head changes are relevant only for models that use a KV cache. For embedding models such as BERT, we can already start implementing the logic for reusing the previous graph in llama_context::encode().

@gaugarg-nv (Contributor) commented Jun 1, 2025

Thanks @ggerganov for the quick response.

This is exactly what I was proposing above:

"A potential solution is to introduce a specialized copy operator for the KV cache that fuses the functionality of ggml_view_1d and ggml_cpy into a single operation. This new operator would compute the appropriate offset within the KV cache for inserting the new token, then copy the new KV token to that offset."

One way to fix that is to extend the ggml_cpy operation to take an optional offset from a single-element I64 tensor and move the head_cur offset to that tensor. We would then mark this I64 tensor as graph input. But this might be very hacky, so I am wondering if there is something better we can do.

My proposal above was to calculate the offset using the inp_pos tensor, which is already an input to the graph and is used by the ROPE operator. Do you think using inp_pos to calculate offset makes sense?

@ggerganov (Member):

Do you think using inp_pos to calculate offset makes sense?

No, the inp_pos tensor contains the positions of the tokens in the sequence, while the head is the offset in the memory buffer. With the unified KV cache, these two are completely independent, and one cannot be used to compute the other.

@compilade (Collaborator) commented Jun 1, 2025

Do you think using inp_pos to calculate offset makes sense?

Not all models use inp_pos (e.g. recurrent models don't). Also, the head of the self-attention unified KV cache doesn't necessarily depend on token positions (this is only a coincidence in single-user/single-sequence inference). In multi-user/multi-sequence inference, the inp_pos is unrelated to the KV cache offset.

The dynamic copy operator you're describing sounds a lot like ggml_get_rows, but for copying values into an existing tensor: kind of like a ggml_set_rows, where the I32 tensor would hold destination row indices instead of source ones. I don't really know if non-consecutive indices would be used in practice (except maybe to support non-contiguous KV cache slots¹), so a simpler ggml_cpy-like operator with a destination offset coming from an integer tensor would be sufficient (not sure about the exact API it should have).

Footnotes

  1. This could also simplify the recurrent state cache, which currently has to keep track of the defrag done with ggml_get_rows and inp_s_copy at each ubatch, although arguably it might be better to keep the written slots contiguous.
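A hypothetical signature for such an operator, mirroring the existing ggml_get_rows (illustrative only; not claiming this API exists in ggml):

```cpp
// Scatter-style counterpart to ggml_get_rows: write each row of `src` into `dst`
// at the row index given by the corresponding entry of `rows`. The index tensor
// would be a graph input, so the graph need not change per ubatch.
struct ggml_tensor * ggml_set_rows(
        struct ggml_context * ctx,
        struct ggml_tensor  * dst,    // e.g. cache_k_l viewed as rows
        struct ggml_tensor  * src,    // new rows to store (one per token in the ubatch)
        struct ggml_tensor  * rows);  // integer tensor of destination row indices
```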

@ggerganov (Member):

kind of like a ggml_set_rows

Very good idea - this might be exactly what we need.

@gaugarg-nv (Contributor) commented Jun 1, 2025

Thanks, this makes sense.

Do we need a specialized function to handle the transposed V cache, or will ggml_set_rows be enough? See this part of the code:

```cpp
// note: the V cache is transposed when not using flash attention
v_view = ggml_view_2d(ctx, v, n_tokens, hparams.n_embd_v_gqa(il),
        (v->ne[1])*ggml_element_size(v),
        (head_cur)*ggml_element_size(v));
v_cur = ggml_transpose(ctx, v_cur);
```

Update: Never mind, I realized this can be easily handled by creating a view that sets the appropriate strides and uses the V cache as column-major.
