Investigate PagedAttention KV-cache memory management for faster inference #1955
Comments
llama.cpp currently only ever serves one user at a time so this optimization is not applicable. |
I assume it would be useful if we want to host the models and have an interface like chat.openai.com? |
Yes, for enterprise use where you have one server generating responses for many users in parallel the optimization would be useful. |
Oh I wasn't aware this was exclusively for a client-server application, that explains why they measure performance in requests/sec 🥲 |
This optimization is still applicable, as it can reduce the VRAM usage of the KV tensors. |
If we do end up building this for server use, and I think that would be a good idea, then this paging system would be very useful. |
I read through the blog and the code. It turns out that paged attention is a way to manage memory so that the compute kernel doesn't require the KV cache to be contiguous. This makes it possible for one prompt's KV blocks to be shared and appended to by multiple outputs' KVs (see the sketch below). This is super helpful if your prompt is long and you need to produce multiple outputs. It is a purely engineering trick: the change is mainly in how we manage the KV cache in VRAM. On the CPU this is even simpler to implement (roughly list vs. vector). |
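To make the idea concrete, here is a minimal sketch of that indirection (not vLLM's or llama.cpp's actual code; all names and the block size are made up): each sequence keeps a small block table mapping logical token positions to fixed-size physical KV blocks, so two outputs generated from the same prompt can point at the same prompt blocks and only append blocks of their own.

```cpp
#include <vector>

constexpr int BLOCK_TOKENS = 16;          // tokens per physical KV block (assumed)

// Logical token position -> (physical block, offset inside the block).
struct Slot { int block; int offset; };

struct Sequence {
    std::vector<int> block_table;         // physical block ids, in logical order
    int n_tokens = 0;

    Slot slot(int pos) const {            // where token `pos`'s K/V lives
        return { block_table[pos / BLOCK_TOKENS], pos % BLOCK_TOKENS };
    }
};

// Two outputs continuing the same prompt: they share the prompt's blocks
// verbatim and diverge only by appending their own fresh blocks.
Sequence continue_from_prompt(const Sequence & prompt, int new_block) {
    Sequence s = prompt;                  // copies only the small block table
    s.block_table.push_back(new_block);   // fresh block for generated tokens
    return s;
}
```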
We allocate all the KV memory required for the maximum context length on startup in one block, so we shouldn't have any fragmentation either. |
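For scale, here is a rough back-of-the-envelope of how large that one preallocated block is for a 13B-class model (a sketch with assumed shapes and context length, not llama.cpp's actual allocation code):

```cpp
#include <cstdio>

int main() {
    // Assumed LLaMA-13B-like shapes, fp16 cache, worst-case contiguous allocation.
    const long n_layer = 40;      // transformer layers
    const long n_embd  = 5120;    // hidden dimension
    const long n_ctx   = 2048;    // maximum context length reserved up front
    const long bytes   = 2;       // fp16 per element

    const long kv_bytes = 2 /* K and V */ * n_layer * n_ctx * n_embd * bytes;
    printf("preallocated KV cache: %.2f GiB\n",
           kv_bytes / (1024.0 * 1024.0 * 1024.0));
    // ~1.56 GiB is reserved per sequence even if the actual prompt is short;
    // that over-reservation is what paged allocation is meant to avoid.
    return 0;
}
```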
@JohannesGaessler Is serving multiple users concurrently or batch inference on the roadmap of llama.cpp? |
I don't have any plans for it because I don't care about commercial use but I can't speak for the other devs. |
Should it not be on the list? Today we are talking about chatbots; in 6 months or so, people will start looking for autonomous agents. Would it not make sense to build a system that can process multiple requests simultaneously and efficiently? |
Yeah, I think first we need to solve batch inference. It's implemented in babyllama, but I haven't tried to port it over to the main llama yet. |
I'm not really concerned with what other people want to use llama.cpp for. I'm implementing things that are useful for me personally first and foremost. And I don't see how I would benefit from batched inference since I only run llama.cpp for myself on my own hardware. |
That's fair. Batch inference would be useful for me to use this at scale, for example sentiment analysis over a large dataset or summarisation at scale. And in that case, having a server handle multiple users at the same time would help too. |
I have a comparison of the PyTorch implementations with and without paging on a single GPU, and the gains are significant. My use case is primarily batch inference, so I am not sure about model serving. With a 40 GB A100 GPU, inference on a Vicuna-13B model without paged attention produces 20 tokens/sec; with paged attention the speedup is almost 10x. Obviously this is a bit skewed because our workload uses the same initial prompt prefix in a batch-inference setting, so there may be good reuse of the KV cache, which is helped by PagedAttention. |
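For context, the reuse described here is usually implemented by letting several sequences reference the same physical KV blocks for the shared prefix and copying a block only when a sequence needs to write into it. Below is a minimal sketch of that bookkeeping, with invented names rather than vLLM's real data structures:

```cpp
#include <vector>

// One physical KV block holds a fixed number of tokens' keys/values.
struct BlockPool {
    std::vector<int> ref_count;           // per physical block
    std::vector<int> free_list;           // indices of unused blocks

    explicit BlockPool(int n_blocks) : ref_count(n_blocks, 0) {
        for (int i = n_blocks - 1; i >= 0; --i) free_list.push_back(i);
    }
    int alloc() {                         // assumes a free block is available
        int b = free_list.back(); free_list.pop_back();
        ref_count[b] = 1; return b;
    }
    void retain(int b)  { ++ref_count[b]; }
    void release(int b) { if (--ref_count[b] == 0) free_list.push_back(b); }
};

// A sequence is just its block table: a list of physical block indices.
using BlockTable = std::vector<int>;

// Fork a sequence that shares the prompt's blocks instead of copying them.
BlockTable fork_sequence(BlockPool & pool, const BlockTable & parent) {
    for (int b : parent) pool.retain(b);
    return parent;                        // same physical blocks, higher ref counts
}

// Before writing new tokens into the last block, copy it if it is shared.
void make_last_block_writable(BlockPool & pool, BlockTable & table) {
    int last = table.back();
    if (pool.ref_count[last] > 1) {       // copy-on-write
        int fresh = pool.alloc();
        // (real code would also copy the K/V data of `last` into `fresh`)
        pool.release(last);
        table.back() = fresh;
    }
}
```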
Thanks Vikash. You mentioned in another thread that there may be some misunderstanding in this thread about how vLLM works. Could you please explain what you meant? Also, there have been other comments about its effect on CPU, GPU, and Mac M1/M2 GPU performance. Could you or someone else shed some light on that? |
From what I understand, this isn't so much about the multi-user/client-server use case as it is about batched inference, which does seem to be a valid use case even for single-user/local apps, depending on the workload. |
Wouldn’t the decreased memory requirement (they state that they cut 55% memory usage) be positive when running inference on smaller devices like phones and laptops as well? |
Should be useful if there's a large context. |
Both vLLM and lmDeploy have high throughput batch-inference modes with various tricks. Problem is they don't support GGUF. How complex would it be to port those tricks (KV cache paging, dynamic batching) to llama.cpp? |
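On the dynamic-batching half of that question, the core scheduling loop is conceptually small. The sketch below is only an illustration of the idea with hypothetical types, not a proposal for llama.cpp's actual server: finished sequences leave the batch between decode steps and waiting requests join, so the GPU keeps working on a full batch instead of waiting for the longest request.

```cpp
#include <deque>
#include <vector>

struct Request { int id; bool done = false; /* prompt, sampling params, ... */ };

// Placeholder for one decode step over the in-flight batch: a real server
// would run the model here and set r.done when a sequence finishes.
void decode_step(std::vector<Request> & batch) {
    for (auto & r : batch) { r.done = true; }   // stub so the loop terminates
}

// Continuous batching: requests join and leave between decode steps, so
// short requests never wait for the longest request in their batch.
void serve(std::deque<Request> & waiting, int max_batch) {
    std::vector<Request> batch;
    while (!waiting.empty() || !batch.empty()) {
        // Admit new requests whenever a batch slot (and its KV blocks) is free.
        while ((int) batch.size() < max_batch && !waiting.empty()) {
            batch.push_back(waiting.front());
            waiting.pop_front();
        }
        decode_step(batch);   // generate one token for every active sequence
        // Retire finished sequences immediately, freeing their slots.
        std::erase_if(batch, [](const Request & r) { return r.done; });
    }
}
```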
#2813 only covers "same prompt, multiple output", not "multiple prompt, multiple output". |
Would like to voice my support for this. Over at the KoboldAI community we have had requests for multi-user support, and it would also help our Horde platform, which currently benefits from TGI's speed, but TGI has poor output for us compared to llama.cpp. Having llama.cpp be fast for these use cases means multiple communities would begin using it as a general-purpose inference server, which would be a cool addition to the project (once multiple requests can be queued up). |
I think this feature is important for making llama.cpp usage spread even more. |
Which one would be easier? Porting performance/throughput tricks into llama.cpp or porting GGUF support into vLLM? (lmDeploy is out of the picture, since they don't want to support GGUF. They closed the feature request / suggestion ticket, since they want to concentrate on other things.) |
IMO, implementing the same idea inside llama.cpp is much better. Currently, vLLM leverages a PyTorch extension to customize the attention kernel. One benefit of llama.cpp is that it gets rid of PyTorch and is more friendly to edge deployment. We could consider porting the kernels in vLLM into llama.cpp. It would probably require a certain amount of refactoring in llama.cpp, though. |
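As a rough illustration of what such a port would touch (a sketch with invented names only, not llama.cpp's real data structures): once the KV cache is paged, the attention code can no longer assume a sequence's keys sit in one contiguous buffer, so the score computation has to gather them through the block table.

```cpp
#include <cmath>
#include <vector>

constexpr int BLOCK_TOKENS = 16;   // tokens per physical KV block (assumed)

// Dot products of one query head against keys that live in scattered blocks.
// kv_blocks[b] points at BLOCK_TOKENS keys of dimension head_dim, back to back.
std::vector<float> paged_qk_scores(const std::vector<float> & q,
                                   const std::vector<const float *> & kv_blocks,
                                   const std::vector<int> & block_table,
                                   int n_tokens, int head_dim) {
    std::vector<float> scores(n_tokens);
    const float scale = 1.0f / std::sqrt((float) head_dim);
    for (int pos = 0; pos < n_tokens; ++pos) {
        // Gather the key for this logical position via the block table.
        const float * k = kv_blocks[block_table[pos / BLOCK_TOKENS]]
                        + (pos % BLOCK_TOKENS) * head_dim;
        float dot = 0.0f;
        for (int d = 0; d < head_dim; ++d) dot += q[d] * k[d];
        scores[pos] = dot * scale;
    }
    return scores;
}
```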
Where is the KVCacheManager implemented? Is it on the GPU or the host (CPU)? |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
很多的一项优化，居然没有人想集成进来！！！ (Such a significant optimization, and yet no one wants to integrate it!!!) |
Please discuss in English here. Also, could you please elaborate which feature, as of today, no one wants to integrate? |
Worth re-opening? It can help reduce GPU memory usage, and I think it's time to start work on it. |
New research just came out on using a technique inspired by kernel virtual memory and pages to manage the KV cache.
Results? Way faster inference!
https://vllm.ai/
They claim up to 24x the throughput (measured in requests handled per second) compared to huggingface's transformers library
How?
Inference is bottlenecked by memory, most notably the KV cache. They say the KV cache's most notable features are that it is large and dynamic: its size depends on the sequence length, which varies a lot and is hard to predict, so existing systems end up wasting 60-80% of this memory through fragmentation and over-reservation.
PagedAttention is an alternative approach to managing the KV cache which is inspired by virtual memory, pages, and blocks. By allocating the space dynamically with this approach, only up to about 4% of memory is wasted, instead of the aforementioned 60-80%.
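A quick back-of-the-envelope comparison of the two waste modes, using made-up but representative numbers (block size, context length, and sequence length are all assumptions): with paged allocation only the last block of a sequence can be partially filled, whereas a worst-case contiguous preallocation leaves everything beyond the actual sequence length idle.

```cpp
#include <cstdio>

int main() {
    const int block_tokens = 16;    // assumed paged block size
    const int n_ctx        = 2048;  // reserved length with contiguous allocation
    const int seq_len      = 1000;  // how long the sequence actually got

    // Paged: only the last block can be partially filled.
    int paged_waste = (seq_len % block_tokens == 0)
                    ? 0 : block_tokens - (seq_len % block_tokens);

    // Contiguous worst-case preallocation: everything past seq_len is idle.
    int contiguous_waste = n_ctx - seq_len;

    printf("paged waste:      %d of %d slots (%.1f%%)\n",
           paged_waste, seq_len + paged_waste,
           100.0 * paged_waste / (seq_len + paged_waste));
    printf("contiguous waste: %d of %d slots (%.1f%%)\n",
           contiguous_waste, n_ctx, 100.0 * contiguous_waste / n_ctx);
    return 0;
}
```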
For further details, refer to their website and GitHub.