Port of self extension to server #5104
Conversation
@ggerganov I haven't had the time to refactor the code to build a llama API for it, like you mentioned in the issue. But I can do this. |
Thanks @Maximilian-Winter for your work. I'll check it. |
Have found a problem with prompt caching even when self extend isn't enabled. Will fix it asap. |
@Maximilian-Winter cool |
I have found a problem with the KV cache @Maximilian-Winter |
The server with self-extend, combined with prompt caching, is a use case for RAG with no need for semantic search and/or a vector store (see the sketch below). |
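As a rough illustration of that use case (my own sketch, not part of this PR), the snippet below asks two questions over the same long document via the server's `/completion` endpoint with `cache_prompt` enabled, so the shared document prefix can be reused from the prompt cache instead of being re-evaluated. The host/port, file name and prompt wording are assumptions made for the example; the JSON fields follow the server's documented API of the time.

```python
# Sketch of "RAG without a vector store": keep the whole document in the prompt
# and let the server's prompt cache reuse the shared prefix across questions.
# Assumes a llama.cpp server is running locally, e.g. started with self-extend flags.
import requests

SERVER = "http://127.0.0.1:8080"  # assumed default host/port

def ask(document: str, question: str) -> str:
    # The document is a fixed prefix, so "cache_prompt" lets the server reuse
    # its KV cache for it on subsequent requests.
    prompt = f"{document}\n\nQuestion: {question}\nAnswer:"
    resp = requests.post(
        f"{SERVER}/completion",
        json={"prompt": prompt, "n_predict": 128, "cache_prompt": True},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["content"]

if __name__ == "__main__":
    doc = open("long_document.txt").read()  # hypothetical file, several thousand tokens
    print(ask(doc, "What does the author conclude?"))
    # Second question over the same document: only the new suffix needs evaluation.
    print(ask(doc, "Which dataset is mentioned?"))
```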
What is the status of this? |
@ggerganov Will fix this today, sorry for the delay. |
@ggerganov Prompt caching should work as before without self extend. |
Before this gets merged, can you update the server README / docs? [Update] It would also be good to add lines in the server's print usage. |
@K-Mistele Added descriptions to readme and server print usage, but I'm not sure if my descriptions are totally correct. |
Ok I broke something in self extend. Will fix this now. Sorry. |
@ggerganov I double checked everything and now even prompt caching works with self extend enabled. Maybe you can take a look at my last commit, which added prompt caching, and tell me if it is as you intended. |
@K-Mistele I updated the descriptions to make them easier to read and consistent with the other parameters. |
You need to fill the context up. It is 32768 and you pass just 6852 tokens, generating 17 new tokens. One way to test this is to set, for example, ... |
Ok, I thought doing it like this would be enough: start llama.cpp/server -m neural-chat-7b-v3-3.Q8_0.gguf -c 0 -ngl 33 -b 1024 -t 8 |
@ggerganov Can you tell me if that is the correct way of doing this? Because I'm not able to trigger the error: start llama.cpp/server -m neural-chat-7b-v3-3.Q8_0.gguf -c 6860 -ngl 33 -b 1024 -t 8 |
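For clarity, a small sketch of the arithmetic behind the suggestion above (my own illustration, not from the PR): the context shift / self-extend path is only exercised when the prompt tokens plus the requested new tokens exceed the `-c` context size, so you either pass a much longer prompt or shrink `-c` below that total.

```python
# Back-of-the-envelope check (illustrative only): will a request overflow the context?
def overflows(n_ctx: int, n_prompt: int, n_predict: int) -> bool:
    return n_prompt + n_predict > n_ctx

# The case quoted above: 6852 prompt tokens + 17 generated vs. -c 32768
print(overflows(32768, 6852, 17))  # False -> the context never fills up
# With -c 6860 and the same prompt: 6852 + 17 = 6869 > 6860
print(overflows(6860, 6852, 17))   # True  -> context shift / self-extend is exercised
```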
@Maximilian-Winter the goal is to process a larger prompt than the value specified by `-c`. |
brb git bisecting and making sure it's actually this PR |
with ...
this is what a working context shift looks like: |
Ok, sorry for accusing you. It is in fact not this PR!
a1d6df1 is the first bad commit, which makes no sense o.o |
I am restarting the bisect and extending the testing period |
@Green-Sky I only recently started using the llama.cpp server instead of the llama-cpp-python bindings, so most things are relatively new to me. I just saw that self extend was added to main, looked at the code, and transferred the same logic to the server, because it looked like an easy change. (And it was relatively easy after understanding the server processing pipeline.) |
@Green-Sky Could you pinpoint the issue? |
@Maximilian-Winter sorry for the delay. The bug:
which leads me to believe that the context shift leaks context cache somehow. I am currently looking into which commit causes the changed bug behavior, and after that I will skip the ... |
Ok thanks. I've noticed these issues, but I haven't yet looked into what the root cause is. |
@Green-Sky Thanks for checking that; I thought all day that I had made a crucial mistake when implementing self extend. If I can help with anything, let me know! |
I stand corrected; it looks like the "at least one successful shift" is not necessary, just more likely in practice. |
Okay, so this option does not do what I thought it did, then. Where can I learn more about how this works and what appropriate values for it would be? I have just been setting it to the model's training context size. |
@K-Mistele Allow me to cite ggerganov: "Next, given that the original training context of the model is T (let's assume T = 2048), you want to set G >= 8192 / T, so in this case: --grp-attn-n 4 or --grp-attn-n 8. The --grp-attn-w corresponds to W from the paper. I think the authors generally used 512, but I think you can go up to T/2 - so in this case --grp-attn-w 1024. Additionally, G has to be a multiple of W." Hope this helps. |
This is fantastic, thank you! very helpful. |
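To make the quoted rule of thumb concrete, here is a small sketch (my own illustration based only on the comment above, not an official formula) that picks `--grp-attn-n` (G) and `--grp-attn-w` (W) from the training context T and the target context size:

```python
import math

def self_extend_params(train_ctx: int, target_ctx: int, w: int = 0) -> tuple:
    """Illustrative only: G >= target_ctx / train_ctx, W around 512 and at most
    train_ctx / 2, per the rule of thumb quoted above (larger G, e.g. 2*G, also works)."""
    g = max(1, math.ceil(target_ctx / train_ctx))  # smallest G with G * T >= target
    if w == 0:
        w = min(512, train_ctx // 2)               # paper default 512, capped at T/2
    return g, w

# Example from the quote: T = 2048, target 8192 -> G = 4 (or 8), W up to 1024
print(self_extend_params(2048, 8192))           # (4, 512)
print(self_extend_params(2048, 8192, w=1024))   # (4, 1024)
# The PR author's test below: T = 4096, -c 16384 -> G = 4, with W = 2048 = T/2
print(self_extend_params(4096, 16384, w=2048))  # (4, 2048)
```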
Hi, I have anecdotal evidence of how good this group attention extension is. Using a regular model with a large context (built-in context is 32k) on a text of maybe about 3000 tokens, it will deteriorate when creating a summary, but using self extension it can do it. So this is a real breakthrough for the open source community. |
I did more digging and it appears the first commit that should work with context shift is indeed working: 57dd55e (modified log level of "input truncated"). I am not 100% sure the behavior is correct/as expected; it is kinda hard to tell. |
Could you give me some sample ... |
@Maximilian-Winter @Green-Sky Please take a look at #5195 and see if the context shift issues are resolved and if self-extend still functions as expected |
What happens if I use a 4096-context model that has RoPE scaling built in and I use ...? For example, using a command like ... The model card says: ...
Does this mean both self extend and Rope scaling will be applied? What would the expected outcome be here? It seems like if I explicitly enable self-extend, then Rope should be disabled. |
@Maximilian-Winter it looks like ... |
* Ported self extension to server example
* Update server.cpp
* Fixed prompt caching without self extend
* Update server.cpp
* Added description to server readme.
* Update server.cpp
* Update server.cpp
* Update server.cpp
* Update server.cpp
* Update README.md
* Changed descriptions
* server : formatting
* Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update server.cpp
* Update server.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The author of the self-extend paper dropped what he considers to be a better version of the empirical formula in my Twitter replies; would it be possible to update the implementation? https://x.com/serendip410/status/1782957763997401553 |
Hi, I ported the code for self extension over to the server. I have tested it with information retrieval: I inserted information out of context into a ~6500-token-long text and it worked, at least with one slot. I tested multiple requests, one after the other, and it gives the same or similar results as main (I used a random seed). I'm not sure if everything is correct for use with multiple slots, because I can't really test this on my machine.
I tested with solar-10.7b-instruct-v1.0.Q5_K_M.gguf (4096 trained context) and the settings -c 16384 and --grp-attn-n 4 --grp-attn-w 2048
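For reference, a hypothetical sketch of the kind of retrieval check described above (not the author's actual test script): it buries an out-of-context fact in a long filler text and asks the server to retrieve it via `/completion`. The needle sentence, filler file and host/port are invented for the example; the JSON fields follow the server API.

```python
# Hypothetical needle-in-a-haystack check against a server started e.g. with:
#   ./server -m solar-10.7b-instruct-v1.0.Q5_K_M.gguf -c 16384 --grp-attn-n 4 --grp-attn-w 2048
import requests

NEEDLE = "The secret passphrase is 'blue-falcon-42'."  # invented fact to retrieve

def build_prompt(filler_path: str) -> str:
    filler = open(filler_path).read()  # hypothetical file with ~6500 tokens of unrelated text
    half = len(filler) // 2
    # Bury the needle roughly in the middle of the long text
    return f"{filler[:half]}\n{NEEDLE}\n{filler[half:]}\n\nWhat is the secret passphrase?"

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": build_prompt("filler.txt"), "n_predict": 32, "temperature": 0.0},
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["content"])  # should contain 'blue-falcon-42' if long-context retrieval works
```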