Reduce memory usage and allocate enough memory for largest context #473
Conversation
loading the 30B q4_1 immediately fails.
edit:
running 30B q4_1 now and piping a large text file into it. gonna take a while.
edit: in gdb with -O1 -g
It now allocates all memory upfront; before, it did not.
thought that too, even with -O1, but have to check again.
@rabidcopy Should be fixed now. Btw, memory usage is higher on your plot now, because the memory for the entire context is pre-allocated at the start to make sure there is enough of it. But if you compare the old version with a fully generated context, it uses more memory than the new version.
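For reference, a rough back-of-the-envelope for why the pre-allocated figure is higher at startup: the KV cache for the entire context is sized and allocated once, up front. This is not the actual llama.cpp code, and the parameter values are only illustrative (roughly 7B-sized):

```c
// Minimal sketch (not the actual llama.cpp code): size the KV cache for the
// full context up front so it can be allocated once at startup.
#include <stddef.h>
#include <stdio.h>

// Hypothetical model parameters, roughly 7B-sized, for illustration only.
struct model_hparams {
    int n_ctx;    // maximum context length
    int n_layer;  // number of transformer layers
    int n_embd;   // embedding dimension
};

// One K and one V vector of n_embd elements per token, per layer.
static size_t kv_cache_bytes(const struct model_hparams *hp, size_t elem_size) {
    return (size_t)2 * hp->n_layer * hp->n_ctx * hp->n_embd * elem_size;
}

int main(void) {
    struct model_hparams hp = { .n_ctx = 2048, .n_layer = 32, .n_embd = 4096 };
    // f16 elements -> 2 bytes each
    printf("KV cache for full context: %.1f MiB\n",
           kv_cache_bytes(&hp, 2) / (1024.0 * 1024.0));
    return 0;
}
```

With these illustrative numbers that is about 1 GiB reserved at startup, which previously would only be reached once the context actually filled up.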
btw, @ggerganov thoughts on using some thread pooling in ggml? I think spawning threads for every eval places a lower bound on speed per eval, especially on Windows.
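Purely to illustrate the suggestion, here is a minimal persistent worker pool with pthreads. This is not ggml's threading code and all names are made up; the point is just that thread creation is paid once at startup rather than on every eval:

```c
// Minimal sketch of a persistent thread pool (NOT ggml's implementation):
// workers are created once and reused for every eval, so the per-eval cost is
// waking/sleeping the workers instead of thread creation and teardown.
#include <pthread.h>

#define N_WORKERS 4

typedef void (*job_fn)(int worker_id, void *arg);

static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;

static job_fn  cur_job = NULL;
static void   *cur_arg = NULL;
static int     job_gen = 0;  // incremented once per submitted job
static int     n_done  = 0;  // workers that finished the current job

static void *worker(void *p) {
    const int id = (int)(long)p;
    int seen_gen = 0;
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (job_gen == seen_gen) {
            pthread_cond_wait(&cv, &mtx);  // sleep until a new job arrives
        }
        seen_gen = job_gen;
        job_fn fn = cur_job;
        void *arg = cur_arg;
        pthread_mutex_unlock(&mtx);

        fn(id, arg);  // do this worker's share of the eval

        pthread_mutex_lock(&mtx);
        if (++n_done == N_WORKERS) {
            pthread_cond_signal(&done);
        }
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

// Called once at startup: the thread creation cost is paid only here.
void pool_start(void) {
    for (long i = 0; i < N_WORKERS; i++) {
        pthread_t t;
        pthread_create(&t, NULL, worker, (void *)i);
        pthread_detach(t);
    }
}

// Called once per eval: wake the workers and wait until all have finished.
void pool_run(job_fn fn, void *arg) {
    pthread_mutex_lock(&mtx);
    cur_job = fn;
    cur_arg = arg;
    n_done  = 0;
    job_gen++;
    pthread_cond_broadcast(&cv);
    while (n_done < N_WORKERS) {
        pthread_cond_wait(&done, &mtx);
    }
    pthread_mutex_unlock(&mtx);
}
```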
Ok, my run was successful 🎉, except it still has the unrelated, but I think often-reported, run-past-context-size bug in interactive mode.
@Green-Sky I've made experiments with thread pools, but couldn't make it work better than the existing implementation.
Thanks, this is next on the todo list to fix.
FWIW, I also have been seeing @Green-Sky’s error above when generating the full
Ah, that makes sense. Whoops.
@j-f1 I will also add an option to prefix the last half of the context with the initial prompt, which I think is necessary to make the chat bot not forget its main instructions. When we have this added, we will finally have an infinite chat that never crashes.
Definitely. Other projects I've seen that incorporate chat bot features with personalities always pass the initial prompt so the bot remembers how it should behave. This would be really cool to have coupled with infinite output.
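A rough sketch of that idea, with a hypothetical helper (this is not the actual implementation): when the context fills up, keep the initial prompt tokens in place, carry over the most recent half of the context, and continue generating from there.

```c
// Rough sketch of the "keep the initial prompt, reuse the last half of the
// context" idea. All names are illustrative; this is not the real code.
// Assumes the initial prompt is shorter than half the context (n_prompt < n_ctx/2).
#include <stddef.h>
#include <string.h>

void context_swap(int *tokens, int *n_tokens, int n_ctx, int n_prompt) {
    if (*n_tokens < n_ctx) {
        return;  // still room in the context, nothing to do
    }

    // Recent tokens to carry over: the last half of the context, minus the
    // slots reserved for the initial prompt.
    const int n_recent = n_ctx / 2 - n_prompt;
    const int start    = *n_tokens - n_recent;

    // The initial prompt stays at positions [0, n_prompt); move the most
    // recent tokens right after it, then re-evaluate and keep generating.
    memmove(tokens + n_prompt, tokens + start, (size_t)n_recent * sizeof(int));

    *n_tokens = n_prompt + n_recent;
}
```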
- `ggml` scratch buffers to reduce memory usage (see Reduce memory usage during Whisper inference whisper.cpp#431 for more info); a rough illustration follows this list
- Disable BLAS for matrix multiplications where `src0` is quantized. In such cases, we allocate too much memory and the performance is not really better
- `struct llama_kv_cache`
- `--mtest` argument for running a memory test in the worst-case scenario (i.e. max tokens, max batch size, etc.). Will be moved to a separate program
- These prepare for the introduction of `llama_state`, which in the future will hold the KV cache for each separate decoder
- Need help with running the larger models with `-c 2048` to see if they work OK
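As a rough illustration of the scratch buffer item above (the exact API details may differ from what this change adds): intermediate tensors for each layer are allocated from a fixed, reusable scratch area instead of the main context, so the same memory is recycled across layers.

```c
// Hedged sketch of scratch buffer usage; details of the real API may differ.
#include <stddef.h>
#include "ggml.h"

void build_layer(struct ggml_context *ctx, void *scratch_data, size_t scratch_size) {
    // Route subsequent tensor allocations into the scratch buffer.
    ggml_set_scratch(ctx, (struct ggml_scratch) {
        .offs = 0,
        .size = scratch_size,
        .data = scratch_data,
    });

    // ... create the layer's intermediate tensors here; their memory lives in
    //     the scratch area and is reused when the next layer is built ...

    // Switch back to the default allocator for tensors that must persist,
    // such as the KV cache or the final logits.
    ggml_set_scratch(ctx, (struct ggml_scratch) { 0, 0, NULL });
}
```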