
Reduce memory usage and allocate enough memory for largest context #473

Merged: 6 commits merged into master from mem-fix on Mar 24, 2023

Conversation

ggerganov (Owner) commented Mar 24, 2023

  • Utilize ggml scratch buffers to reduce memory usage (see whisper.cpp#431, "Reduce memory usage during Whisper inference", for more info; a usage sketch follows this list)
  • Disable BLAS for matrix multiplications where src0 is quantized. In such cases we allocate too much memory and the performance is not really better
  • Move the KV memory into a new struct llama_kv_cache
  • Switch to an F16 KV cache by default
  • Add a --mtest argument for running a memory test in the worst-case scenario (i.e. max tokens, max batch size, etc.). This will be moved to a separate program
  • Print the required memory usage at startup, plus the memory needed for separate decoders
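As a rough illustration of the scratch-buffer pattern from the first bullet above (a sketch only: the buffer sizes are arbitrary placeholders and the ops are stand-ins, not the actual llama.cpp graph):

// Sketch of the ggml scratch-buffer pattern. Tensors created while a scratch
// buffer is set are allocated from that buffer instead of the main context,
// so intermediates from different layers can reuse the same memory.
#include "ggml.h"
#include <cstdint>
#include <vector>

static std::vector<uint8_t> buf_scratch0(512u*1024*1024); // placeholder size
static std::vector<uint8_t> buf_scratch1(512u*1024*1024); // placeholder size

struct ggml_tensor * build_layer(struct ggml_context * ctx0, struct ggml_tensor * cur) {
    // intermediate tensors go into scratch buffer 0
    ggml_set_scratch(ctx0, { 0, buf_scratch0.size(), buf_scratch0.data() });
    struct ggml_tensor * tmp = ggml_relu(ctx0, cur); // stand-in op

    // the next group of intermediates goes into scratch buffer 1
    ggml_set_scratch(ctx0, { 0, buf_scratch1.size(), buf_scratch1.data() });
    struct ggml_tensor * out = ggml_silu(ctx0, tmp); // stand-in op

    // disable scratch so the final result is allocated from the main context
    ggml_set_scratch(ctx0, { 0, 0, nullptr });
    return ggml_dup(ctx0, out); // stand-in for the layer output
}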

These changes prepare for the introduction of llama_state, which in the future will hold the KV cache for each separate decoder.
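The new llama_kv_cache struct looks roughly like this (an approximation; the exact definition is in llama.cpp):

struct llama_kv_cache {
    struct ggml_tensor * k;    // keys for all layers, pre-allocated for the full context
    struct ggml_tensor * v;    // values for all layers

    struct ggml_context * ctx; // dedicated ggml context backing k and v

    std::vector<uint8_t> buf;  // memory buffer for ctx

    int n; // number of tokens currently in the cache
};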

I need help with running the larger models with -c 2048 to see if they work OK.

Green-Sky (Collaborator) commented Mar 24, 2023

Loading the 30B q4_1 model immediately fails.

p model.type
$1 = MODEL_UNKNOWN
#8  llama_model_load (vocab_only=false, memory_type=GGML_TYPE_F16, n_parts=4, n_ctx=<optimized out>, lctx=..., fname="models/30B/ggml-model-q4_1.bin") at llama.cpp:491
491	            MEM_REQ_SCRATCH0.at(model.type) +

Edit: n_layer = 60
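For illustration, the crash is consistent with model.type never being set from the hyperparameters, so the map lookup for the scratch-buffer size fails on MODEL_UNKNOWN. A hypothetical reconstruction (the placeholder sizes below are illustrative, not the actual values in llama.cpp):

#include <cstddef>
#include <map>

enum e_model { MODEL_UNKNOWN, MODEL_7B, MODEL_13B, MODEL_30B, MODEL_65B };

// placeholder sizes; the real table lives in llama.cpp
static const std::map<e_model, size_t> MEM_REQ_SCRATCH0 = {
    { MODEL_7B,  512u*1024*1024 },
    { MODEL_13B, 512u*1024*1024 },
    { MODEL_30B, 512u*1024*1024 },
    { MODEL_65B, 512u*1024*1024 },
};

// LLaMA model sizes can be told apart by n_layer (7B=32, 13B=40, 30B=60, 65B=80)
static e_model model_type_from_n_layer(int n_layer) {
    switch (n_layer) {
        case 32: return MODEL_7B;
        case 40: return MODEL_13B;
        case 60: return MODEL_30B;
        case 80: return MODEL_65B;
        default: return MODEL_UNKNOWN; // MEM_REQ_SCRATCH0.at() on this key throws
    }
}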

Green-Sky (Collaborator) commented Mar 24, 2023

Running 30B q4_1 now and piping a large text file into it. It's going to take a while.

llama_model_load: loading model from 'models/30B/ggml-model-q4_1.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 2048
llama_model_load: n_embd  = 6656
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot   = 128
llama_model_load: f16     = 3
llama_model_load: n_ff    = 17920
llama_model_load: n_parts = 4
llama_model_load: type    = 3
llama_model_load: ggml ctx size = 26389.16 MB
llama_model_load: mem required  = 28693.16 MB (+ 3124.00 MB per state)
llama_model_load: loading model part 1/4 from 'models/30B/ggml-model-q4_1.bin'
llama_model_load: ................................................................... done
llama_model_load: model size =  5819.56 MB / num tensors = 543
llama_model_load: loading model part 2/4 from 'models/30B/ggml-model-q4_1.bin.1'
llama_model_load: ................................................................... done
llama_model_load: model size =  5819.56 MB / num tensors = 543
llama_model_load: loading model part 3/4 from 'models/30B/ggml-model-q4_1.bin.2'
llama_model_load: ................................................................... done
llama_model_load: model size =  5819.56 MB / num tensors = 543
llama_model_load: loading model part 4/4 from 'models/30B/ggml-model-q4_1.bin.3'
llama_model_load: ................................................................... done
llama_model_load: model size =  5819.56 MB / num tensors = 543
llama_init_from_file: kv self size  = 3120.00 MB

Edit: this is in gdb with -O1 -g.
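For reference, the reported kv self size matches what the hyperparameters in the log imply, assuming the usual KV-cache layout of one K and one V tensor per layer with n_ctx × n_embd F16 elements each:

#include <cstdio>

int main() {
    // hyperparameters from the log above
    const long long n_layer   = 60;
    const long long n_ctx     = 2048;
    const long long n_embd    = 6656;
    const long long elem_size = 2; // bytes per F16 element

    const long long kv_bytes = 2 /* K and V */ * n_layer * n_ctx * n_embd * elem_size;
    printf("kv self size = %.2f MB\n", kv_bytes / (1024.0 * 1024.0)); // prints 3120.00 MB
    return 0;
}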

rabidcopy (Contributor) commented Mar 24, 2023

This looks promising, though I'm not able to use --memory_f32 with -c 2048 even on a 7B model, while before I could easily do that on 13B with the temporary hardcoded 2048 memory buffer size, without needing --memory_f16. Memory usage in my case appears to have gone up, oddly. The 13B weights on my 16 GB RAM system aren't usable with this PR; they begin to eat up swap space and hang my system, where previously they didn't.
[screenshot: RAM usage graph for the two 7B runs]
The first RAM graph peak is current master running ./main -m '[7B-model]' -c 2048 --color -t 6 --temp 2 --top_k 30 --top_p 0.18 --repeat_last_n 500 --repeat_penalty 1.15 -p "write a poem about cats" -b 6 -n 256 -s 5 --memory_f16 (can drop --memory_f16).
The second peak is this PR running ./main -m '[7B-model]' -c 2048 --color -t 6 --temp 2 --top_k 30 --top_p 0.18 --repeat_last_n 500 --repeat_penalty 1.15 -p "write a poem about cats" -b 6 -n 256 -s 5 (can't do --memory_f32).
The output of both runs is identical, with a 10 ms per token speed difference, the PR being faster.

Then another test with the 13B weights: the first peak is master without --memory_f16, the second peak is this PR without --memory_f32.

[screenshot: RAM usage graph for the two 13B runs]

Green-Sky (Collaborator)

hang my system, where previously they didn't.

It now allocates all of the memory upfront; before, it did not.

The PR being faster.

I thought that too, even with -O1, but I have to check again.

ggerganov (Owner, Author) commented Mar 24, 2023

@rabidcopy --memory_f32 should be fixed now.

Btw, memory usage is higher on your plot now because the memory for the entire context is pre-allocated at the start to make sure there is enough of it. But if you compare the old version with a fully generated context, it will use more memory than the new version.

Green-Sky (Collaborator)

Btw, @ggerganov, any thoughts on using some thread pooling in ggml? I think this places a lower bound on speed per eval, especially on Windows.

Green-Sky (Collaborator)

OK, my run was successful 🎉, except it still hits the unrelated, but I think often reported, bug of running past the context size in interactive mode.

Thread 1 "main" received signal SIGSEGV, Segmentation fault.
0x0000555555570019 in ggml_element_size (tensor=0x7ff9246740f0) at ggml.c:2555
2555	    return GGML_TYPE_SIZE[tensor->type];
(gdb) up
#1  0x000055555557dbb6 in llama_eval_internal (lctx=..., tokens=<optimized out>, n_tokens=1, n_past=2049, n_threads=<optimized out>) at llama.cpp:866

n_past=2049
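Not the fix itself, but the failure mode is consistent with evaluating at a position beyond the pre-allocated KV cache. A hypothetical guard (not part of this PR) would turn the segfault into an error:

#include <cstdio>

// Hypothetical guard: refuse to evaluate once the requested positions would
// run past the pre-allocated context window.
static bool check_ctx_window(int n_past, int n_tokens, int n_ctx) {
    if (n_past + n_tokens > n_ctx) {
        fprintf(stderr, "error: n_past (%d) + n_tokens (%d) exceeds n_ctx (%d)\n",
                n_past, n_tokens, n_ctx);
        return false;
    }
    return true;
}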

ggerganov (Owner, Author)

@Green-Sky I've experimented with thread pools, but couldn't make them work better than the existing implementation.

OK, my run was successful 🎉, except it still hits the unrelated, but I think often reported, bug of running past the context size in interactive mode.

Thanks, this is next on the todo list to fix.

j-f1 (Collaborator) commented Mar 24, 2023

FWIW, I have also been seeing @Green-Sky’s error above when generating the full n_ctx (e.g. 512) tokens of output and then attempting to start the generation over again.

rabidcopy (Contributor)

@rabidcopy --memory_f32 should be fixed now.

Btw, memory usage is higher on your plot now because the memory for the entire context is pre-allocated at the start to make sure there is enough of it. But if you compare the old version with a fully generated context, it will use more memory than the new version.

Ah, that makes sense. Whoops.

ggerganov (Owner, Author)

@j-f1
I am planning to implement the "swap" idea from this comment: #71 (comment)

I will also add an option to prefix the last half of the context with the initial prompt, which I think is necessary so the chat bot does not forget its main instructions.

When this is added, we will finally have an infinite chat that never crashes.
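A rough sketch of what such a context "swap" could look like (hypothetical helper, not the actual implementation; int stands in for llama_token): keep the first n_keep tokens, i.e. the initial prompt, plus the most recent half of the rest, then re-evaluate the kept tail so the KV cache is rebuilt at the new positions.

#include <vector>

// Hypothetical context "swap": assumes 0 <= n_keep <= ctx_tokens.size().
std::vector<int> swap_context(const std::vector<int> & ctx_tokens, int n_keep) {
    const int n      = (int) ctx_tokens.size();
    const int n_left = n - n_keep;

    // keep the initial prompt
    std::vector<int> out(ctx_tokens.begin(), ctx_tokens.begin() + n_keep);
    // plus the most recent half of the remaining tokens
    out.insert(out.end(), ctx_tokens.end() - n_left/2, ctx_tokens.end());
    // the caller re-evaluates these tokens to rebuild the KV cache
    // before continuing generation
    return out;
}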

ggerganov merged commit 7a9b6c3 into master on Mar 24, 2023
ggerganov deleted the mem-fix branch on March 24, 2023 at 21:17
rabidcopy (Contributor)

which I think is necessary so the chat bot does not forget its main instructions.

Definitely. Other projects I've seen that incorporate chat-bot features with personalities always pass the initial prompt so the bot remembers how it should behave. This would be really cool to have coupled with infinite output.
