
Reduce memory usage and allocate enough memory for largest context #473

Merged: 6 commits merged into master from mem-fix on Mar 24, 2023

Conversation

ggerganov (Owner) commented Mar 24, 2023

  • Utilize ggml scratch buffers to reduce memory usage (see whisper.cpp#431, "Reduce memory usage during Whisper inference", for more info; a usage sketch follows this list)
  • Disable BLAS for matrix multiplications where src0 is quantized. In such cases we allocate too much memory and the performance is not really better
  • Move the KV memory into a new struct llama_kv_cache
  • Switch to an F16 KV cache by default
  • Add a --mtest argument for running a memory test in the worst-case scenario (i.e. max tokens, max batch size, etc.). This will be moved to a separate program
  • Print the required memory usage at startup, plus the memory needed for separate decoders
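As a rough illustration of the scratch-buffer pattern from the first bullet above (a sketch only: the buffer sizes are arbitrary placeholders and the ops are stand-ins, not the actual llama.cpp graph):

// Sketch of the ggml scratch-buffer pattern. Tensors created while a scratch
// buffer is set are allocated from that buffer instead of the main context,
// so intermediates from different layers can reuse the same memory.
#include "ggml.h"
#include <cstdint>
#include <vector>

static std::vector<uint8_t> buf_scratch0(512u*1024*1024); // placeholder size
static std::vector<uint8_t> buf_scratch1(512u*1024*1024); // placeholder size

struct ggml_tensor * build_layer(struct ggml_context * ctx0, struct ggml_tensor * cur) {
    // intermediate tensors go into scratch buffer 0
    ggml_set_scratch(ctx0, { 0, buf_scratch0.size(), buf_scratch0.data() });
    struct ggml_tensor * tmp = ggml_relu(ctx0, cur); // stand-in op

    // the next group of intermediates goes into scratch buffer 1
    ggml_set_scratch(ctx0, { 0, buf_scratch1.size(), buf_scratch1.data() });
    struct ggml_tensor * out = ggml_silu(ctx0, tmp); // stand-in op

    // disable scratch so the final result is allocated from the main context
    ggml_set_scratch(ctx0, { 0, 0, nullptr });
    return ggml_dup(ctx0, out); // stand-in for the layer output
}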

These changes prepare for the introduction of llama_state, which in the future will hold the KV cache for each separate decoder.
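The new llama_kv_cache struct looks roughly like this (an approximation; the exact definition is in llama.cpp):

struct llama_kv_cache {
    struct ggml_tensor * k;    // keys for all layers, pre-allocated for the full context
    struct ggml_tensor * v;    // values for all layers

    struct ggml_context * ctx; // dedicated ggml context backing k and v

    std::vector<uint8_t> buf;  // memory buffer for ctx

    int n; // number of tokens currently in the cache
};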

I need help with running the larger models with -c 2048 to see if they work OK.

Green-Sky (Collaborator) commented Mar 24, 2023

Loading the 30B q4_1 model immediately fails.

p model.type
$1 = MODEL_UNKNOWN
#8  llama_model_load (vocab_only=false, memory_type=GGML_TYPE_F16, n_parts=4, n_ctx=<optimized out>, lctx=..., fname="models/30B/ggml-model-q4_1.bin") at llama.cpp:491
491	            MEM_REQ_SCRATCH0.at(model.type) +

Edit: n_layer = 60
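For illustration, the crash is consistent with model.type never being set from the hyperparameters, so the map lookup for the scratch-buffer size fails on MODEL_UNKNOWN. A hypothetical reconstruction (the placeholder sizes below are illustrative, not the actual values in llama.cpp):

#include <cstddef>
#include <map>

enum e_model { MODEL_UNKNOWN, MODEL_7B, MODEL_13B, MODEL_30B, MODEL_65B };

// placeholder sizes; the real table lives in llama.cpp
static const std::map<e_model, size_t> MEM_REQ_SCRATCH0 = {
    { MODEL_7B,  512u*1024*1024 },
    { MODEL_13B, 512u*1024*1024 },
    { MODEL_30B, 512u*1024*1024 },
    { MODEL_65B, 512u*1024*1024 },
};

// LLaMA model sizes can be told apart by n_layer (7B=32, 13B=40, 30B=60, 65B=80)
static e_model model_type_from_n_layer(int n_layer) {
    switch (n_layer) {
        case 32: return MODEL_7B;
        case 40: return MODEL_13B;
        case 60: return MODEL_30B;
        case 80: return MODEL_65B;
        default: return MODEL_UNKNOWN; // MEM_REQ_SCRATCH0.at() on this key throws
    }
}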

Green-Sky (Collaborator) commented Mar 24, 2023

Running 30B q4_1 now and piping a large text file into it. It's going to take a while.

llama_model_load: loading model from 'models/30B/ggml-model-q4_1.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 2048
llama_model_load: n_embd  = 6656
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot   = 128
llama_model_load: f16     = 3
llama_model_load: n_ff    = 17920
llama_model_load: n_parts = 4
llama_model_load: type    = 3
llama_model_load: ggml ctx size = 26389.16 MB
llama_model_load: mem required  = 28693.16 MB (+ 3124.00 MB per state)
llama_model_load: loading model part 1/4 from 'models/30B/ggml-model-q4_1.bin'
llama_model_load: ................................................................... done
llama_model_load: model size =  5819.56 MB / num tensors = 543
llama_model_load: loading model part 2/4 from 'models/30B/ggml-model-q4_1.bin.1'
llama_model_load: ................................................................... done
llama_model_load: model size =  5819.56 MB / num tensors = 543
llama_model_load: loading model part 3/4 from 'models/30B/ggml-model-q4_1.bin.2'
llama_model_load: ................................................................... done
llama_model_load: model size =  5819.56 MB / num tensors = 543
llama_model_load: loading model part 4/4 from 'models/30B/ggml-model-q4_1.bin.3'
llama_model_load: ................................................................... done
llama_model_load: model size =  5819.56 MB / num tensors = 543
llama_init_from_file: kv self size  = 3120.00 MB

Edit: this is in gdb with -O1 -g.
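For reference, the reported kv self size matches what the hyperparameters in the log imply, assuming the usual KV-cache layout of one K and one V tensor per layer with n_ctx × n_embd F16 elements each:

#include <cstdio>

int main() {
    // hyperparameters from the log above
    const long long n_layer   = 60;
    const long long n_ctx     = 2048;
    const long long n_embd    = 6656;
    const long long elem_size = 2; // bytes per F16 element

    const long long kv_bytes = 2 /* K and V */ * n_layer * n_ctx * n_embd * elem_size;
    printf("kv self size = %.2f MB\n", kv_bytes / (1024.0 * 1024.0)); // prints 3120.00 MB
    return 0;
}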

rabidcopy (Contributor) commented Mar 24, 2023

This looks promising, though I'm not able to use --memory_f32 with -c 2048 even on a 7B model, while before I could easily do that on 13B with the temporary hardcoded 2048 memory buffer size, without needing --memory_f16. Memory usage in my case appears to have gone up, oddly. The 13B weights on my 16 GB RAM system aren't usable with this PR; they begin to eat up swap space and hang my system, where previously they didn't.
[screenshot: RAM usage graph for the two 7B runs]
The first RAM graph peak is current master running ./main -m '[7B-model]' -c 2048 --color -t 6 --temp 2 --top_k 30 --top_p 0.18 --repeat_last_n 500 --repeat_penalty 1.15 -p "write a poem about cats" -b 6 -n 256 -s 5 --memory_f16 (can drop --memory_f16).
The second peak is this PR running ./main -m '[7B-model]' -c 2048 --color -t 6 --temp 2 --top_k 30 --top_p 0.18 --repeat_last_n 500 --repeat_penalty 1.15 -p "write a poem about cats" -b 6 -n 256 -s 5 (can't do --memory_f32).
The output of both runs is identical, with a 10 ms per token speed difference, the PR being faster.

Then another test with the 13B weights: the first peak is master without --memory_f16, the second peak is this PR without --memory_f32.

[screenshot: RAM usage graph for the two 13B runs]

Green-Sky (Collaborator)

hang my system, where previously they didn't.

It now allocates all of the memory upfront; before, it did not.

The PR being faster.

I thought that too, even with -O1, but I have to check again.

ggerganov (Owner, Author) commented Mar 24, 2023

@rabidcopy --memory_f32 should be fixed now.

Btw, memory usage is higher on your plot now because the memory for the entire context is pre-allocated at the start to make sure there is enough of it. But if you compare the old version with a fully generated context, it will use more memory than the new version.

Green-Sky (Collaborator)

Btw, @ggerganov, any thoughts on using some thread pooling in ggml? I think this places a lower bound on speed per eval, especially on Windows.

Green-Sky (Collaborator)

OK, my run was successful 🎉, except it still hits the unrelated, but I think often reported, bug of running past the context size in interactive mode.

Thread 1 "main" received signal SIGSEGV, Segmentation fault.
0x0000555555570019 in ggml_element_size (tensor=0x7ff9246740f0) at ggml.c:2555
2555	    return GGML_TYPE_SIZE[tensor->type];
(gdb) up
#1  0x000055555557dbb6 in llama_eval_internal (lctx=..., tokens=<optimized out>, n_tokens=1, n_past=2049, n_threads=<optimized out>) at llama.cpp:866

n_past=2049
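Not the fix itself, but the failure mode is consistent with evaluating at a position beyond the pre-allocated KV cache. A hypothetical guard (not part of this PR) would turn the segfault into an error:

#include <cstdio>

// Hypothetical guard: refuse to evaluate once the requested positions would
// run past the pre-allocated context window.
static bool check_ctx_window(int n_past, int n_tokens, int n_ctx) {
    if (n_past + n_tokens > n_ctx) {
        fprintf(stderr, "error: n_past (%d) + n_tokens (%d) exceeds n_ctx (%d)\n",
                n_past, n_tokens, n_ctx);
        return false;
    }
    return true;
}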

ggerganov (Owner, Author)

@Green-Sky I've experimented with thread pools, but couldn't make them work better than the existing implementation.

OK, my run was successful 🎉, except it still hits the unrelated, but I think often reported, bug of running past the context size in interactive mode.

Thanks, this is next on the todo list to fix.

j-f1 (Collaborator) commented Mar 24, 2023

FWIW, I have also been seeing @Green-Sky’s error above when generating the full n_ctx (e.g. 512) tokens of output and then attempting to start the generation over again.

rabidcopy (Contributor)

@rabidcopy --memory_f32 should be fixed now.

Btw, memory usage is higher on your plot now because the memory for the entire context is pre-allocated at the start to make sure there is enough of it. But if you compare the old version with a fully generated context, it will use more memory than the new version.

Ah, that makes sense. Whoops.

ggerganov (Owner, Author)

@j-f1
I am planning to implement the "swap" idea from this comment: #71 (comment)

I will also add an option to prefix the last half of the context with the initial prompt, which I think is necessary so the chat bot does not forget its main instructions.

When this is added, we will finally have an infinite chat that never crashes.
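A rough sketch of what such a context "swap" could look like (hypothetical helper, not the actual implementation; int stands in for llama_token): keep the first n_keep tokens, i.e. the initial prompt, plus the most recent half of the rest, then re-evaluate the kept tail so the KV cache is rebuilt at the new positions.

#include <vector>

// Hypothetical context "swap": assumes 0 <= n_keep <= ctx_tokens.size().
std::vector<int> swap_context(const std::vector<int> & ctx_tokens, int n_keep) {
    const int n      = (int) ctx_tokens.size();
    const int n_left = n - n_keep;

    // keep the initial prompt
    std::vector<int> out(ctx_tokens.begin(), ctx_tokens.begin() + n_keep);
    // plus the most recent half of the remaining tokens
    out.insert(out.end(), ctx_tokens.end() - n_left/2, ctx_tokens.end());
    // the caller re-evaluates these tokens to rebuild the KV cache
    // before continuing generation
    return out;
}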

ggerganov merged commit 7a9b6c3 into master on Mar 24, 2023
ggerganov deleted the mem-fix branch on March 24, 2023 at 21:17
rabidcopy (Contributor)

which I think is necessary so the chat bot does not forget its main instructions.

Definitely. Other projects I've seen that incorporate chat-bot features with personalities always pass the initial prompt so the bot remembers how it should behave. This would be really cool to have coupled with infinite output.
